Abstract—We consider a grid-connected solar microgrid system which contains local consumers, solar photovoltaic (PV) systems, loads and a battery. The consumer, as an agent, continuously interacts with the environment and learns to take optimal actions through a model-free reinforcement learning algorithm, namely Q-learning. The aim of the agent is to optimally schedule the battery so as to increase the utility of the battery and the solar photovoltaic system, and thereby to pursue the long-term objective of reducing the power consumption from the grid. Multiple agents sense the states of the environment components and make collective decisions about how to respond to randomness in load and intermittent solar power by using a multi-agent reinforcement learning algorithm, namely Coordinated Q-learning (CQ-learning). Each agent learns to optimize individually and to contribute to global optimization. The grid power consumed when a solar PV system operates individually using Q-learning is compared with the operation of many such solar PV systems in a distributed environment using CQ-learning, and it is shown that the grid power requirement is considerably lower with CQ-learning than with Q-learning. Simulation results using real numerical data are presented as a reliability test of the system.

Index Terms—Solar microgrid; Reinforcement learning; Q-learning; CQ-learning; Battery scheduling; Distributed optimization.

I. INTRODUCTION

Electricity is a crucial input to propel the socio-economic growth of a nation. Over the past decade, the electricity generation, transmission and distribution landscape has changed significantly. The open electricity market, the penetration of large-scale renewable generation and similar developments have further increased the complexity of the traditional electricity grid in providing secure and reliable supply with quality. The key element driving this change is the emergence of the smart grid. Renewable energy plays a significant role in building a green and sustainable environment, and solar and wind are widely regarded as the solution to the growing energy crisis in the world [1]. The smart grid paradigm represents a transition towards intelligent, digitally enhanced, two-way power delivery grids. The aim of the smart grid is to promote and enhance the efficient management and operation of the power generation and delivery facilities by incorporating advanced communications, information technology, automation and control methodologies into the power grid. A microgrid is a group of interconnected loads and distributed energy resources within clearly defined electrical boundaries that acts as a single controllable entity with respect to the grid; the microgrid is the building block of the smart grid.

Integrating renewable energy in the microgrid is the way forward for economic and environmental optimization, generating clean and green energy and thereby providing a solution to global warming [2]. A microgrid can connect to and disconnect from the grid, which enables it to operate in both grid-connected and island modes. Microgrids can provide quantifiable benefits for utilities, their customers and energy service companies, regardless of ownership. Solar photovoltaic (PV) system uptake by customers is increasing, due to incentives, improved performance and lower cost. Large utility customers increasingly seek reliability, autonomy, cost certainty and resilience. The importance of having more reliable, efficient, smart systems is attracting more public attention. In the present scenario the consumer wants not only intelligent but smart machines, which can think and operate autonomously and optimally.

Smart energy management of microgrids using genetic algorithms is discussed in [2] and [3]. Energy management of microgrids using fuzzy logic is discussed in [4]. Energy management of hybrid renewable energy generation using constrained optimization was proposed in [5]. An agent-based modeling approach is used to model microgrids, and the interactions between individual intelligent decision-makers are analyzed by simulation in [2], [3] and [8]. Expert systems and other classical and heuristic algorithms for the energy management of microgrids are discussed in [6], [7] and [9]. Reinforcement learning for the optimization of wind energy systems is discussed in [10] and [15]. The Coordinated Q-learning method in multi-agent reinforcement learning is discussed in [12] and [16]. The aim of this paper is to introduce a smart decision-making system using a multi-agent reinforcement learning method, called Coordinated Q-learning (CQ-learning), for the optimization of distributed energy management in the microgrid. The system behaves in a strategic manner when dealing with operational scenarios, aiming to achieve the lowest possible cost of power generation.

The rest of the paper is organized as follows. In Section II, the solar microgrid is explained with the details of the solar photovoltaic system. Section III presents the modeling framework of the consumer agent with a comprehensive framework of reinforcement learning. In Section IV, the multi-agent reinforcement learning framework is explained with CQ-learning. In Section V, the performance improvement of the solar microgrid by optimal scheduling of the battery, which increases the utility of the battery and solar power and reduces the power consumption from the grid, is discussed in detail. The performance of solar PV systems when operating individually using Q-learning and when operating in a distributed environment using CQ-learning is also analyzed and compared. Conclusions and suggestions for possible improvements are given in the last section.

converts it into electric energy [13]. PV generation fits the load demand very well, since solar irradiation is higher in the daytime.
making. In the MDP, the environment is modeled as a set of states, and actions can be performed to control the system state. The effect of an action taken in a state depends only on that state and not on the prior history. The goal is to control the system in such a way that some performance criterion is maximized. This section presents the reinforcement learning algorithm used by the consumer agent to interact, adapt, and take decisions towards its goal, defined in the form of reward functions in the MDP environment characterized by the available solar power output Psp, the load Dt and the level of battery charge Rt.

A. Markov Decision Process

A Markov Decision Process (MDP) is a way to model sequential decision making under uncertainty. We formalize an MDP considering discrete states and actions. The initial state is s0, and each state has a reward r associated with it. The transition function T(s'|s, a) indicates the probability of transitioning from state s to s' when action a is taken. A discount factor γ in the range 0...1 is applied to future rewards; this represents the notion that a current reward is more valuable than one in the future. If γ is near zero, future rewards are almost ignored; a γ near one places great value on future rewards. The reward from a policy is the sum of the discounted expected utilities of the states visited by that policy. The optimal policy is the policy that maximizes the total expected discounted reward.

B. Reinforcement Learning

A reinforcement learning algorithm is used to model the consumer's adaptation to a dynamically changing environment by performing battery scheduling actions in an MDP environment [10]. The goal of the agent is to find the optimal policy based on interactive learning with the environment. Fig. 3 shows a simple reinforcement learning scheme.

Fig. 3. Reinforcement learning: the agent (action selection, value estimation) interacts with the environment through actions, rewards and states.

The environment can be characterized by the values of a certain number of its features, collectively called its state, denoted by S(t) at time t. Every state has an intrinsic value, based on reward or cost, denoted by R(t) at time t. The agent observes the environment, takes an action, and receives a reward or punishment from the environment. This training information is used to evaluate the actions taken by the agent, in terms of the reward or punishment received from the environment, and the agent takes its next action so as to optimize the reward in the long run. After a sufficient number of interactions, with enough learning, the agent finds the optimal policy to achieve the long-term objective. The agent's choice of action is based on its past experience of actions taken in a given state and the concomitant reward or cost experienced, which it uses to update its decision-making process for future actions. In Dynamic Programming (DP), value iteration and policy iteration are the two methods for finding the optimal policy, but these methods require a model of the environment, so we choose a model-free reinforcement learning method [11].

C. Reward function

We optimize the battery scheduling of the solar microgrid by reinforcement learning. This is a process of action-reward dynamics, driven by quantitative performance indicators which evaluate the action or sequence of actions undertaken and feed the value back to adjust future scheduling decisions. The optimization of the numerical reward is achieved through the choice of the battery scheduling actions a0 and a1. The consumer aims at increasing its performance by selecting an optimal sequence of actions. The reward functions are the responses we get from the environment for the actions taken. If the battery is charging (a1), the reward is the minimum of Psp and Bdifference; if it is discharging (a0), the reward is the minimum of Dt and Blevel. Here, Bdifference is the difference between the maximum possible charge and the current battery level (Blevel). The optimal scheduling of the battery, and thus the increase of the solar microgrid performance with respect to the consumer goals, is achieved by Q-learning.

D. Q-learning

Q-learning is a model-free reinforcement learning method in which the agent explores the environment and learns, for each state, the next reward plus the best the agent can do from the next state. In Q-learning, the agent does not need any model of the environment; it only needs to know which states exist and which actions are possible in each state. We assign each state-action pair an estimated value, called a Q value [11]. When we visit a state and take an action, we receive a reward, which we use to update our estimate of the long-run value of that action. We visit the states infinitely often, and the action values (Q values) are continuously updated until they converge. The Q-learning algorithm is outlined in Algorithm 1 [11]. In the algorithm, γ is the discount factor and α the learning rate. The discount factor contributes to determining the value of future rewards, and the learning rate influences the speed of convergence of the Q values.
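The battery-scheduling reward and the tabular Q-value update can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: the battery capacity B_MAX, the state discretization, the epsilon-greedy exploration, the battery dynamics in step(), and the synthetic solar/load traces are assumptions introduced here for the sketch. Only the two actions (a0 discharge, a1 charge), the rewards min(Psp, Bdifference) and min(Dt, Blevel), and the standard update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)] follow the text.

```python
import random
from collections import defaultdict

# alpha (learning rate) and gamma (discount factor) as in the text;
# epsilon and B_MAX are assumptions for this sketch.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
B_MAX = 100.0                       # assumed battery capacity (kWh)

A_DISCHARGE, A_CHARGE = 0, 1        # actions a0 and a1

def reward(action, p_sp, d_t, b_level):
    """Reward from Sec. III-C: charging (a1) -> min(Psp, Bdifference),
    discharging (a0) -> min(Dt, Blevel)."""
    if action == A_CHARGE:
        return min(p_sp, B_MAX - b_level)   # Bdifference = B_MAX - Blevel
    return min(d_t, b_level)

def to_state(p_sp, d_t, b_level):
    # Assumed coarse discretization of (solar, load, battery) into bins.
    return (int(p_sp // 25), int(d_t // 25), int(b_level // 25))

Q = defaultdict(float)              # Q values for (state, action) pairs

def choose_action(state):
    # Epsilon-greedy: mostly exploit the current Q values, sometimes explore.
    if random.random() < EPSILON:
        return random.choice((A_DISCHARGE, A_CHARGE))
    return max((A_DISCHARGE, A_CHARGE), key=lambda a: Q[(state, a)])

def step(b_level, action, p_sp, d_t):
    # Assumed battery dynamics: charge from solar, discharge into the load.
    if action == A_CHARGE:
        return min(B_MAX, b_level + min(p_sp, B_MAX - b_level))
    return max(0.0, b_level - min(d_t, b_level))

random.seed(0)
b_level = 50.0
for t in range(5000):
    hour = t % 24
    # Synthetic hourly traces (illustrative only): daytime solar, evening load peak.
    p_sp = max(0.0, 80.0 * (1 - abs(hour - 12) / 6))
    d_t = 40.0 + 20.0 * (hour >= 18)
    s = to_state(p_sp, d_t, b_level)
    a = choose_action(s)
    r = reward(a, p_sp, d_t, b_level)
    b_next = step(b_level, a, p_sp, d_t)
    s_next = to_state(p_sp, d_t, b_next)
    best_next = max(Q[(s_next, A_DISCHARGE)], Q[(s_next, A_CHARGE)])
    # Tabular Q-learning update: Q(s,a) += alpha * (r + gamma*max_a' Q(s',a') - Q(s,a))
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
    b_level = b_next
```

After enough visits to each state-action pair, the greedy policy with respect to Q approximates the optimal charge/discharge schedule for the assumed traces; a real experiment would replace the synthetic traces with measured solar and load data.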
2014 IEEE International Conference on Computational Intelligence and Computing Research
Fig. 5. Solar power for a 150 kW unit over a year.
Fig. 7. Load pattern: hostel.
Fig. 8. Utility of battery in department.
Fig. 10. Grid power with QL in department.
Fig. 12. Utility of battery with CQL.
Fig. 13. Utility of solar power with CQL.
Fig. 14. Grid power with individual Q-learning.
Fig. 15. Grid power with CQ-learning.