
Reinforcement Learning

Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them.

Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward function, a value function, and a model of the environment.

A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic.
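For instance, a policy can be as simple as a lookup table from states to actions, or a table of action probabilities when it is stochastic. A minimal sketch, with made-up states and actions:

```python
import random

# Deterministic policy: a plain lookup table mapping each state to an action.
# States and actions here are hypothetical placeholders.
deterministic_policy = {
    "low_battery": "recharge",
    "high_battery": "search",
}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "high_battery": {"search": 0.8, "wait": 0.2},
}

def act(state):
    """Return an action: sample from the stochastic policy if one is defined."""
    if state in stochastic_policy:
        actions, probs = zip(*stochastic_policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]
    return deterministic_policy[state]

print(act("high_battery"))   # "search" about 80% of the time
```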

A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state. A reinforcement learning agent's sole objective is to maximize the total reward it receives in the long run. The reward function defines what are the good and bad events for the agent. In a biological system, it would not be inappropriate to identify rewards with pleasure and pain. They are the immediate and defining features of the problem faced by the agent. As such, the reward function must necessarily be unalterable by the agent. It may, however, serve as a basis for altering the policy. For example, if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future. In general, reward functions may be stochastic.

A value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true. To make a human analogy, rewards are like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. The incorporation of models and planning into reinforcement learning systems is a relatively new development. Early reinforcement learning systems were explicitly trial-and-error learners; what they did was viewed as almost the opposite of planning. Nevertheless, it gradually became clear that reinforcement learning methods are closely related to dynamic programming methods, which do use models, and that they in turn are closely related to state-space planning methods.

Q-Learning

Q-Learning is one of the most important and simplest reinforcement learning algorithms. It uses the experience of each state transition to update one element of a table. The table contains an entry for each pair of state and action. After each action and state change the corresponding entry is updated, reflecting whether the reward received was good or poor. It can be run greedily, selecting in each state the action whose table entry promises the greatest reward. The algorithm provides a way of finding an optimal policy solely from experience.

The table is represented by: Q
Each state-action pair has an entry: Q(s, a)
State is represented by: s
Action is represented by: a
Reward is represented by: r
Positive step-size parameter: α
Discount-rate parameter: γ

A transition is represented by: s_{t+1}, having taken action a_t and received reward r_{t+1}
The Q-Learning update: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
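As a concrete sketch of this tabular update (the parameter values alpha = 0.1 and gamma = 0.9 and the function names are illustrative assumptions, not part of the text):

```python
from collections import defaultdict

alpha = 0.1   # positive step-size parameter (alpha)
gamma = 0.9   # discount-rate parameter (gamma)

# Q-table: one entry per (state, action) pair, implicitly initialized to 0.
Q = defaultdict(float)

def q_learning_update(s, a, r_next, s_next, actions):
    """Q(s,a) <- Q(s,a) + alpha * [r_{t+1} + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r_next + gamma * best_next - Q[(s, a)])

def greedy_action(s, actions):
    """Greedy selection: the action with the largest current table entry."""
    return max(actions, key=lambda a: Q[(s, a)])
```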

Model-Based Learning

In model-based learning, we completely know the environment model parameters, p(r_{t+1} | s_t, a_t) and P(s_{t+1} | s_t, a_t). In such a case, we do not need any exploration and can directly solve for the optimal value function and policy using dynamic programming. The optimal value function is unique and is the solution to the Bellman optimality equations. Once we have the optimal value function, the optimal policy is to choose the action that maximizes the value in the next state:

π*(s_t) = argmax_{a_t} ( E[r_{t+1} | s_t, a_t] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) )
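One standard dynamic-programming method for this setting is value iteration. The sketch below assumes the known model is supplied as lists P[s][a] of (probability, next_state, reward) triples, an encoding chosen only for illustration:

```python
def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """Solve for the optimal value function V* by repeated Bellman backups."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: V(s) = max_a sum_s' p * (r + gamma * V(s'))
            best = max(
                sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Optimal policy: choose the action that maximizes the expected next value.
    pi = {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[s][a]))
        for s in states
    }
    return V, pi
```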

Temporal Difference Learning

The model is defined by the reward and next state probability distributions, and as we saw in section 18.4, when we know these, we can solve for the optimal policy using dynamic programming. However, these methods are costly, and we seldom have such perfect knowledge of the environment. The more interesting and realistic application of reinforcement learning is when we do not have the model. This requires exploration of the environment to query the model. We first discuss how this exploration is done and later see model-free learning algorithms for deterministic and nondeterministic cases. Though we are not going to assume full knowledge of the environment model, we do require that it be stationary.

As we will see shortly, when we explore and get to see the value of the next state and reward, we use this information to update the value of the current state. These algorithms are called temporal difference algorithms.

i. Exploration Strategies

To explore, one possibility is to use ε-greedy search where, with probability ε, we choose one action uniformly randomly among all possible actions, namely explore, and with probability 1 − ε, we choose the best action, namely exploit. We do not want to continue exploring indefinitely but start exploiting once we do enough exploration; for this, we start with a high ε value and gradually decrease it. We need to make sure that our policy is soft, that is, the probability of choosing any action a ∈ A in state s ∈ S is greater than 0.

We can choose probabilistically, using the softmax function to convert values to probabilities,

P(a | s) = exp Q(s, a) / Σ_{b=1}^{|A|} exp Q(s, b)

and then sample according to these probabilities. To gradually move from exploration to exploitation, we can use a temperature variable T and define the probability of choosing action a as

P(a | s) = exp( Q(s, a) / T ) / Σ_{b=1}^{|A|} exp( Q(s, b) / T )

When T is large, all probabilities are equal and we have exploration. When T is small, better actions are favored. So the strategy is to start with a large T and decrease it gradually, a procedure named annealing, which in this case moves from exploration to exploitation smoothly in time.
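A minimal sketch of these two exploration strategies over a tabular Q indexed by (state, action); the function names and defaults are illustrative assumptions:

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit the best action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def softmax_action(Q, s, actions, T):
    """Sample an action with probability exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T)."""
    prefs = [math.exp(Q[(s, a)] / T) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]

# Annealing: start with a large T (near-uniform exploration) and shrink it over
# time, e.g. T = max(T_min, T * decay) after each episode.
```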

ii. Deterministic reward and actions

We start with the simpler deterministic case, where at any state-action pair there is a single possible reward and next state. In this case the value of an action reduces to

Q(s_t, a_t) = r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})

and we simply use this as an assignment to update Q(s_t, a_t). When in state s_t, we choose action a_t by one of the stochastic strategies we saw earlier, which returns a reward r_{t+1} and takes us to state s_{t+1}. We then update the value of the previous state-action pair as

Q̂(s_t, a_t) ← r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1})

where Q̂ denotes the current estimate of Q. Starting at zero, Q values increase, never decrease.

Consider the value of the action marked by * in the accompanying example, a state with two outgoing paths where path A leads to a successor whose best Q value is 81, path B leads to a successor whose best Q value is 100, and γ = 0.9:

If path A is seen first, Q(*) = 0.9 × max(0, 81) ≈ 73
Then B is seen, Q(*) = 0.9 × max(100, 81) = 90

Or,

If path B is seen first, Q(*) = 0.9 × max(100, 0) = 90
Then A is seen, Q(*) = 0.9 × max(100, 81) = 90

Either way, Q values increase but never decrease.
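The order-independence of the final value can be checked with a few lines; the successor values 81 and 100 and gamma = 0.9 come from the example above, while the labels are made up:

```python
gamma = 0.9

# Best Q values reachable through paths A and B from the state marked *;
# both start at zero before either path has been seen.
best_via = {"A": 0.0, "B": 0.0}

def q_star():
    """Deterministic update for the marked action: gamma * max over successor values."""
    return gamma * max(best_via.values())

best_via["A"] = 81.0        # path A seen first
print(q_star())             # 72.9 (about 73)

best_via["B"] = 100.0       # then path B is seen
print(q_star())             # 90.0, and it would also be 90.0 if B had come first
```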

Generalization

This is a supervised learning problem where we define a regressor Q(s, a | θ), taking s and a as inputs and parameterized by a vector of parameters θ, to learn Q values. For example, this can be an artificial neural network with s and a as its inputs, one output, and θ its connection weights. A good function approximator has the usual advantages and solves the problems discussed previously. A good approximation may be achieved with a simple model without explicitly storing the training instances; it can use continuous inputs; and it allows generalization. If we know that similar (s, a) pairs have similar Q values, we can generalize from past cases and come up with good Q(s, a) values even if that state-action pair has never been encountered before. To be able to train the regressor, we need a training set.

We would like Q(s_t, a_t) to get close to r_{t+1} + γ Q(s_{t+1}, a_{t+1}). So we can form a set of training samples where the input is the state-action pair (s_t, a_t) and the required output is r_{t+1} + γ Q(s_{t+1}, a_{t+1}). We can write the squared error as

E^t(θ) = [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]^2

and if we are using a gradient-descent method, as in training neural networks, the parameter vector is updated as

Δθ = η [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_θ Q(s_t, a_t)

Eligibility

With an eligibility trace the update becomes

Δθ = η δ_t e_t

where the temporal difference error is δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) and the trace is updated as e_t = γ λ e_{t−1} + ∇_θ Q(s_t, a_t), with e_0 all zeros.
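A rough sketch of this gradient update with a linear approximator Q(s, a | θ) = θᵀφ(s, a) and an eligibility trace; the feature map phi, the learning rate eta = 0.01, and the trace decay lam = 0.8 are illustrative assumptions, not values from the text:

```python
import numpy as np

eta, gamma, lam = 0.01, 0.9, 0.8   # learning rate, discount, trace decay (illustrative)
n_features = 8                     # dimension of the assumed feature vector phi(s, a)

theta = np.zeros(n_features)       # parameter vector theta
e = np.zeros(n_features)           # eligibility trace, e_0 = all zeros

def phi(s, a):
    """Hypothetical feature encoding: a fixed random projection keyed by the pair."""
    rng = np.random.default_rng(abs(hash((s, a))) % (2**32))
    return rng.standard_normal(n_features)

def q_value(s, a):
    """Linear approximation Q(s, a | theta) = theta . phi(s, a)."""
    return theta @ phi(s, a)

def td_step(s, a, r_next, s_next, a_next):
    """One gradient step: delta = r + gamma*Q(s',a') - Q(s,a); theta += eta*delta*e."""
    global theta, e
    delta = r_next + gamma * q_value(s_next, a_next) - q_value(s, a)
    e = gamma * lam * e + phi(s, a)    # for a linear model, grad_theta Q(s,a) = phi(s,a)
    theta = theta + eta * delta * e
```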

Partially Observable States

In certain applications, the agent does not know the state exactly. It is equipped with sensors that return
an observation, which the agent then uses to estimate the state.

Example: MDP

A Markov decision process is a 5-tuple (S, A, P, R, γ), where

S is a finite set of states,
A is a finite set of actions (alternatively, A_s is the finite set of actions available from state s),
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s' at time t + 1,
R_a(s, s') is the immediate reward (or expected immediate reward) received after transitioning from state s to state s', due to action a,
γ ∈ [0, 1] is the discount factor, which represents the difference in importance between future rewards and present rewards.
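As a concrete illustration of the 5-tuple, a tiny two-state MDP could be encoded as follows; every number and name below is invented purely for illustration:

```python
# A made-up two-state MDP following the 5-tuple (S, A, P, R, gamma):
# transition probabilities P[(s, a)][s'] and rewards R[(s, a, s')].
S = ["s0", "s1"]
A = ["stay", "move"]

P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

R = {
    ("s0", "move", "s1"): 1.0,   # reward for reaching s1 from s0 via "move"
}
gamma = 0.95

def expected_reward(s, a):
    """Expected immediate reward for taking action a in state s."""
    return sum(p * R.get((s, a, s2), 0.0) for s2, p in P[(s, a)].items())
```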
