Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a
policy, a reward function, a value function, and a model of the environment.
A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from
perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology
would be called a set of stimulus-response rules or associations. In some cases the policy may be a simple function or
lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core
of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may
be stochastic.
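As a minimal sketch of the two simplest cases mentioned above, a deterministic policy can be stored as a lookup table and a stochastic policy as a table of action probabilities. The state names, action names, and the helper `act` are hypothetical and chosen only for illustration.

```python
import random

# Deterministic policy: a lookup table from perceived state to action.
deterministic_policy = {"low_battery": "recharge", "high_battery": "search"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.2, "search": 0.8},
}

def act(policy, state, rng=random):
    """Return an action for `state` under either kind of policy."""
    choice = policy[state]
    if isinstance(choice, str):  # deterministic entry: the action itself
        return choice
    actions = list(choice)
    weights = [choice[a] for a in actions]
    # Sample one action according to the stored probabilities.
    return rng.choices(actions, weights=weights, k=1)[0]

print(act(deterministic_policy, "low_battery"))
print(act(stochastic_policy, "high_battery"))
```

In both cases the policy alone determines behavior: given a state, nothing else is consulted to pick the action.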
A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps each perceived
state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that
state. A reinforcement learning agent's sole objective is to maximize the total reward it receives in the long run. The
reward function defines what are the good and bad events for the agent. In a biological system, it would not be
inappropriate to identify rewards with pleasure and pain. They are the immediate and defining features of the problem
faced by the agent. As such, the reward function must necessarily be unalterable by the agent. It may, however, serve
as a basis for altering the policy. For example, if an action selected by the policy is followed by low reward, then the
policy may be changed to select some other action in that situation in the future. In general, reward functions may be
stochastic.
A value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of
reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the
immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking
into account the states that are likely to follow, and the rewards available in those states. For example, a state might
always yield a low immediate reward but still have a high value because it is regularly followed by other states that
yield high rewards. Or the reverse could be true. To make a human analogy, rewards are like pleasure (if high) and
pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are
that our environment is in a particular state.
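The difference between reward and value can be illustrated with a small sketch. It assumes a hypothetical deterministic chain of states in which each state has a single successor; the state names, the reward numbers, and the optional discount `gamma` are assumptions, not part of the text above.

```python
# Hypothetical deterministic chain: each state has one successor and one
# reward received on leaving it. "end" is terminal.
next_state = {"dull": "jackpot", "jackpot": "end", "tempting": "end"}
reward = {"dull": 0.0, "jackpot": 10.0, "tempting": 1.0}

def value(state, gamma=1.0):
    """Total (optionally discounted) reward accumulated from `state` onward."""
    if state == "end":
        return 0.0
    return reward[state] + gamma * value(next_state[state], gamma)

for s in ("dull", "tempting"):
    print(s, "reward:", reward[s], "value:", value(s))
```

Here "dull" has the lower immediate reward but the higher value, because it is followed by a state that yields a high reward.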
The fourth and final element of some reinforcement learning systems is a model of the environment. This is something
that mimics the behavior of the environment. For example, given a state and action, the model might predict the
resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course
of action by considering possible future situations before they are actually experienced. The incorporation of models
and planning into reinforcement learning systems is a relatively new development. Early reinforcement learning
systems were explicitly trial-and-error learners; what they did was viewed as almost the opposite of planning.
Nevertheless, it gradually became clear that reinforcement learning methods are closely related to dynamic
programming methods, which do use models, and that they in turn are closely related to state-space planning
methods.
Q-learning
Q-learning is one of the simplest and most important reinforcement learning algorithms. It uses the
experience of each state transition to update one element of a table that holds an entry for every
state-action pair. After each action and the resulting state change, the corresponding entry is updated
to reflect whether the reward received was good or poor. Actions can then be selected greedily, choosing
in each state the action with the highest table value. The algorithm provides a way of
finding an optimal policy solely from experience.
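A transition is represented by the next state $s_{t+1}$, reached after taking the action $a_t$ and receiving the reward $r_{t+1}$.
The Q-learning update is
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor.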
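The update can be sketched as a short tabular implementation. The corridor environment, the function names, and the parameter values (learning rate `alpha`, discount factor `gamma`, exploration rate `epsilon`) are assumptions made for illustration, not part of the description above.

```python
import random
from collections import defaultdict

GOAL = 4             # rightmost cell of a hypothetical 5-cell corridor
ACTIONS = (-1, +1)   # step left, step right

def step(state, action):
    """Move along the corridor; reward 1 only on reaching GOAL."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # one entry per (state, action) pair, initially 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:            # explore
                a = rng.choice(ACTIONS)
            else:                                 # exploit, breaking ties at random
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r, done = step(s, a)
            # A terminal state contributes no future value.
            best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
            # Move the table entry toward r + gamma * max_a Q(s', a).
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)}
print(greedy)
```

The agent never sees the transition rules inside `step`; it only observes the resulting state and reward, which is what learning solely from experience means here.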
The model is defined by the reward and next state probability distributions, and as we saw in section 18.4, when we know
these, we can solve for the optimal policy using dynamic programming. However, these methods are costly, and we
seldom have such perfect knowledge of the environment. The more interesting and realistic application of
reinforcement learning is when we do not have the model. This requires exploration of the environment to query the
model. We first discuss how this exploration is done and later see model-free learning algorithms for deterministic and
nondeterministic cases. Though we are not going to assume a full knowledge of the environment model, we will
however require that it be stationary.
As we will see shortly, when we explore and get to see the value of the next state and reward, we use this information
to update the value of the current state. These algorithms are called temporal difference algorithms.
i. Exploration Strategies
To explore, one possibility is to use $\epsilon$-greedy search where with probability $\epsilon$, we choose one action uniformly randomly
among all possible actions, namely, explore, and with probability $1 - \epsilon$, we choose the best action, namely, exploit. We
do not want to continue exploring indefinitely but start exploiting once we do enough exploration; for this, we start with
a high $\epsilon$ value and gradually decrease it. We need to make sure that our policy is soft, that is, the probability of choosing
any action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$ is greater than 0.
We can choose probabilistically, using the softmax function to convert values to probabilities
$$P(a \mid s) = \frac{\exp Q(s, a)}{\sum_{b=1}^{|\mathcal{A}|} \exp Q(s, b)}$$
and then sample according to these probabilities. To gradually move from exploration to exploitation, we can use a
temperature variable $T$ and define the probability of choosing action $a$ as
$$P(a \mid s) = \frac{\exp[Q(s, a)/T]}{\sum_{b=1}^{|\mathcal{A}|} \exp[Q(s, b)/T]}$$
When $T$ is large, all probabilities are equal and we have exploration. When $T$ is small, better actions are favored. So the
strategy is to start with a large $T$ and decrease it gradually, a procedure named annealing, which in this case moves from
exploration to exploitation smoothly in time.
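Both exploration strategies can be sketched in a few lines. The function names and the example Q values are hypothetical; the softmax subtracts the largest value before exponentiating, a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore uniformly with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_probs(q_values, temperature=1.0):
    """P(a|s) proportional to exp(Q(s,a)/T); subtracting the max avoids overflow."""
    top = max(q_values)
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_action(q_values, temperature, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [1.0, 2.0, 3.0]  # hypothetical Q(s, a) values for three actions
for T in (100.0, 1.0, 0.1):
    print(T, [round(p, 3) for p in softmax_probs(q, T)])
```

Running the loop with a decreasing `T` shows the annealing effect: the distribution starts close to uniform and concentrates on the best action as the temperature falls. Every action keeps a nonzero probability, so the policy stays soft.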
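In the deterministic case, where each state-action pair leads to a single reward and next state, the value of an action satisfies
$$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
and we simply use this as an assignment to update $Q(s_t, a_t)$. When in state $s_t$, we choose action $a_t$ by one of the
stochastic strategies we saw earlier, which returns a reward $r_{t+1}$ and takes us to state $s_{t+1}$. We then update the value of
the previous action as
$$Q(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
Starting at zero, Q values increase and never decrease (assuming rewards are nonnegative).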
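A small worked example may help show how the assignment propagates value backward along a path. The two-step path, the reward of 100, and the discount factor 0.9 are assumptions chosen for illustration.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic transitions: (state, action) -> (reward, next_state).
# State "G" is terminal, so it has no actions and contributes no future value.
transitions = {("A", "go"): (0.0, "B"), ("B", "go"): (100.0, "G")}
actions = {"A": ["go"], "B": ["go"], "G": []}
Q = {sa: 0.0 for sa in transitions}  # start all estimates at zero

def backup(state, action):
    """Assign r + gamma * max_a' Q(s', a') to Q(state, action)."""
    r, nxt = transitions[(state, action)]
    best_next = max((Q[(nxt, a)] for a in actions[nxt]), default=0.0)
    Q[(state, action)] = r + GAMMA * best_next

history = []
for _ in range(2):  # two passes along the path A -> B -> G
    for sa in [("A", "go"), ("B", "go")]:
        backup(*sa)
        history.append(dict(Q))

print(Q)
```

On the first pass only the step next to the goal learns a nonzero value; on the second pass that value, discounted once, reaches the earlier step. No estimate decreases between successive backups.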
Generalization
This is a supervised learning problem where we define a regressor $Q(s, a \mid \theta)$, taking $s$ and $a$ as inputs and
parameterized by a vector of parameters, $\theta$, to learn Q values. For example, this can be an artificial neural
network with $s$ and $a$ as its inputs, one output, and $\theta$ its connection weights. A good function
approximator has the usual advantages and solves the problems discussed previously. A good
approximation may be achieved with a simple model without explicitly storing the training instances; it
can use continuous inputs; and it allows generalization. If we know that similar $(s, a)$ pairs have similar Q
values, we can generalize from past cases and come up with good $Q(s, a)$ values even if that state-action
pair has never been encountered before. To be able to train the regressor, we need a training set.
We would like $Q(s_t, a_t)$ to get close to $r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$. So, we can form a set of training samples where the
input is the state-action pair $(s_t, a_t)$ and the required output is $r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$. We can write the squared
error as
$$E^t(\theta) = \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]^2$$
and if we are using a gradient-descent method, as in training neural networks, the parameter vector is updated as
$$\Delta\theta = \eta \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \nabla_{\theta_t} Q(s_t, a_t)$$
where $\eta$ is the learning rate.
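The gradient update can be sketched with the simplest possible regressor, a linear model, in place of a neural network. The feature map, the function names, and the values of the learning rate `eta` and discount `gamma` are assumptions for illustration; for a linear model the gradient of Q with respect to the parameters is just the feature vector.

```python
def features(state, action, n_actions=2):
    """Hypothetical feature map: the state vector copied into the chosen action's slot."""
    phi = [0.0] * (n_actions * len(state))
    start = action * len(state)
    phi[start:start + len(state)] = [float(x) for x in state]
    return phi

def q_value(theta, state, action):
    """Linear regressor Q(s, a | theta) = theta . phi(s, a)."""
    return sum(w * x for w, x in zip(theta, features(state, action)))

def gradient_step(theta, s, a, r, s2, a2, eta=0.1, gamma=0.9):
    """theta += eta * [r + gamma * Q(s2, a2) - Q(s, a)] * grad_theta Q(s, a)."""
    td_error = r + gamma * q_value(theta, s2, a2) - q_value(theta, s, a)
    grad = features(s, a)  # gradient of a linear model is its feature vector
    return [w + eta * td_error * g for w, g in zip(theta, grad)]

theta = [0.0, 0.0, 0.0, 0.0]
theta = gradient_step(theta, s=[1.0, 0.0], a=0, r=1.0, s2=[0.0, 1.0], a2=1)
print(theta)
```

Because the parameters are shared across states, a single update also changes the estimate for every other state whose features overlap with the one just visited, which is the generalization the paragraph describes.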
Eligibility
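With eligibility traces, the update becomes
$$\Delta\theta = \eta \, \delta_t \, e_t$$
where the temporal difference error is
$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
and the vector of eligibilities is updated as
$$e_t = \gamma \lambda e_{t-1} + \nabla_{\theta_t} Q(s_t, a_t)$$
with $e_0$ all zeros.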
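A minimal sketch of one such update, assuming a linear model so that the gradient of Q is the feature vector of the current state-action pair. The function name, the feature vectors, and the parameter values (`eta`, `gamma`, `lam`) are hypothetical.

```python
def trace_step(theta, trace, phi, r, phi_next, eta=0.1, gamma=0.9, lam=0.8):
    """One eligibility-trace update for a linear model Q(s, a) = theta . phi(s, a).

    phi and phi_next are the feature vectors of (s_t, a_t) and (s_{t+1}, a_{t+1}).
    """
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)  # TD error
    trace = [gamma * lam * e + g for e, g in zip(trace, phi)]   # decay, then add gradient
    theta = [w + eta * delta * e for w, e in zip(theta, trace)]
    return theta, trace

theta, trace = [0.0, 0.0], [0.0, 0.0]   # e_0 is all zeros
# First step: no reward, so the parameters stay put but the trace records the visit.
theta, trace = trace_step(theta, trace, phi=[1.0, 0.0], r=0.0, phi_next=[0.0, 1.0])
# Second step: a reward arrives; an all-zero phi_next stands for a terminal state.
theta, trace = trace_step(theta, trace, phi=[0.0, 1.0], r=1.0, phi_next=[0.0, 0.0])
print(theta, trace)
```

The reward on the second step also adjusts the parameter for the pair visited on the first step, scaled down by the decayed trace. This is how eligibilities spread credit to earlier state-action pairs.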
In certain applications, the agent does not know the state exactly. It is equipped with sensors that return
an observation, which the agent then uses to estimate the state.
Example- MDP
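A Markov decision process is a 5-tuple $(S, A, P_\cdot(\cdot,\cdot), R_\cdot(\cdot,\cdot), \gamma)$, where
$S$ is a finite set of states,
$A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
$P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
$R_a(s, s')$ is the immediate reward (or expected immediate reward) received after transitioning to state $s'$ from state $s$,
$\gamma \in [0, 1]$ is the discount factor, which represents the difference in importance between future
rewards and present rewards.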
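The tuple can be written down directly as a small data structure. The class layout, the method names, and the two-state example are assumptions for illustration; rewards missing from `R` are treated as zero.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list
    actions: list
    P: dict       # P[(s, a)] -> {s_next: probability}
    R: dict       # R[(s, a, s_next)] -> immediate reward
    gamma: float  # discount factor

    def validate(self):
        """Check that each transition distribution sums to 1 and 0 <= gamma <= 1."""
        ok = 0.0 <= self.gamma <= 1.0
        for dist in self.P.values():
            ok = ok and abs(sum(dist.values()) - 1.0) < 1e-9
        return ok

    def expected_reward(self, s, a):
        """Expected immediate reward of taking action a in state s."""
        return sum(p * self.R.get((s, a, s2), 0.0)
                   for s2, p in self.P[(s, a)].items())

# Hypothetical two-state example.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    P={("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
       ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}},
    R={("s0", "move", "s1"): 5.0},
    gamma=0.9,
)
print(toy.validate(), toy.expected_reward("s0", "move"))
```

Storing `P` as a distribution per state-action pair keeps the Markov property explicit: the next state depends only on the current state and action.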