
****************************************************************

Reinforcement Learning
****************************************************************

**Elements of Reinforcement Learning**

A policy is a mapping from states of the environment to actions to be taken in those states.


Policies may be stochastic, specifying the probability of each action.
The policy is the core of a reinforcement learning agent.

Reward defines the goal of a reinforcement learning problem: at each time step the environment sends a single reward to the agent, whose objective is to maximize the total reward over the long run.

IMP
If an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future.

The value function specifies what is good in the long run.

The value of a state is the total amount of reward that an agent can expect to accumulate over the future, starting from that state.

Value estimation is arguably the most important thing that has been learned about reinforcement learning in recent decades.

A model is something that mimics the behavior of the environment.

Methods that use models and planning are called model-based methods.

**Limitations and Scope**

1- Evolutionary methods ignore much of the useful structure of the reinforcement learning problem: they do not use the fact that the policy they are searching for is a function from states to actions.

2- Evolutionary methods do not notice which states an individual passes through during its lifetime, or which actions it selects. In some cases this information can be misleading, but more often it enables a more efficient search.

1.5 An Extended Example: Tic-Tac-Toe

First: We would set up a table of numbers, one for each possible state of the game.

Second: Each number will be the latest estimate of the probability of our winning
from that state.

Third: We treat this estimate as the state’s value, and the whole table is the
learned value function.
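
The following is a minimal Python sketch (not from the book) of the value table and learning update described above; the string encoding of states, the step-size of 0.1, and the default value of 0.5 are illustrative assumptions:

```python
# Minimal sketch of the tic-tac-toe value table described above.
# States are assumed to be encoded as strings describing the board.

values = {}    # state -> latest estimate of the probability of winning
ALPHA = 0.1    # step-size parameter (assumed value)

def value(state, default=0.5):
    """Look up a state's value, initializing unseen states to 0.5."""
    return values.setdefault(state, default)

def backup(state, next_state):
    """Move the value of `state` a fraction ALPHA toward the value of
    the state that follows it (a temporal-difference-style update)."""
    values[state] = value(state) + ALPHA * (value(next_state) - value(state))
```

After each greedy move the value of the earlier state is nudged toward the value of the later state; terminal states would be fixed at 1 for a win and 0 for a loss or a draw, as in the book's example.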

Part I: Tabular Solution Methods


**************************
In this part of the book we describe almost all the core ideas of reinforcement learning algorithms in their simplest forms.

Chapter 2
*********
Multi-armed Bandits
******************

At the end of this chapter, we take a step closer to the full reinforcement
learning problem by discussing what happens when the bandit problem
becomes associative, that is, when actions are taken in more than one situation.

2.1 A k-armed Bandit Problem

This is the original form of the k-armed bandit problem, so named by analogy to a slot machine, or “one-armed bandit,” except that it has k levers instead of one. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated action selections you are to maximize your winnings by concentrating your actions on the best levers. Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each action is the selection of a treatment, and each reward is the survival or well-being of the patient. Today the term “bandit problem” is sometimes used for a generalization of the problem described above, but in this book we use it to refer just to this simple case.
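
To make the slot-machine analogy concrete, here is a small sketch of a stationary k-armed bandit testbed; the class name, the Gaussian reward distributions, and the choice of k = 10 are assumptions for illustration, not something fixed by the text:

```python
import random

class KArmedBandit:
    """A stationary k-armed bandit: each lever pays a normally distributed
    reward centered on a fixed true value that the agent never observes."""

    def __init__(self, k=10, seed=None):
        self._rng = random.Random(seed)
        self.k = k
        # True action values, drawn once at construction time.
        self.true_values = [self._rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        """Play lever `action` and return a noisy reward around its true value."""
        return self._rng.gauss(self.true_values[action], 1.0)
```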

Equation (2.1) in the textbook defines the value of an action as the expected reward given that the action is selected: q*(a) = E[R_t | A_t = a].

The action with the greatest estimated value is called the greedy action.

Exploiting: selecting the greedy action, the one with the current highest estimated value.

Exploring: selecting a nongreedy action, because this enables you to improve your estimate of that action's value.
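
As a rough sketch of how exploiting and exploring can be balanced, the function below uses epsilon-greedy action selection with sample-average value estimates, assuming the hypothetical KArmedBandit sketch shown earlier; epsilon = 0.1 and the step count are illustrative choices:

```python
import random

def run_epsilon_greedy(bandit, steps=1000, epsilon=0.1, seed=None):
    """With probability epsilon, explore (pick a random action); otherwise
    exploit the greedy action, the one with the highest current estimate."""
    rng = random.Random(seed)
    q = [0.0] * bandit.k   # estimated value of each action
    n = [0] * bandit.k     # number of times each action has been selected
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(bandit.k)                    # explore
        else:
            action = max(range(bandit.k), key=lambda a: q[a])   # exploit
        reward = bandit.pull(action)
        n[action] += 1
        q[action] += (reward - q[action]) / n[action]   # incremental sample average
        total_reward += reward
    return q, total_reward
```

With epsilon = 0 the agent always exploits its current estimates; a small positive epsilon keeps improving the estimates of the nongreedy actions at the cost of occasional lower immediate reward.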

2.2 Action-value Methods
