RL Book Summary
Reinforcement Learning
****************************************************************
Reward defines the goal of a reinforcement learning problem: at each time
step the environment sends a single reward to the agent, whose objective is to
maximize the total reward it receives over the long run.
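In the book's standard notation (a sketch; these notes do not spell the formula
out), "total reward over the long run" is the return G_t, the sum of the rewards
received after time step t, here assuming the undiscounted episodic case:

    G_t = R_{t+1} + R_{t+2} + \dots + R_T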
IMP
If an action selected by the policy is followed by a low reward, then the policy
might be changed to select some other action in that situation in the future.
The value of a state is the total amount of reward that an agent can expect to
accumulate over the future, starting from that state.
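In symbols (again a sketch in the book's notation, assuming behavior under a
policy \pi), this expected cumulative reward is the state-value function:

    v_\pi(s) = \mathbb{E}_\pi[ G_t \mid S_t = s ]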
Value estimation is arguably the most important thing that has been learned about
reinforcement learning over the past several decades.
First: We would set up a table of numbers, one for each possible state of the game.
Second: Each number will be the latest estimate of the probability of our winning
from that state.
Third: We treat this estimate as the state’s value, and the whole table is the
learned value function.
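A minimal Python sketch of this value-table idea (the state encoding, the step
size ALPHA, and the helper names here are illustrative assumptions, not the
book's code):

    # Value table: maps each encountered game state (any hashable
    # representation, e.g. a tuple describing the board) to the current
    # estimate of the probability of winning from that state.
    values = {}

    def value(state, default=0.5):
        # Unseen states start at a neutral 0.5 estimate (an assumption of
        # this sketch; terminal states would instead be initialized to 1.0
        # for a win and 0.0 for a loss or draw).
        return values.get(state, default)

    ALPHA = 0.1  # step-size parameter controlling how fast estimates move

    def backup(state, next_state):
        # Temporal-difference-style update: move the earlier state's value
        # a fraction ALPHA of the way toward the later state's value.
        values[state] = value(state) + ALPHA * (value(next_state) - value(state))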
Chapter 2
*********
Multi-armed Bandits
******************
At the end of this chapter, we take a step closer to the full reinforcement
learning problem by discussing what happens when the bandit problem
becomes associative, that is, when actions are taken in more than one situation.
This is the original form of the k-armed bandit problem, so named by analogy to a
slot machine, or “one-armed bandit,” except that it has k levers instead of one.
Each action selection is like a play of one of the slot machine’s levers, and the
rewards are the payoffs for hitting the jackpot. Through repeated action
selections you are to maximize your winnings by concentrating your actions on the
best levers. Another analogy is that of a doctor choosing between experimental
treatments for a series of seriously ill patients. Each action is the selection of
a treatment, and each reward is the survival or well-being of the patient. Today
the term “bandit problem” is sometimes used for a generalization of the problem
described above, but in this book we use it to refer just to this simple case.
The action with the greatest estimated value is called the greedy action.
Exploiting: selecting the greedy action, the one with the current highest
estimated value.
Exploring: selecting a nongreedy action, because this enables you to improve your
estimate of that action's value.
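A minimal Python sketch of this exploit/explore trade-off as epsilon-greedy
action selection with sample-average value estimates (the EPSILON value, the
simulated Gaussian bandit, and all names here are illustrative assumptions):

    import random

    K = 10          # number of bandit arms (levers)
    EPSILON = 0.1   # probability of exploring on a given step (assumed)

    # True action values of a simulated bandit; unknown to the agent.
    true_values = [random.gauss(0, 1) for _ in range(K)]

    Q = [0.0] * K   # estimated value of each action
    N = [0] * K     # number of times each action has been selected

    def select_action():
        if random.random() < EPSILON:
            return random.randrange(K)            # explore: any action
        return max(range(K), key=lambda a: Q[a])  # exploit: greedy action

    for _ in range(1000):
        a = select_action()
        reward = random.gauss(true_values[a], 1)  # noisy reward from arm a
        N[a] += 1
        # Incremental sample-average update of the action-value estimate.
        Q[a] += (reward - Q[a]) / N[a]

With EPSILON = 0 the agent always exploits and can get stuck on a lever it
happened to estimate well early; a small positive EPSILON keeps improving the
estimates of the nongreedy actions.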