
Reinforcement Learning

1
Where does reinforcement learning stand?

▪ Supervised Learning: using a labelled training set to train a model, to then make predictions
on unlabelled data.

▪ Unsupervised Learning: giving a model an unlabelled dataset; the model must then find
patterns in the data in order to make predictions.

▪ Reinforcement Learning: training a model through a reward mechanism that encourages
positive behaviour when performance is good (particularly used in agent-based
simulations, gaming and robotics).

2
Reinforcement Learning (RL): #1
▪ RL enables an agent to learn in an interactive environment by trial and error using feedback from its
own actions and experiences. Examples include self-driving cars and the Go champion AlphaGo.

▪ Difference between supervised ML and RL? Though both supervised and reinforcement learning
use a mapping between input and output, unlike supervised learning, where the feedback provided to the
agent is the correct set of actions for performing a task, reinforcement learning uses rewards and
punishment as signals for positive and negative behaviour.

▪ Difference between unsupervised ML and RL? As compared to unsupervised learning,
reinforcement learning is different in terms of goals. While the goal in unsupervised learning is to
find similarities and differences between data points, in reinforcement learning the goal is to find a
suitable action model that would maximize the total cumulative reward of the agent.

https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html 3
Reinforcement Learning (RL): #1.1
▪ Task: given a million coins of mixed denominations, find the number of coins in each
denomination.

▪ Supervised: I give you 1 million coins and say, “There are three types of coins in this - 10s,
20s and 50s”. I also give you the standard weights of the three denominations

▪ Unsupervised learning: I give you 1 million coins and that is it! No clue about how many
'categories' of denominations are there, nor any weights.

▪ Reinforcement Learning: I give you 1 million coins and tell you “There are 3 categories”. But
I don't tell you which the categories are, nor the standard weights of any!

https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms 4
Reinforcement Learning (RL): #1.2
▪ RL is a subset of machine learning that attempts to find the maximum reward for a so-called
“agent” that interacts with an “environment.” RL is suitable for solving tasks that involve
deferred rewards, especially when those rewards are greater than intermediate rewards.

▪ RL refers to goal-oriented algorithms for reaching a complex goal, such as winning games
that involve multiple moves (e.g., chess or Go). RL algorithms are penalized for incorrect
decisions and rewarded for correct decisions: this reward mechanism is reinforcement.

▪ These algorithms are useful when the problem involves making decisions or taking actions.

Python 3 for Machine Learning, Oswald Campesato 5


Reinforcement Learning (RL): #2

Reinforcement learning: learn to select an action to maximize payoff.

https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html 6
4 core components
▪ Policy: defines the agent's behaviour (maps the different states to actions). Policies are usually
stochastic, since each specific action is associated with a probability of being selected.
▪ Reward: a signal that tells the agent how it should best modify its policy in order to achieve the
defined objectives (in the short term). The agent receives a reward from the environment
each time it performs an action.
▪ Value Function: used to estimate which actions can bring a greater return in the long run.
It works by assigning values to the different states, in order to assess what reward an agent should expect if
it starts from any specific state.
▪ Environment Model: simulates the dynamics of the environment the agent is placed in and how the
environment should respond to the different actions taken by the agent. Depending on the application,
some RL algorithms do not necessarily require an environment model (model-free approach), since they
can work by trial and error. However, model-based approaches can enable RL
algorithms to tackle more complicated tasks which require planning. (The agent–environment loop is sketched after this slide.)

https://www.kdnuggets.com/2021/04/getting-started-reinforcement-learning.html 7
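
To make these four components concrete, here is a minimal illustrative sketch of the agent–environment loop in Python. The environment, its states and its rewards are entirely made up for illustration (they are not taken from any library), and the policy is simply a random, stochastic one:

```python
import random

class ToyEnvironment:
    """Hypothetical 3-state environment, used only to illustrate the loop."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Environment model: the action moves us to a new state and the
        # environment sends back a reward for that transition.
        self.state = (self.state + action) % 3
        reward = 1.0 if self.state == 2 else 0.0
        return self.state, reward

def random_policy(state):
    # A stochastic policy: each available action has some probability of being picked.
    return random.choice([1, 2])

env = ToyEnvironment()
state, total_reward = 0, 0.0
for _ in range(10):                    # repeated trials
    action = random_policy(state)      # the policy maps the state to an action
    state, reward = env.step(action)   # the environment responds with a reward
    total_reward += reward             # the agent tries to maximize cumulative reward
print("total reward:", total_reward)
```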
Reinforcement Learning (RL): #3
▪ A common goal of a reinforcement learning algorithm is to learn an optimal policy

▪ Policy: Method to map agent’s state to actions

▪ In order to build an optimal policy, the agent faces the dilemma of exploring new states while maximizing its
reward at the same time. This is called the Exploration vs Exploitation trade-off.

▪ The game PacMan can be programmed using RL.

▪ An optimal policy is a function (similar to the model in supervised learning) that takes the feature vector of a
state as input and outputs an optimal action to execute in that state (a toy sketch follows after this slide).

▪ The action is optimal if it maximizes the expected average long-term reward. Reinforcement learning solves a
particular problem where decision making is sequential, and the goal is long-term, such as game playing, robotics,
resource management, or logistics.

https://www.dropbox.com/s/wxybbtbiv64yf0j/Chapter1.pdf?dl=0 8
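
As a toy illustration of a policy, the states, actions and reward estimates below are hypothetical (not from the source); the optimal policy simply returns, for each state, the action with the highest expected long-term reward:

```python
# Hypothetical estimates of the expected long-term reward of each (state, action) pair.
expected_return = {
    ("low_battery", "recharge"): 5.0,
    ("low_battery", "explore"): -1.0,
    ("full_battery", "recharge"): 0.5,
    ("full_battery", "explore"): 3.0,
}

def optimal_policy(state, actions=("recharge", "explore")):
    # The policy maps a state to the action with the highest expected long-term reward.
    return max(actions, key=lambda a: expected_return[(state, a)])

print(optimal_policy("low_battery"))   # -> recharge
print(optimal_policy("full_battery"))  # -> explore
```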
Reinforcement learning: #4
▪ In reinforcement learning, the output is an action or sequence of actions and the only
supervisory signal is an occasional scalar reward.

▪ The goal in selecting each action is to maximize the expected sum of the future
rewards.
▪ We usually use a discount factor for delayed rewards so that we don’t have to look too
far into the future (a small worked example follows after this slide).

▪ Reinforcement learning is difficult:
▪ The rewards are typically delayed, so it's hard to know where we went wrong (or right).
▪ A scalar reward does not supply much information.

https://www.cs.toronto.edu/~hinton/coursera_slides.html
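
A small worked example of the discount factor mentioned above; the reward values and the discount of 0.9 are assumptions made purely for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    # Each future reward is multiplied by gamma**t, so rewards far in the
    # future contribute less and we don't have to look too far ahead.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 1 received three steps from now is worth 0.9**3 = 0.729 today.
print(discounted_return([0.0, 0.0, 0.0, 1.0]))   # 0.729 (approximately)
print(discounted_return([1.0, 1.0, 1.0]))        # 1 + 0.9 + 0.81 = 2.71
```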
Reinforcement learning: #5
▪ Reinforcement learning is an important technique when you are deploying a piece of
software that aims to be “intelligent” but for which there is no training data at first.

▪ It is laborious and error-prone to have human data scientists hurriedly re-analyzing the data
as it comes in so that they can constantly deploy updated models. Reinforcement learning
makes it possible to have the system simply run itself, learning and improving without a
human in the loop.

▪ Of course you still want data scientists to carefully inspect the system’s performance after
the fact and look for patterns that the algorithm might have missed – there is no substitute
for human insight.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 10
Reinforcement learning: #6

▪ The key concept in reinforcement learning is the “exploration-exploitation tradeoff”: we
want to balance trying out different strategies to see what works against milking the
strategies that have worked the best so far.

▪ A typical reinforcement learning algorithm will start by always making random decisions.

▪ As the data starts to accumulate, it will sometimes make decisions at random, and at other
times it will make its best guess at the optimal decision.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 11
Two main challenges which characterize
Reinforcement Learning
▪ The exploration-exploitation dilemma: if an agent finds an action which gives it a
moderately high reward, it might be tempted not to try any other available action, for fear
that the alternative might be less successful. At the same time, if the agent never even attempts
a different action, it might never find out that better rewards were achievable.

▪ Processing of delayed rewards: agents are not told what actions to try, but should instead
come up with different solutions, test them and finally evaluate them based on the received
reward. Agents should not evaluate their actions just on their immediate rewards. Choosing
some types of actions might, in fact, provide greater rewards not immediately but in the long
run.

https://www.kdnuggets.com/2021/04/getting-started-reinforcement-learning.html 12
Reinforcement learning: #6.1

▪ The agent is the core of any RL problem. It is the part of the RL algorithm that processes
input information in order to perform an action. It explores and exploits knowledge from
repeated trials in order to learn how to maximize the reward. The scenario that the agent
has to deal with is known as the environment, while actions are the possible moves that an
agent can make in a given environment.

▪ The return from an environment upon taking an action is called the reward, and the course
of action that the agent applies to determine the next action based on the current state is
called the policy. The expected long-term return with a discount, as opposed to the
short-term reward, is called the value. The q-value is similar to the value, but takes the
current action as an additional parameter.

Jibin Mathew, PyTorch Artificial Intelligence Fundamentals 13


Reinforcement learning: #7

▪ There are two methods:

▪ Multi-armed bandit
▪ Q-learning

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 14
Reinforcement learning: #8 [Multi-armed bandit]
▪ The simplest reinforcement learning model is called the “multi-armed bandit.”

▪ There are many situations where a multi-armed bandit would be a reasonable way to model the world.
The “levers” could be different ads that you show visitors to a website, with the hope that they will click
on them.

▪ The key assumption in the multi-armed bandit is that all of the pulls are independent. Which levers you
have pulled previously, or how many pulls you have made in total, has no bearing on the potential outcome
of the next pull.

▪ The simplest reinforcement learning algorithm for a multi-armed bandit is called 𝜀-greedy. In this case 𝜀
measures how much we value exploration rather than exploitation.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 15
Reinforcement learning: #9 [Multi-armed bandit]

▪ At every point in time you pull a random lever with probability 𝜀, and with probability 1 − 𝜀
you pull whichever lever has given the best average returns so far (sketched in code after this slide).

▪ A higher value for 𝜀 means that you find the best lever faster, but at the cost that you are
regularly pulling sub-optimal levers.

▪ In many production systems 𝜀 slowly decays to 0 over time, so that you are eventually
pulling only the best lever.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 16
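
A minimal sketch of this 𝜀-greedy rule in Python; the levers' true payout probabilities below are made-up numbers that the algorithm never gets to see directly:

```python
import random

true_payout = [0.2, 0.5, 0.7]   # hidden payout probability of each lever (made up)
counts = [0, 0, 0]              # how many times each lever has been pulled
totals = [0.0, 0.0, 0.0]        # total reward collected from each lever
epsilon = 0.1                   # how much we value exploration over exploitation

def pull(lever):
    # The bandit pays out 1 with the lever's (unknown) probability, else 0.
    return 1.0 if random.random() < true_payout[lever] else 0.0

for _ in range(10_000):
    if random.random() < epsilon:
        lever = random.randrange(3)                   # explore: pull a random lever
    else:                                             # exploit: best average so far
        lever = max(range(3),
                    key=lambda i: totals[i] / counts[i] if counts[i] else 0.0)
    reward = pull(lever)
    counts[lever] += 1
    totals[lever] += reward

print(counts)   # most pulls should end up on the best lever (index 2)
```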
Reinforcement learning: #10 [Q-learning]
▪ A key limitation of the multi-armed bandit model is that all of the pulls are completely
independent of each other. This works well if every “pull” is a completely different
real-world event, but it breaks down if we are interacting with a single system that
changes over time.

▪ A more sophisticated alternative is called a “Markov decision process” (MDP). An MDP is
like a multi-armed bandit, except that the slot machine is in one of several different
internal states. Pulling a lever generates a reward, but it also changes the state of the
machine. The behaviour of each lever depends on what state you are in.

▪ This introduces the concept of delayed gratification: you may want to pull a low-reward
lever that puts the machine into a state that will be more profitable in later pulls. In the
multi-armed bandit the value of a lever was the average of the reward it gave you. In an
MDP it will be the average reward plus a measure of how valuable all your future rewards
will be.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 17
Reinforcement learning: #11 [Q-learning]
▪ There is an algorithm called Q-learning that is used to measure how valuable every lever is,
in every state of the system. The value of a lever is defined to be the average of the reward
you get by pulling it, plus the “discounted” sum of all rewards you will receive in future
pulls (a compact code sketch follows after this slide).
▪ When I say “discounted” I mean that the value of a reward decays exponentially with the
number of steps until you get it; for example, a reward might be worth 90% of its value if
you get it one step later, 90% × 90% = 81% if you get it in two steps, and so on.
▪ Adding up the current reward, plus the discounted sum of all future rewards, puts all parts
of a lever’s value into a single number. The same idea is called “discounted cash flow” in
economics. The discount factor (0.9 in our example) is a parameter that gets set manually
and reflects how much you value future rewards.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 18
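
A compact sketch of tabular Q-learning on a made-up two-state MDP (the rewards, transitions and parameter values are purely illustrative). Each update moves the value of a lever towards its immediate reward plus the discounted value of the best lever in the next state:

```python
import random

# Made-up two-state MDP: mdp[state][action] = (reward, next_state).
# In state 0, lever 1 costs a little but moves the machine into state 1,
# where lever 1 pays well -- the "delayed gratification" from the previous slide.
mdp = {
    0: {0: (0.0, 0), 1: (-1.0, 1)},
    1: {0: (0.0, 0), 1: (5.0, 0)},
}

gamma, alpha, epsilon = 0.9, 0.1, 0.1           # discount, learning rate, exploration
Q = {(s, a): 0.0 for s in mdp for a in (0, 1)}  # the q-table (lookup table)

state = 0
for _ in range(20_000):
    # choose a lever epsilon-greedily from the current Q estimates
    if random.random() < epsilon:
        action = random.choice([0, 1])
    else:
        action = max((0, 1), key=lambda a: Q[(state, a)])
    reward, next_state = mdp[state][action]
    # value of a lever = immediate reward + discounted value of the best future lever
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

print(Q)   # the learned values favour moving into state 1 and then pulling lever 1
```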
Reinforcement learning: #12 [Q-learning]

▪ If you set the discount factor to zero, then this all reduces to a multi-armed bandit.

▪ Crucially Q-learning does not tell you which lever you should actually pull; it estimates each
lever’s value, but the actual decision is often made in an 𝜀-greedy fashion.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 19
Why DRL (Deep RL)?

▪ An RL solution involves storing the results from trial and error in a lookup table, which
becomes huge when the environment becomes more and more complex.

▪ A deep neural network might learn, on its own, to recognize the same high-level features that a
programmer would have to hand-engineer in a lookup-table approach.

Jibin Mathew, PyTorch Artificial Intelligence Fundamentals 20


DQN = Deep Q-learning Network

▪ In the traditional RL algorithm, this q-value comes from a q-table, i.e. a lookup table holding
q-values. This lookup table is updated iteratively by playing the
game over and over and using the reward to update the table. The q-learning algorithm
learns the optimum values to populate this table. We can simply look up a given state in the
table and select the action with the maximum q-value in order to maximize the
chance of winning the game.

▪ With Deep Q-learning, instead of using a Q-table to look up the action with the maximum
possible q-value for a given state, we use a deep neural network to predict the Q-values for
the actions and pick the action with the maximum q-value for that state (a minimal sketch follows after this slide).

Jibin Mathew, PyTorch Artificial Intelligence Fundamentals 21
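
A minimal PyTorch sketch of this idea, assuming a hypothetical environment with 4 state features and 2 actions; the network stands in for the q-table, and the training loop (replay buffer, loss, optimizer) is omitted:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Predicts one Q-value per action for a given state vector (replaces the q-table)."""
    def __init__(self, n_state_features=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_state_features, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.rand(1, 4)          # a made-up state vector
q_values = q_net(state)           # one predicted Q-value per action
action = q_values.argmax(dim=1)   # pick the action with the maximum q-value
print(q_values, action)
```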


Applications
▪ Autonomous driving (proposed)
▪ Industry automation with Reinforcement Learning (DeepMind used RL to cool Google data
centres)
▪ Trading and finance
▪ In NLP, RL can be used in text summarization, question answering, and machine
translation, to mention just a few.
▪ Healthcare
▪ User recommendation
▪ Gaming

https://www.kdnuggets.com/2021/04/10-real-life-applications-reinforcement-learning.html
22
Resources
▪ https://spinningup.openai.com/en/latest/

23
