
Machine Learning

Course Number: 41021903

AlBaha University
Faculty of Computer Science and Information Technology
Semester/Year: 2nd Year, 6th Level

Lecture-8: Reinforcement Learning

Dr. Fahad Ghamdi
Dr. Maha Alamri

References
• Main Textbook:
• Chris Bishop, Pattern Recognition and Machine Learning.
• Alternate References:
• T. Poggio and S. Smale. The Mathematics of Learning: Dealing with Data. Notices of the AMS, 2003.
• Pedro Domingos. A Few Useful Things to Know About Machine Learning. Communications of the ACM, Vol. 55, No. 10, October 2012, pp. 78-87.
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, Springer Verlag, 2009 (available for free from the author's website).
• NOTE:
• Some material was derived from online sources or authored by other Faculty members.

Outline
• Definition of Reinforcement Learning
• How Reinforcement Learning works
• Reinforcement Learning Algorithms
• Characteristics of Reinforcement Learning
• Learning Models of Reinforcement

Learning Types
• Supervised learning:
• A situation in which sample (input, output) pairs of the function to be learned can be perceived or are given
• You can think of it as having a kind teacher
• Reinforcement learning:
• The agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal
• RL emphasizes learning from feedback that evaluates the learner's performance without providing standards of correctness in the form of behavioral targets
• Example: learning to ride a bicycle

What is Reinforcement Learning?
• Reinforcement Learning is defined as a Machine Learning method that is
concerned with how software agents should take actions in an environment. 
• Problems involving an agent interacting with an environment, which provides
numeric reward signals
• Goal: Learn how to take actions in order to maximize reward

Reinforcement Learning
• Task
Learn how to behave successfully to achieve a goal while interacting with an external environment
• Learn via experiences!
• Examples
• Game playing: the player knows whether it wins or loses, but does not know which move to make at each step
• Control: a traffic system can measure the delay of cars, but does not know how to decrease it

Elements of Reinforcement Learning
Here are some important terms used in Reinforcement Learning:
Agent: An assumed entity which performs actions in an environment to gain some reward.
Environment (e): The scenario that the agent has to face.
Reward (R): An immediate return given to the agent when it performs a specific action or task.
State (s): The current situation returned by the environment.
Policy (π): The strategy that the agent applies to decide the next action based on the current state.
Value (V): The expected long-term return with discounting, as opposed to the short-term reward.
Value function: Specifies the value of a state, i.e. the total amount of reward the agent should expect to accumulate starting from that state.

Elements of Reinforcement Learning
• Model of the environment: Mimics the behavior of the environment. It allows inferences to be made about how the environment will behave.
• Model-based methods: Methods for solving reinforcement learning problems that use a model of the environment.
• Q value or action value (Q): The Q value is quite similar to the value V. The only difference between the two is that it takes the current action as an additional parameter.
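
To make these terms concrete, here is a minimal sketch of how agent, environment, state, action, reward, and policy fit together. The environment and all names are illustrative (they are not from any specific RL library), loosely modeled on the cat example discussed later:

```python
# A toy environment: the "cat" is either sitting or walking; asking it to
# walk while it sits earns a reward (the fish).
class Environment:
    def reset(self):
        self.state = "sitting"          # state (s): current situation
        return self.state

    def step(self, action):
        if self.state == "sitting" and action == "walk":
            self.state = "walking"
            return self.state, 1.0      # reward (R): immediate return
        return self.state, 0.0

# Policy (π): a rule mapping the current state to the next action.
policy = {"sitting": "walk", "walking": "sit"}

env = Environment()
state = env.reset()
action = policy[state]                  # the agent decides via its policy
state, reward = env.step(action)        # the environment reacts
print(state, reward)                    # -> walking 1.0
```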

Examples: Robot Locomotion

Objective: Make the robot move forward

State: Angle and position of the joints

Action: Torques applied to the joints

Reward: +1 at each time step for staying upright and moving forward

Examples: Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state

Action: Game controls, e.g. Left, Right, Up, Down

Reward: Score increase/decrease at each time step

Examples: Go

Objective: Win the game!

State: Position of all pieces

Action: Where to put the next piece down

Reward: 1 if the game is won at the end, 0 otherwise

How Reinforcement Learning works
Let's look at a simple example that illustrates the reinforcement learning mechanism: consider the scenario of teaching new tricks to our cat.

• As the cat doesn't understand English or any other human language, we can't tell her directly what to do. Instead, we follow a different strategy.
• We emulate a situation, and the cat tries to respond in many different ways. If the cat's response is the desired one, we give her fish.

How Reinforcement Learning works
• Now whenever the cat is exposed to the same situation, she executes a similar action even more enthusiastically, in expectation of getting more reward (food).
• That is how the cat learns "what to do" from positive experiences.
• At the same time, the cat also learns what not to do from negative experiences.

How Reinforcement Learning works
In this case:
• Our cat is an agent that is exposed to the environment; in this case, your house.
• An example of a state could be your cat sitting, while you use a specific word to make the cat walk.
• Our agent reacts by performing an action, transitioning from one "state" to another "state": for example, your cat goes from sitting to walking.
• The reaction of an agent is an action, and the policy is a method of selecting an action given a state, in expectation of better outcomes.
• After the transition, the agent may get a reward or a penalty in return.
Reinforcement Learning Algorithms
There are three approaches to implement a Reinforcement Learning algorithm.
• Value-Based: This method aims to maximize a value function V(s); the agent expects a long-term return from the current states under policy π.
• Policy-Based: This method aims to develop a policy such that the action performed in every state helps you gain maximum reward in the future. There are two types of policy-based methods (see the sketch after this list):
o Deterministic: For any state, the same action is produced by the policy π.
o Stochastic: Every action has a certain probability of being selected.
• Model-Based: This method creates a virtual model of each environment, and the agent learns to perform in that specific environment.
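
The deterministic/stochastic distinction can be shown in a few lines of Python (a sketch with made-up states and actions, not any library's API):

```python
import random

# Deterministic policy: each state always maps to the same action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability over actions.
stochastic_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def act(policy, state):
    rule = policy[state]
    if isinstance(rule, dict):          # stochastic: sample by probability
        actions, probs = zip(*rule.items())
        return random.choices(actions, weights=probs)[0]
    return rule                         # deterministic: fixed action

print(act(deterministic_policy, "s0"))  # always "left"
print(act(stochastic_policy, "s0"))     # "left" about 90% of the time
```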

Characteristics of Reinforcement Learning
• There is no supervisor, only a real-valued reward signal
• Sequential decision making
• Time plays a crucial role in Reinforcement problems
• Feedback is always delayed, not instantaneous
• Agent's actions determine the subsequent data it receives

Types of Reinforcement Learning
Two kinds of reinforcement learning methods are:
• Positive:
• It is defined as an event that occurs because of specific behavior. It increases the strength and the frequency of the behavior, and impacts positively on the actions taken by the agent.
• This type of reinforcement helps you maximize performance and sustain change for a more extended period. However, too much reinforcement may lead to over-optimization of a state, which can affect the results.

Types of Reinforcement Learning
Two kinds of reinforcement learning methods are:
• Negative:
• Negative reinforcement is defined as the strengthening of behavior that occurs because of a negative condition which should have been stopped or avoided. It helps you define the minimum standard of performance. However, the drawback of this method is that it provides only enough to meet that minimum behavior.

Learning Models of Reinforcement
There are two learning models in reinforcement learning:
• Markov Decision Process
• Q learning

Markov Decision Process
• Mathematical formulation of the RL problem
• Markov property: the current state completely characterises the state of the world

• Defined by the tuple (S, A, R, P, γ):
• S: set of possible states
• A: set of possible actions
• R: distribution of reward given a (state, action) pair
• P: transition probability, i.e. distribution over the next state given a (state, action) pair
• γ: discount factor
Markov Decision Process
• At time step t = 0, the environment samples an initial state s_0 ~ p(s_0)
• Then, for t = 0 until done:
• The agent selects an action a_t
• The environment samples a reward r_t ~ R(· | s_t, a_t)
• The environment samples the next state s_{t+1} ~ P(· | s_t, a_t)
• The agent receives reward r_t and next state s_{t+1}

• A policy π is a function from S to A that specifies what action to take in each state
• Objective: find the policy π* that maximizes the cumulative discounted reward Σ_{t≥0} γ^t r_t
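
A minimal sketch of this interaction loop in Python may help; the tiny MDP below (states, transitions, rewards) is made up purely for illustration:

```python
import random

# P[s][a]: list of (next_state, probability); R[s][a]: reward for (s, a).
P = {"s0": {"go": [("s1", 0.8), ("s0", 0.2)]}, "s1": {"go": [("s1", 1.0)]}}
R = {"s0": {"go": 1.0}, "s1": {"go": 0.0}}
gamma = 0.9                                  # discount factor

def rollout(policy, state, steps=10):
    """Run the loop for a fixed horizon; return the discounted reward sum."""
    total = 0.0
    for t in range(steps):
        action = policy(state)               # agent selects action a_t
        reward = R[state][action]            # environment emits reward r_t
        nxt, probs = zip(*P[state][action])
        state = random.choices(nxt, weights=probs)[0]  # sample s_{t+1}
        total += (gamma ** t) * reward       # accumulate gamma^t * r_t
    return total

print(rollout(lambda s: "go", "s0"))         # one sampled episode's return
```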

Markov Decision Process Example:
A simple MDP: Grid World

• Set a negative "reward" for each transition (e.g. r = -1)
• Objective: reach one of the terminal states (greyed out in the figure) in the least number of actions

Markov Decision Process Example:
A simple MDP: Grid World

[Figure: the random policy vs. the optimal policy on the grid world]

The optimal policy π *

• We want to find the optimal policy π* that maximizes the sum of rewards.
• How do we handle the randomness (initial state, transition probabilities, ...)?
• Maximize the expected sum of rewards:
π* = argmax_π E[ Σ_{t≥0} γ^t r_t | π ], with s_0 ~ p(s_0), a_t ~ π(· | s_t), s_{t+1} ~ P(· | s_t, a_t)

Q learning
• Q learning is a value-based method that learns which action an agent should take in each state.
• The Q-Learning algorithm:
1. Set the gamma parameter, and environment rewards in matrix R.
2. Initialize matrix Q to zero.
3. For each episode:
Select a random initial state.
Do While the goal state hasn't been reached.
• Select one among all possible actions for the current state.
• Using this possible action, consider going to the next state.
• Get maximum Q value for this next state based on all possible actions.
• Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Set the next state as the current state.
End Do
End For
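
The pseudocode above transcribes almost line-for-line into runnable Python. This is a sketch, not a canonical implementation: the reward matrix R, the goal state, and the episode count are supplied by the caller, and, as in the example that follows, -1 in R marks a missing link and each action is identified with the state it leads to:

```python
import random

def q_learning(R, goal, gamma=0.8, episodes=1000):
    n = len(R)
    Q = [[0.0] * n for _ in range(n)]       # 2. initialize matrix Q to zero
    for _ in range(episodes):               # 3. for each episode:
        state = random.randrange(n)         #    select a random initial state
        while state != goal:                #    until the goal state is reached
            # select one among all possible actions for the current state
            action = random.choice([a for a in range(n) if R[state][a] >= 0])
            nxt = action                    # the action leads to this next state
            # Q(state, action) = R(state, action) + Gamma * Max[Q(next, all)]
            Q[state][action] = R[state][action] + gamma * max(Q[nxt])
            state = nxt                     # next state becomes current state
    return Q
```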

Example: Q learning
• There are five rooms in a building which are connected by doors.
• The rooms are numbered 0 to 4.
• The outside of the building is treated as one big area, numbered 5.
• The doors of rooms 1 and 4 lead into the building from area 5.
• We can represent the rooms on a graph, with each room as a node and each door as a link.

Example: Q learning
Next, associate a reward value with each door:
• Doors which lead directly to the goal have a reward of 100
• Doors which are not directly connected to the target room give zero reward
• Doors are two-way, so two arrows are assigned between each pair of connected rooms
• Every arrow in the diagram carries an instant reward value
• Each room, including the outside, is a "state", and the agent's movement from one room to another is an "action". In our diagram, a "state" is drawn as a node, while an "action" is represented by an arrow.

Example: Q learning
For example, suppose an agent traverses from room 2 to area 5. The possible transitions are:
• Initial state = state 2
• State 2 -> state 3
• State 3 -> states (2, 1, 4)
• State 4 -> states (0, 5, 3)
• State 1 -> states (5, 3)
• State 0 -> state 4

Example: Q learning
• We can put the state diagram and the instant reward values into the following reward table, "matrix R" (rows are the current state, columns the next state; reconstructed from the links and rewards listed above):

State |  0    1    2    3    4    5
  0   | -1   -1   -1   -1    0   -1
  1   | -1   -1   -1    0   -1  100
  2   | -1   -1   -1    0   -1   -1
  3   | -1    0    0   -1    0   -1
  4   |  0   -1   -1    0   -1  100
  5   | -1    0   -1   -1    0  100

• The -1's in the table represent null values (i.e., where there isn't a link between nodes). For example, state 0 cannot go to state 1.
Example: Q learning
• We'll start by setting the value of the discount parameter Gamma = 0.8, and the initial state as Room 1.
• Initialize matrix Q as a 6×6 zero matrix.
• Look at the second row (state 1) of matrix R. There are two possible actions for the current state 1: go to state 3, or go to state 5. By random selection, we select to go to 5 as our action.
• Now imagine our agent is in state 5. Look at the sixth row of the reward matrix R (i.e. state 5). It has 3 possible actions: go to state 1, 4 or 5.
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100. Since matrix Q is still initialized to zero, Q(5, 1), Q(5, 4), and Q(5, 5) are all zero. The result of this computation for Q(1, 5) is 100 because of the instant reward R(1, 5).
• The next state, 5, now becomes the current state. Because 5 is the goal state, we've finished one episode.
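
This single update is easy to verify in code (a sketch hard-coding the values used above):

```python
gamma = 0.8
R_1_5 = 100                       # instant reward for moving from state 1 to 5
Q_row_5 = [0.0, 0.0, 0.0]         # Q(5, 1), Q(5, 4), Q(5, 5): still all zero
Q_1_5 = R_1_5 + gamma * max(Q_row_5)
print(Q_1_5)                      # -> 100.0
```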
Example: Q learning
• Once the matrix Q gets close enough to convergence, we know our agent has learned the optimal paths to the goal state. Tracing the best sequence of states is as simple as following the links with the highest values at each state.
• For example:
• From initial state 2, the agent can use the matrix Q as a guide:
• From state 2, the maximum Q value suggests the action to go to state 3.
• From state 3, the maximum Q values suggest two alternatives: go to state 1 or 4. Suppose we arbitrarily choose to go to 1.
• From state 1, the maximum Q value suggests the action to go to state 5.
• Thus the sequence is 2 - 3 - 1 - 5.
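
The full example can be reproduced with the reward matrix above; the compact, self-contained run below trains Q and then traces the greedy path from state 2 (the episode count and tie-breaking are arbitrary choices here, so 2 - 3 - 4 - 5 is an equally valid trace):

```python
import random

R = [  # reward matrix from the example; -1 means no door
    [-1, -1, -1, -1,  0, -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1, -1],
    [-1,  0,  0, -1,  0, -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
gamma, goal, n = 0.8, 5, 6
Q = [[0.0] * n for _ in range(n)]

for _ in range(1000):                        # train until Q is near convergence
    s = random.randrange(n)
    while s != goal:
        a = random.choice([a for a in range(n) if R[s][a] >= 0])
        Q[s][a] = R[s][a] + gamma * max(Q[a])
        s = a

s, path = 2, [2]                             # trace: follow the highest Q value
while s != goal:
    s = max((a for a in range(n) if R[s][a] >= 0), key=lambda a: Q[s][a])
    path.append(s)
print(path)                                  # e.g. [2, 3, 1, 5]
```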

Why use Reinforcement Learning?
• It helps find which situations need an action
• It helps discover which action yields the highest reward over the longer period
• Reinforcement Learning provides the learning agent with a reward function
• It allows the agent to figure out the best method for obtaining large rewards

Thank You!
