
UNIT-5 PART C 

1) Explain the Q function and the Q-learning algorithm, assuming deterministic rewards and actions, with an example.

ans)
https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
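If the link is unavailable, the core idea can be summarised as follows (a standard textbook formulation, not taken from the linked article): with deterministic rewards and actions, the Q function is defined recursively as

    Q(s, a) = r(s, a) + gamma * max over a' of Q(delta(s, a), a')

where delta(s, a) is the deterministic successor state and gamma the discount factor. The learning algorithm repeatedly executes an action a in state s, observes the immediate reward r(s, a) and the successor state delta(s, a), and updates its table entry for Q(s, a) with the same rule; no learning rate is needed in the deterministic case, and the estimates converge to the true Q values if every state-action pair is visited infinitely often.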

2) Explain the k-nearest neighbor algorithm for approximating a discrete-valued function f : ℝⁿ → V, with pseudocode.
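No answer is filled in above; as a sketch, the k-nearest-neighbor classification rule can be written in Python roughly as follows (plain Euclidean distance and majority voting are assumed):

import math
from collections import Counter

def knn_classify(training_examples, query, k=3):
    # training_examples: list of (feature_vector, label) pairs
    # query: feature vector in R^n whose discrete label in V we want to predict
    def distance(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    # Keep the k training examples closest to the query point.
    nearest = sorted(training_examples, key=lambda ex: distance(ex[0], query))[:k]

    # Majority vote among the labels of the k nearest neighbors.
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical usage:
# data = [((1.0, 2.0), 'A'), ((1.5, 1.8), 'A'), ((5.0, 8.0), 'B')]
# knn_classify(data, (1.2, 1.9), k=3)   # -> 'A'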

 
 
 

 
 
3) Compare unsupervised learning and reinforcement learning with examples.
 
4) Develop a Q-learning task for the recommendation system of an online shopping website. What will be the environment of the system? Write the cost function and value function for the system.

A Q-learning task has these components: Agent, Environment, State, Reward Function, Value Function and Policy.

1) To simplify the problem, we assume a hypothetical user whose experience on the online shopping store is pooled from all the actual users.

2) Our recommender model is the agent of the system: it recommends products to this hypothetical user, who will buy or not buy the recommendation.

3) The user behaves as the system's environment, responding to the system's recommendation depending on the state of the system.

4) User feedback determines our reward: a score of 1 is given only if the user buys, and 0 otherwise.

5) The action of the agent is the product recommended.

6) Our state is defined as the product features and corresponding user reactions of the past 5 steps, excluding the current step.

7) Therefore, feedback and action together give us the next state.

The goal of the agent is to learn a policy that maximizes the accumulated reward.
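A minimal sketch of how this task could be wired up as tabular Q-learning (the product catalogue, the simulate_user feedback function and all constants below are illustrative assumptions, not part of the original answer):

import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.2      # discount, learning rate, exploration rate
PRODUCTS = ["p1", "p2", "p3"]               # hypothetical product catalogue
Q = defaultdict(float)                      # Q[(state, action)] -> estimated value

def simulate_user(state, product):
    # Hypothetical environment: the pooled user buys (reward 1) or does not (reward 0).
    return 1 if random.random() < 0.3 else 0

def recommend(state):
    # Epsilon-greedy policy: usually exploit, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(PRODUCTS)
    return max(PRODUCTS, key=lambda p: Q[(state, p)])

state = tuple()                             # no interaction history yet
for step in range(10000):
    action = recommend(state)               # action = product recommended
    reward = simulate_user(state, action)   # reward = 1 only if the user buys
    # Next state = (product, feedback) pairs of the past 5 steps.
    next_state = (state + ((action, reward),))[-5:]
    # Q-learning update towards reward + gamma * max_a' Q(next_state, a').
    best_next = max(Q[(next_state, p)] for p in PRODUCTS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state

In this setup the value function is the learned Q(state, action) table, and the cost function can be taken as the negative of the reward (1 for a purchase, 0 otherwise).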

 
 

5) Identify the suitable learning method for training a robotic arm and explain it.

ans) Industrial robots deployed today across various industries mostly perform repetitive tasks: moving or placing objects along predefined trajectories. The reality is that the ability of robots to handle different or complex environments is very limited in today's manufacturing. The main challenge we have to overcome is designing control algorithms that can easily adapt to new environments.

Reinforcement learning (RL) is a type of Machine Learning where we can teach an agent how to behave in an environment by performing actions and seeing the results.

The concept of Reinforcement Learning has been around for a while, but earlier algorithms were not very adaptable and were incapable of handling continuous tasks.

For RL, we use a framework called the Markov Decision Process (MDP), which provides a simple framework for a really complex problem. An agent (e.g. a robotic arm) first observes the environment it is in and takes actions accordingly. Rewards are given out according to the result.
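Formally, an MDP is usually written as a tuple (S, A, P, R, gamma): the set of states S, the set of actions A, the transition probabilities P(s' | s, a), the reward function R(s, a), and a discount factor gamma that weighs future rewards against immediate ones.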

 
For robotic control, the state is measured using sensors that capture the joint angles, joint velocities, and the end-effector pose.

Policy

The main objective is to find a policy. A policy is something that tells us how to act in a particular state. The objective is to find a policy that makes the most rewarding decisions.

Putting the objective together: we want to find a sequence of actions that maximizes the expected reward or minimizes the cost.
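As a sketch, this objective is often written as maximizing the expected discounted return

    J(policy) = E[ sum over t of gamma^t * r_t ]

where gamma (between 0 and 1) is the discount factor and r_t the reward received at step t; minimizing cost is the same idea with cost taken as negative reward.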

Q-Learning

Q-learning is a model-free reinforcement learning algorithm, which means that it does not require a model of the environment. It is especially effective because it can handle problems with random transitions and rewards without requiring adaptations. The most common Q-learning method consists of these steps:

1. Sample an action (for example, epsilon-greedily from the current Q-values).

2. Observe the reward and the next state.

3. Update the Q-value of the state-action pair using the reward and the highest Q-value of the next state (sketched below).
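One iteration of these three steps could look roughly like this in Python (env_step is a hypothetical callback standing in for the real robot or simulator interface):

import random

def q_learning_step(Q, state, actions, env_step, alpha=0.1, gamma=0.95, epsilon=0.1):
    # 1. Sample an action (epsilon-greedy with respect to the current Q-values).
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q.get((state, a), 0.0))
    # 2. Observe the reward and the next state from the environment.
    next_state, reward = env_step(state, action)
    # 3. Update Q(s, a) towards reward + gamma * max_a' Q(next_state, a').
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return next_state

Repeating this step while gradually lowering epsilon lets the arm shift from exploring its workspace to exploiting the learned policy.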

7. How does the Q function become able to learn with and without complete knowledge of the reward function and state transition function?

Q-learning is a model-free reinforcement learning algorithm, which means that it does not require a model of the environment: the Q function is estimated directly from experience, so complete knowledge of the reward function and state transition function is not needed (although, when that knowledge is available, the Q values can be computed directly from the definition). It is especially effective because it can handle problems with random transitions and rewards without requiring adaptations.

Q-learning is an off-policy learner. This means it learns the value of the optimal policy independently of the agent's actions. On the other hand, an on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps, and it will find a policy that is optimal taking into account the exploration inherent in that policy.
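The contrast can be sketched like this (illustrative only; delta and r are the transition and reward functions of a deterministic problem): with complete knowledge of the model, Q can be computed by sweeping over all state-action pairs; without it, the same quantity is estimated from sampled transitions.

def q_from_known_model(states, actions, delta, r, gamma=0.9, sweeps=100):
    # With complete knowledge: iterate Q(s,a) = r(s,a) + gamma * max_a' Q(delta(s,a), a').
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        for s in states:
            for a in actions:
                s_next = delta(s, a)
                Q[(s, a)] = r(s, a) + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return Q

def q_update_from_sample(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    # Without a model: update from one observed transition (s, a, reward, s_next).
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)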

8. How does setting up a Reinforcement Learning problem require an understanding of the following parameters of the problem?

(a) Delayed reward

(b) Exploring unknown states and actions or exploiting already learned ones

(c) Number of old states that should be considered to decide an action

Ans. (a) Delayed reward:

In the general case of the reinforcement learning problem, the agent's actions determine not only its immediate reward, but also the next state of the environment. The agent must take into account the next state as well as the immediate reward when it decides which action to take. The model of long-run optimality the agent is using determines exactly how it should take the value of the future into account. The agent will have to be able to learn from delayed reinforcement: it may take a long sequence of actions, receiving insignificant reinforcement, then finally arrive at a state with high reinforcement. The agent must be able to learn which of its actions are desirable based on reward that can take place arbitrarily far in the future.
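A common way to make this concrete is the discounted return, where a reward k steps in the future is weighted by gamma^k (standard formulation, shown here with made-up numbers):

GAMMA = 0.9

# Hypothetical episode: nine steps with no reward, then +10 at the end.
rewards = [0] * 9 + [10]

# Discounted return from the first step: sum of gamma^t * r_t.
discounted_return = sum(GAMMA ** t * r for t, r in enumerate(rewards))
print(round(discounted_return, 2))   # 3.87 -- the delayed reward still influences early decisions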

(b) The agent has to explore in order to discover states that potentially yield higher rewards in the future, or exploit the state that yields the highest reward based on its existing knowledge. Pure exploration degrades the agent's learning but increases the flexibility of the agent to adapt in a dynamic environment. On the other hand, pure exploitation drives the agent's learning process to locally optimal solutions.
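A common compromise is an epsilon-greedy rule, sketched below with a hypothetical Q table; epsilon is often decayed over time so the agent explores early and exploits later.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if random.random() < epsilon:
        return random.choice(actions)                            # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploitation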

(c) The state is the current board position, the actions are the different places in which you can place an 'X' or 'O' in a game of Tic Tac Toe, and the reward is +1 or -1 depending on whether you win or lose the game. The "state space" is the total number of possible states in a particular RL setup. Tic Tac Toe has a small enough state space (one reasonable estimate being 593) that we can actually remember a value for each individual state, using a table; this is called a tabular method for that reason. For models like playing chess we use value function approximation, as the total number of possibilities is around 10^49.
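The difference can be sketched as follows (illustrative only): a tabular method keeps one stored value per state, while value function approximation computes a value from state features and learned parameters, so it can generalise to states it has never stored.

# Tabular: one entry per state -- feasible only for small state spaces like Tic Tac Toe.
V_table = {}                                  # V_table[board_as_tuple] = estimated value

def v_tabular(state):
    return V_table.get(state, 0.0)

# Function approximation: value computed from features and weights -- used when the
# state space (e.g. chess) is far too large to enumerate.
weights = [0.0, 0.5, -0.2]                    # hypothetical learned weight vector

def v_approx(features):
    return sum(w * f for w, f in zip(weights, features))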
