
UNIT-5 PART C 

1) Explain the Q function and the Q-learning algorithm, assuming deterministic rewards and actions, with an example.

ans)
https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
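If the link is unavailable, the core idea can be summarised as follows (a standard textbook formulation, not taken from the linked article): with deterministic rewards and actions, the Q function is defined recursively as

    Q(s, a) = r(s, a) + gamma * max over a' of Q(delta(s, a), a')

where delta(s, a) is the deterministic successor state and gamma the discount factor. The learning algorithm repeatedly executes an action a in state s, observes the immediate reward r(s, a) and the successor state delta(s, a), and updates its table entry for Q(s, a) with the same rule; no learning rate is needed in the deterministic case, and the estimates converge to the true Q values if every state-action pair is visited infinitely often.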

2) Explain the k-nearest neighbor algorithm for approximating a discrete-valued function f : ℝⁿ → V, with pseudocode.
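No answer is filled in above; as a sketch, the k-nearest-neighbor classification rule can be written in Python roughly as follows (plain Euclidean distance and majority voting are assumed):

import math
from collections import Counter

def knn_classify(training_examples, query, k=3):
    # training_examples: list of (feature_vector, label) pairs
    # query: feature vector in R^n whose discrete label in V we want to predict
    def distance(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    # Keep the k training examples closest to the query point.
    nearest = sorted(training_examples, key=lambda ex: distance(ex[0], query))[:k]

    # Majority vote among the labels of the k nearest neighbors.
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical usage:
# data = [((1.0, 2.0), 'A'), ((1.5, 1.8), 'A'), ((5.0, 8.0), 'B')]
# knn_classify(data, (1.2, 1.9), k=3)   # -> 'A'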

 
 
 

 
 
3) Compare unsupervised learning and reinforcement learning with examples.
 
4) Develop a Q-learning task for the recommendation system of an online shopping website. What will be the environment of the system? Write the cost function and value function for the system.

A Q-learning task has these components: Agent, Environment, State, Reward Function, Value Function and Policy.

1) To simplify the problem, we assume a hypothetical user whose experience on the online shopping store is pooled from all the actual users.

2) Our recommender model is the agent of the system: it recommends products to this hypothetical user, who will buy or not buy the recommendation.

3) The user behaves as the system's environment, responding to the system's recommendation depending on the state of the system.

4) User feedback determines our reward: a score of 1 is given only if the user buys, and 0 otherwise.

5) The action of the agent is the product recommended.

6) Our state is defined as the product features and corresponding user reactions of the past 5 steps, excluding the current step.

7) Therefore, feedback and action together give us the next state.

The goal of the agent is to learn a policy that maximizes the accumulated reward.
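A minimal sketch of how this task could be wired up as tabular Q-learning (the product catalogue, the simulate_user feedback function and all constants below are illustrative assumptions, not part of the original answer):

import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.2      # discount, learning rate, exploration rate
PRODUCTS = ["p1", "p2", "p3"]               # hypothetical product catalogue
Q = defaultdict(float)                      # Q[(state, action)] -> estimated value

def simulate_user(state, product):
    # Hypothetical environment: the pooled user buys (reward 1) or does not (reward 0).
    return 1 if random.random() < 0.3 else 0

def recommend(state):
    # Epsilon-greedy policy: usually exploit, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(PRODUCTS)
    return max(PRODUCTS, key=lambda p: Q[(state, p)])

state = tuple()                             # no interaction history yet
for step in range(10000):
    action = recommend(state)               # action = product recommended
    reward = simulate_user(state, action)   # reward = 1 only if the user buys
    # Next state = (product, feedback) pairs of the past 5 steps.
    next_state = (state + ((action, reward),))[-5:]
    # Q-learning update towards reward + gamma * max_a' Q(next_state, a').
    best_next = max(Q[(next_state, p)] for p in PRODUCTS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state

In this setup the value function is the learned Q(state, action) table, and the cost function can be taken as the negative of the reward (1 for a purchase, 0 otherwise).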

 
 

5) Identify the suitable learning method for training a robotic arm and explain it.

ans) Industrial robots deployed today across various industries mostly perform repetitive tasks: moving or placing objects along predefined trajectories. The reality is that the ability of robots to handle different or complex environments is very limited in today's manufacturing. The main challenge we have to overcome is designing control algorithms that can easily adapt to new environments.

Reinforcement learning (RL) is a type of Machine Learning where we can teach an agent how to behave in an environment by performing actions and seeing the results.

The concept of Reinforcement Learning has been around for a while, but earlier algorithms were not very adaptable and were incapable of handling continuous tasks.

For RL, we use a framework called the Markov Decision Process (MDP), which provides a simple framework for a really complex problem. An agent (e.g. a robotic arm) first observes the environment it is in and takes actions accordingly. Rewards are given out according to the result.
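Formally, an MDP is usually written as a tuple (S, A, P, R, gamma): the set of states S, the set of actions A, the transition probabilities P(s' | s, a), the reward function R(s, a), and a discount factor gamma that weighs future rewards against immediate ones.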

 
For robotic control, the state is measured using sensors that capture the joint angles, joint velocities, and the end-effector pose.

Policy

The main objective is to find a policy. A policy is something that tells us how to act in a particular state. The objective is to find a policy that makes the most rewarding decisions.

Putting the objective together: we want to find a sequence of actions that maximizes the expected reward or minimizes the cost.
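As a sketch, this objective is often written as maximizing the expected discounted return

    J(policy) = E[ sum over t of gamma^t * r_t ]

where gamma (between 0 and 1) is the discount factor and r_t the reward received at step t; minimizing cost is the same idea with cost taken as negative reward.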

Q-Learning

Q-learning is a model-free reinforcement learning algorithm, which means that it does not require a model of the environment. It is especially effective because it can handle problems with random transitions and rewards without requiring adaptations. The most common Q-learning method consists of these steps:

1. Sample an action (for example, epsilon-greedily from the current Q-values).

2. Observe the reward and the next state.

3. Update the Q-value of the state-action pair using the reward and the highest Q-value of the next state (sketched below).
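One iteration of these three steps could look roughly like this in Python (env_step is a hypothetical callback standing in for the real robot or simulator interface):

import random

def q_learning_step(Q, state, actions, env_step, alpha=0.1, gamma=0.95, epsilon=0.1):
    # 1. Sample an action (epsilon-greedy with respect to the current Q-values).
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q.get((state, a), 0.0))
    # 2. Observe the reward and the next state from the environment.
    next_state, reward = env_step(state, action)
    # 3. Update Q(s, a) towards reward + gamma * max_a' Q(next_state, a').
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return next_state

Repeating this step while gradually lowering epsilon lets the arm shift from exploring its workspace to exploiting the learned policy.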

7. How does the Q function become able to learn with and without complete knowledge of the reward function and state transition function?

Q-learning is a model-free reinforcement learning algorithm, which means that it does not require a model of the environment: the Q function is estimated directly from experience, so complete knowledge of the reward function and state transition function is not needed (although, when that knowledge is available, the Q values can be computed directly from the definition). It is especially effective because it can handle problems with random transitions and rewards without requiring adaptations.

Q-learning is an off-policy learner. This means it learns the value of the optimal policy independently of the agent's actions. On the other hand, an on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps, and it will find a policy that is optimal taking into account the exploration inherent in that policy.
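The contrast can be sketched like this (illustrative only; delta and r are the transition and reward functions of a deterministic problem): with complete knowledge of the model, Q can be computed by sweeping over all state-action pairs; without it, the same quantity is estimated from sampled transitions.

def q_from_known_model(states, actions, delta, r, gamma=0.9, sweeps=100):
    # With complete knowledge: iterate Q(s,a) = r(s,a) + gamma * max_a' Q(delta(s,a), a').
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        for s in states:
            for a in actions:
                s_next = delta(s, a)
                Q[(s, a)] = r(s, a) + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return Q

def q_update_from_sample(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    # Without a model: update from one observed transition (s, a, reward, s_next).
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)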

8. How does setting up a Reinforcement Learning problem require an understanding of the following parameters of the problem?

(a) Delayed reward

(b) Exploring unknown states and actions or exploiting already learned ones

(c) Number of old states that should be considered to decide an action

Ans. (a) Delayed reward:

In the general case of the reinforcement learning problem, the agent's actions determine not only its immediate reward, but also the next state of the environment. The agent must take into account the next state as well as the immediate reward when it decides which action to take. The model of long-run optimality the agent is using determines exactly how it should take the value of the future into account. The agent will have to be able to learn from delayed reinforcement: it may take a long sequence of actions, receiving insignificant reinforcement, then finally arrive at a state with high reinforcement. The agent must be able to learn which of its actions are desirable based on reward that can take place arbitrarily far in the future.
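A common way to make this concrete is the discounted return, where a reward k steps in the future is weighted by gamma^k (standard formulation, shown here with made-up numbers):

GAMMA = 0.9

# Hypothetical episode: nine steps with no reward, then +10 at the end.
rewards = [0] * 9 + [10]

# Discounted return from the first step: sum of gamma^t * r_t.
discounted_return = sum(GAMMA ** t * r for t, r in enumerate(rewards))
print(round(discounted_return, 2))   # 3.87 -- the delayed reward still influences early decisions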

(b) The agent has to explore in order to discover states that potentially yield higher rewards in the future, or exploit the state that yields the highest reward based on its existing knowledge. Pure exploration degrades the agent's learning but increases the flexibility of the agent to adapt in a dynamic environment. On the other hand, pure exploitation drives the agent's learning process to locally optimal solutions.
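A common compromise is an epsilon-greedy rule, sketched below with a hypothetical Q table; epsilon is often decayed over time so the agent explores early and exploits later.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if random.random() < epsilon:
        return random.choice(actions)                            # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploitation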

(c) The state is the current board position, the actions are the different places in which you can place an 'X' or 'O' in a game of Tic Tac Toe, and the reward is +1 or -1 depending on whether you win or lose the game. The "state space" is the total number of possible states in a particular RL setup. Tic Tac Toe has a small enough state space (one reasonable estimate being 593) that we can actually remember a value for each individual state, using a table; this is called a tabular method for that reason. For models like playing chess we use value function approximation, as the total number of possibilities is around 10^49.
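The difference can be sketched as follows (illustrative only): a tabular method keeps one stored value per state, while value function approximation computes a value from state features and learned parameters, so it can generalise to states it has never stored.

# Tabular: one entry per state -- feasible only for small state spaces like Tic Tac Toe.
V_table = {}                                  # V_table[board_as_tuple] = estimated value

def v_tabular(state):
    return V_table.get(state, 0.0)

# Function approximation: value computed from features and weights -- used when the
# state space (e.g. chess) is far too large to enumerate.
weights = [0.0, 0.5, -0.2]                    # hypothetical learned weight vector

def v_approx(features):
    return sum(w * f for w, f in zip(weights, features))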
