q3 Exam
EXAM SOLUTION
ANSWER NO: 3
I) FORMULATE THE PROBLEM AS AN MDP
In the given cliff-walking environment the two algorithms differ in the policy used for the update. When choosing the action a_{t+1} from q(s, a) given s_{t+1}, Sarsa uses the ϵ-greedy policy (it is on-policy), while Q-learning uses the greedy policy (it is off-policy). Both, however, select the behavior action a_t ϵ-greedily. In this problem every transition yields a reward of −1, except that stepping into the cliff yields −100. Because Sarsa's updates account for the occasional ϵ-greedy slip into the cliff, it learns the safer path away from the edge, whereas Q-learning learns the optimal path along the cliff edge even though its ϵ-greedy behavior occasionally falls in. Both converge to the optimal policy if the value of ϵ is gradually reduced.
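The distinction above can be sketched as two update rules applied to a shared Q-table. This is a minimal sketch, not code from the exam: the function names, the dictionary-based Q-table, and the hyperparameters are illustrative assumptions.

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # On-policy: bootstrap from the action a2 actually chosen
    # (eps-greedily) in the next state s2.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # Off-policy: bootstrap from the greedy (max-value) action in s2,
    # regardless of which action the behavior policy will actually take.
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```

The only difference between the two updates is the bootstrap target: Sarsa uses the sampled next action, Q-learning uses the maximizing one, which is why Q-learning's value estimates ignore the risk of exploratory steps near the cliff.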
i) -1
ii) -0.1
iii) -0.001
iv) -1
v) -0.1
vi) 1
vii) -0.1
viii) -1
ix) 0.01
x) -0.01
def NewPosition(self, action):
    # Compute the tentative next position from the chosen action.
    if action == "up":
        NewPos = (self.CurrentPos[0] - 1, self.CurrentPos[1])
    elif action == "down":
        NewPos = (self.CurrentPos[0] + 1, self.CurrentPos[1])
    elif action == "left":
        NewPos = (self.CurrentPos[0], self.CurrentPos[1] - 1)
    else:  # "right"
        NewPos = (self.CurrentPos[0], self.CurrentPos[1] + 1)
    # Check legitimacy: the move must stay inside the 4 x 12 grid.
    if 0 <= NewPos[0] <= 3 and 0 <= NewPos[1] <= 11:
        self.CurrentPos = NewPos
    if self.CurrentPos == self.Goal:
        self.end = True
        print("Successfully reached the Goal")
    if self.yard[self.CurrentPos] == -1:  # falling off the cliff
        self.end = True
        print("Game Over")
    return self.CurrentPos
def RewardFun(self):
    if self.CurrentPos == self.Goal:
        return -1
    if self.yard[self.CurrentPos] == 0:
        return -1
    return -100  # stepped into the cliff
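To check the transition and reward logic end to end, the two methods can be embedded in a compact, self-contained environment class. This is a sketch under assumptions: the class name CliffYard, the dictionary-based yard, the start cell (3, 0), the goal (3, 11), and the cliff occupying row 3, columns 1 through 10 follow the standard cliff-walking layout and are not given in the exam.

```python
class CliffYard:
    """Minimal 4 x 12 cliff-walking grid; cliff cells are marked -1 in yard."""

    def __init__(self):
        # 0 = ordinary cell, -1 = cliff (assumed layout: bottom row, cols 1-10).
        self.yard = {(r, c): 0 for r in range(4) for c in range(12)}
        for c in range(1, 11):
            self.yard[(3, c)] = -1
        self.CurrentPos = (3, 0)   # start in the bottom-left corner
        self.Goal = (3, 11)        # goal in the bottom-right corner
        self.end = False

    def NewPosition(self, action):
        # Same transition rule as above, written with a move table.
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        dr, dc = moves[action]
        new = (self.CurrentPos[0] + dr, self.CurrentPos[1] + dc)
        if 0 <= new[0] <= 3 and 0 <= new[1] <= 11:  # stay inside the grid
            self.CurrentPos = new
        if self.CurrentPos == self.Goal or self.yard[self.CurrentPos] == -1:
            self.end = True  # episode ends at the goal or in the cliff
        return self.CurrentPos

    def RewardFun(self):
        # -100 for stepping into the cliff, -1 for every other transition.
        return -100 if self.yard[self.CurrentPos] == -1 else -1
```

Walking the cliff-edge path (up, then right eleven times, then down) takes 13 steps at −1 each, for a return of −13, which is exactly the optimal return Q-learning converges to in this environment.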