
ARTIFICIAL INTELLIGENCE SPRING-20

EXAM SOLUTION

NAME: MUHAMMAD MOIZ MOOSANI


REG NO: 39818
SECTION: MONDAY (6:00 PM – 9:00 PM)
FACULTY: DR. AARIJ MEHMOOD

ANSWER NO: 3
I) FORMULATE THE PROBLEM AS MDP

The temporal-difference (Sarsa) update for the given problem is defined as:


q(s_t, a_t) ← q(s_t, a_t) + α ( r_{t+1} + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t) )

However, the temporal-difference update defined in Q-learning is:


q(s_t, a_t) ← q(s_t, a_t) + α ( r_{t+1} + γ max_a q(s_{t+1}, a) − q(s_t, a_t) )
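
For reference, a minimal sketch of how the two updates could be written in Python; the q-table dictionary, step size alpha, and discount gamma are illustrative names, not part of the original solution:

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    # Sarsa bootstraps on the action actually taken in the next state
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])

def q_learning_update(q, s, a, r, s_next, actions, alpha, gamma):
    # Q-learning bootstraps on the greedy (maximum-valued) next action
    td_target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (td_target - q[(s, a)])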

II) USE POLICY ITERATION TO FIND THE OPTIMAL POLICY

In the given cliff-walking environment, the two algorithms follow different policies during the iteration. When choosing the action a_{t+1} from q(s, a) given s_{t+1}, Sarsa uses an ϵ-greedy policy while Q-learning uses the greedy policy; both, however, choose a_t with an ϵ-greedy policy. In the given problem every transition earns a reward of −1, except when the next state is the cliff, where the agent receives −100. Because Sarsa's target reflects the exploratory actions that occasionally step into the cliff, it is more likely to choose the safe path, while Q-learning, whose target ignores exploration, tends to choose the optimal path even under the ϵ-greedy behaviour policy. Both can reach the optimal policy if the value of ϵ is gradually reduced, as sketched below.
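
A possible ϵ-greedy action selection with a decaying ϵ; the function and variable names here are assumptions for illustration, not taken from the exam code:

import random

def epsilon_greedy(q, state, actions, epsilon):
    # with probability epsilon explore a random action, otherwise exploit
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

# Reducing epsilon over episodes, e.g. epsilon = max(0.01, epsilon * 0.99),
# lets both Sarsa and Q-learning settle on the optimal policy.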

III) CALCULATE THE TD ESTIMATE OF ALL THE STATES IN EPISODE 1

i) -1
ii) -0.1
iii) -0.001
iv) -1
v) -0.1
vi) 1
vii) -0.1
viii) -1
ix) 0.01
x) -0.01
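
For reference, a minimal sketch of the TD(0) state-value update from which such estimates are computed; V, alpha, gamma, and the transition names are placeholders, not the episode data listed above:

def td0_update(V, s, r, s_next, alpha, gamma):
    # TD target: immediate reward plus discounted estimate of the next state
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
    return td_target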

IV) CODE FOR THE GIVEN SCENARIO

def NewPosition(self, action):
    # compute the candidate next position from the chosen action
    if action == "up":
        NewPos = (self.CurrentPos[0] - 1, self.CurrentPos[1])
    elif action == "down":
        NewPos = (self.CurrentPos[0] + 1, self.CurrentPos[1])
    elif action == "left":
        NewPos = (self.CurrentPos[0], self.CurrentPos[1] - 1)
    else:  # "right"
        NewPos = (self.CurrentPos[0], self.CurrentPos[1] + 1)

    # check legitimacy: the move must stay inside the 4 x 12 grid
    if 0 <= NewPos[0] <= 3 and 0 <= NewPos[1] <= 11:
        self.CurrentPos = NewPos

    if self.CurrentPos == self.Goal:
        self.end = True
        print("Successfully reached the Goal")
    if self.yard[self.CurrentPos] == -1:
        self.end = True
        print("Game Over")  # falling off the cliff

    return self.CurrentPos

def RewardFun(self):
    # -1 for reaching the goal or a normal step, -100 for falling off the cliff
    if self.CurrentPos == self.Goal:
        return -1
    if self.yard[self.CurrentPos] == 0:
        return -1
    return -100
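
A possible way to embed the two methods above in an environment class and run one episode; the class name CliffYard, the yard layout, and the random behaviour policy are assumptions for illustration, not part of the original answer:

import random
import numpy as np

class CliffYard:
    # NewPosition(self, action) and RewardFun(self) from above are assumed
    # to be defined inside this class as well.
    def __init__(self):
        self.yard = np.zeros((4, 12))   # 4 x 12 cliff-walking board
        self.yard[3, 1:11] = -1         # cliff cells along the bottom row
        self.CurrentPos = (3, 0)        # start state
        self.Goal = (3, 11)             # goal state
        self.end = False

env = CliffYard()
while not env.end:
    action = random.choice(["up", "down", "left", "right"])
    env.NewPosition(action)
    reward = env.RewardFun()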
