
Title: Artificial Intelligence Individual Exercise Report

Name: Zeyu Tan    ID: 2467100T

Introduction
 
● Data analysis
The problem to be solved is a grid-world with a start position (S), some obstacles (H) and a final goal (G). There are 16 problems: eight of them are 8x8 grids and the others are 4x4 grids. The training time of an algorithm on the 8x8 grids is expected to be much longer, and its performance is not guaranteed, given that an 8x8 grid has four times as many states as a 4x4 grid.
 
● PEAS analysis
When we design an agent to tackle a problem, we need to convert the description of the application domain into a PEAS description.
○ Performance Measure: average reward, average number of steps to solve the problem, success rate
○ Environment: start position, obstacles, final goal, grid matrix
○ Actuators: direction of the next move
However, the three agents have different sensors, as listed below.
○ Sensors of the senseless / random agent: none
○ Sensors of the simple (A-star) agent: an oracle with complete information, including the locations of the start point, all obstacles and the final goal
○ Sensors of the reinforcement learning agent: perfect information about the current state only, with no prior knowledge of the state space

Methods and Design


In this part, we focus on the methods I implemented and the third-party tools I used in this experiment.

● OpenAI Gym
OpenAI Gym is a toolkit of environments used to test, develop and compare reinforcement learning algorithms. A model learns by interacting with the environment and receiving rewards for completing a specific target.
○ env.desc
Here, desc is short for "description": this attribute returns the whole map of the problem, which is helpful when searching for the best path in the A-star agent.
○ env.action_space.sample()
As its name suggests, this function samples a new action from the action space. In this environment, it randomly returns an integer from zero to three, representing left, down, right and up respectively.
○ env.reset()
This function resets the environment: the agent returns to the start position, which gives the Q-learning agent the chance to practise again and again.
○ env.step(action)
env.step is the only way we can interact with the environment. In this environment, it moves one step only partly according to the "action" we give it: judging from the source code, the probability that the requested move is actually made is around 33.3%. In my view, the main purpose of this rule is to force a reinforcement learning model to learn valuable information from noisy data, which is common in daily life.
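As a minimal sketch of how these calls fit together, assuming the classic Gym API (reset() returns the state and step() returns four values) and the FrozenLake8x8-v0 environment id; newer Gym/Gymnasium versions behave differently:

import gym

# Build the 8x8 FrozenLake environment (classic Gym API assumed).
env = gym.make("FrozenLake8x8-v0")

print(env.desc)              # the whole grid: S (start), F (frozen), H (hole), G (goal)
print(env.action_space)      # Discrete(4): 0 = left, 1 = down, 2 = right, 3 = up

state = env.reset()                                  # back to the start position
action = env.action_space.sample()                   # a random action in {0, 1, 2, 3}
next_state, reward, done, info = env.step(action)    # the requested move happens only ~33% of the time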

● Random Agent
The random agent is quite simple: it uses env.action_space.sample() to randomly pick the next direction to move, and its sensors always perceive nothing. It can therefore be seen as the worst version of a Q-learning agent, one with the learning rate set to zero, which makes it impossible to learn anything from interacting with the environment.
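Under the same assumptions about the classic Gym API, a short evaluation loop for the random agent might look like the sketch below; the episode and step counts are only illustrative, not the exact harness used in the report.

import gym

def run_random_agent(env_name="FrozenLake8x8-v0", episodes=1000, max_steps=10000):
    """With the default rewards, the average reward equals the success rate,
    because the only non-zero reward (1.0) is given on reaching the goal."""
    env = gym.make(env_name)
    total_reward = 0.0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = env.action_space.sample()       # no sensors: the state is ignored entirely
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
    return total_reward / episodes

print(run_random_agent())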

● A-star Agent
While the random agent is the worst version of a Q-learning agent, the A-star agent is the best version: it has oracle sensors and noise-free actions. In this ideal circumstance it has all the information about the environment and finds the optimal path to the goal. Moreover, A-star can be used to evaluate the performance of Q-learning.
The main idea behind the A-star agent is similar to Dijkstra's algorithm. With a heuristic, its efficiency rises while it still guarantees to find the best path, based on the evaluation function below. The F, G and H functions respectively compute the estimated total cost from the start position to the goal through the current node, the actual cost from the start position to the current state, and the estimated cost from the current state to the target. There are many choices for the H function, including Euclidean distance and Manhattan distance; I use the former.
F(n) = G(n) + H(n)
At the beginning, the algorithm sets up two lists, open_list and close_list, to store the positions that have not yet been examined and those already examined respectively, and puts the start point into open_list with an F-score of zero. Then the point in open_list with the lowest F-score is taken out and its neighbouring points are evaluated. The procedure repeats until it reaches the goal position.
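A compact sketch of the search described above, assuming the grid is given as a list of strings (rows of 'S', 'F', 'H', 'G') and using the Euclidean heuristic; the function name and structure are mine, with open_list / close_list following the text.

import heapq
import math

def a_star(grid, start=None, goal=None):
    """A* over a FrozenLake-style grid given as a list of strings, e.g. ["SFFF", "FHFH", ...]."""
    rows, cols = len(grid), len(grid[0])
    def find(ch):
        return next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == ch)
    start, goal = start or find("S"), goal or find("G")
    def h(p):                                          # Euclidean distance heuristic H(n)
        return math.hypot(p[0] - goal[0], p[1] - goal[1])

    open_list = [(h(start), 0.0, start, [start])]      # entries: (F-score, G-score, cell, path so far)
    close_list = set()
    while open_list:
        f, g, cell, path = heapq.heappop(open_list)    # take the cell with the lowest F-score
        if cell == goal:
            return path
        if cell in close_list:
            continue
        close_list.add(cell)
        for dr, dc in ((0, -1), (1, 0), (0, 1), (-1, 0)):          # left, down, right, up
            r, c = cell[0] + dr, cell[1] + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r][c] != "H" and (r, c) not in close_list:
                heapq.heappush(open_list, (g + 1 + h((r, c)), g + 1, (r, c), path + [(r, c)]))
    return None                                        # no path to the goal

print(a_star(["SFFF", "FHFH", "FFFH", "HFFG"]))        # the standard 4x4 FrozenLake map, for illustration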

● Q-learning Agent
Q-Learning is a value-based reinforcement learning method applied to find the optimal path from the initial state to the goal state. Unlike the A-star agent, the Q-Learning agent only has complete information about the current state and acts on the knowledge learned from previous experience. Its crucial idea is to maintain and update, over the training episodes, a Q table recording the expected reward of each action in each state. For example, in the 8x8 environment the table has shape 64 (states) x 4 (actions). Also known as the action-utility function, it estimates the consequence of each possible action in a particular state. The goal of Q-Learning is therefore to learn a Q table of expected long-term rewards, which finally leads to a better solution.
To update the Q table, the agent uses the Bellman equation below, which takes two arguments: state and action. In this formula, the reward is 0 for the frozen surface and the starting point, -0.01 for a hole and 1.0 for the goal. The learning rate determines how much the algorithm relies on past experience, while the discount rate controls the relative importance of possible future returns versus the current reward: a higher discount rate makes expected future returns more important.
Q(state, action) = Q(state, action) + learning_rate * (reward + discount_rate * max_a Q(next_state, a) - Q(state, action))
Another critical element of the Q-Learning model is exploration, which sets the threshold between moving in a random direction and acting according to the Q table. The equation for the exploration rate is given below. Given that the actions can be extremely noisy, the values in the Q table are possibly misleading at the start; if the algorithm always acted exactly as the Q table indicates, it would probably never find the optimal strategy. So we add an exploration threshold to encourage Q-Learning to explore and learn a more rewarding strategy.
exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * exp(-exploration_decay_rate * episode)
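Putting the two formulas together, a minimal Q-Learning training loop might look like the sketch below, again assuming the classic Gym API; the hyperparameter values mirror the defaults listed in the next section, and the way the hole penalty is injected is my own assumption rather than the report's exact implementation.

import math
import random

import gym
import numpy as np

env = gym.make("FrozenLake8x8-v0")
q_table = np.zeros((env.observation_space.n, env.action_space.n))   # 64 states x 4 actions

learning_rate, discount_rate, reward_hole = 0.2, 0.7, -0.01
max_expl, min_expl, expl_decay = 1.0, 0.15, 0.0005

for episode in range(200_000):
    exploration_rate = min_expl + (max_expl - min_expl) * math.exp(-expl_decay * episode)
    state = env.reset()
    for _ in range(10_000):                          # max number of steps per episode
        # Epsilon-greedy: random move with probability exploration_rate, otherwise follow the Q table.
        if random.random() < exploration_rate:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done, _ = env.step(action)
        if done and reward == 0.0:
            reward = reward_hole                     # hole penalty; how it was wired in is my assumption
        # Q-learning (Bellman) update from the formula above.
        q_table[state, action] += learning_rate * (
            reward + discount_rate * np.max(q_table[next_state]) - q_table[state, action])
        state = next_state
        if done:
            break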

Experiment Design and Result

● Experimental Environment
In this experiment, most of the code was written and tested on macOS 10.15.1. But because training a reinforcement learning model can be truly time-consuming, some experiments were run on a virtual machine (Ubuntu 16.04.4) provided by Google Cloud Platform.
In the experiments below, the random seed is set to 12. Unless mentioned otherwise, the default parameters of the Q-Learning agent are as follows.
Learning rate: 0.2
Discount factor: 0.7
Max number of training episodes: 200,000
Max number of test episodes: 10,000
Max number of steps each episode: 10,000
Reward of hole: -0.01
Max exploration rate: 1
Min exploration rate: 0.15
Exploration decay rate: 0.0005
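For reference, the same defaults can be collected into a single Python dictionary; the key names are mine and purely illustrative of how such a configuration could be passed to a training function.

# Default Q-Learning hyperparameters from the list above (the key names are illustrative).
DEFAULT_PARAMS = {
    "random_seed": 12,
    "learning_rate": 0.2,
    "discount_factor": 0.7,
    "max_training_episodes": 200_000,
    "max_test_episodes": 10_000,
    "max_steps_per_episode": 10_000,
    "reward_hole": -0.01,
    "max_exploration_rate": 1.0,
    "min_exploration_rate": 0.15,
    "exploration_decay_rate": 0.0005,
}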

● Random agent
In the evaluation of the random agent, we only consider the average reward in relation to the grid size. This agent can be regarded as a Q-Learning agent without the ability to learn and remember, and therefore serves as a baseline against which to assess the Q-Learning agent's performance.

○ 4x4 base
○ 8x8 base

● Simple agent
Here is the optimal path for problem 0 found by the simple agent, which looks quite promising. In all 10,000 test episodes it obtains a full score, and the average number of steps is 4.25 in the 4x4 environment and 10.1 in the 8x8 environment.

○ 4x4 base

○ 8x8 base

○ Simple agent in a slippery surface


However, the result above is not realistic: in the real world the data is noisy and the actions taken by a machine can go wrong. Hence, to allow a fair comparison with the Q-Learning agent, I set 'is_stochastic' to True and re-evaluated this agent. The frozen surface then becomes slippery, which means there is roughly a 66.6% probability of moving in a direction other than the intended one, so the optimal path has to be re-calculated at every step.
With this high probability of moving in the wrong direction, the simple agent cannot reach the goal in most of the 5,000 episodes. This result indicates that instead of a "great" solution, we may need a "good" one.
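A rough sketch of this per-step replanning, reusing the a_star helper sketched in the A-star Agent section (which accepts an explicit start and goal cell) and assuming the classic Gym API; the conversion of env.desc into a list of strings is an assumption about its byte-array layout.

import gym

# FrozenLake action codes: 0 = left, 1 = down, 2 = right, 3 = up.
MOVE_TO_ACTION = {(0, -1): 0, (1, 0): 1, (0, 1): 2, (-1, 0): 3}

def replanning_episode(env, grid, goal):
    """One episode in which A* is re-run from the current cell after every (possibly slipped) step."""
    ncol = len(grid[0])
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        here = divmod(state, ncol)                     # flat state index -> (row, col)
        path = a_star(grid, start=here, goal=goal)     # re-plan from wherever the agent actually is
        if path is None or len(path) < 2:
            break
        move = (path[1][0] - here[0], path[1][1] - here[1])
        state, reward, done, _ = env.step(MOVE_TO_ACTION[move])   # the chosen move may still slip
        total_reward += reward
    return total_reward

env = gym.make("FrozenLake8x8-v0")                     # the surface is slippery by default
grid = ["".join(ch.decode() for ch in row) for row in env.desc]
goal = next((r, c) for r, row in enumerate(grid) for c, ch in enumerate(row) if ch == "G")
print(replanning_episode(env, grid, goal))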

● Q-Learning agent
Training a Q-Learning agent takes a significant amount of time. Accordingly, during the parameter-tuning procedure I only evaluated the model on four (id = [0, 1, 2, 3]) 8x8 environments.
○ Learning rate
From the figure above, we can see that when the learning rate is set to 0.2 or 0.6 the agent performs reliably, but the former generalises better. Given that the learning rate determines how strongly new experience overrides what has already been learned, too high a learning rate leads to "amnesia" and poor performance.

○ Discount factor
The model with the discount factor equal to 0.8 achieved the best performance on problem 2, but with a discount factor of 0.7 the model is more robust across the other problems. The agents with the discount factor set to 0.5 have the worst performance; in my view, this is because they do not give enough weight to expected long-term returns.

○ Reward hole
When the hole reward is set to -0.01, the agent obtained the highest score on these four problems. Personally, I think a higher ratio of the hole reward to the goal reward makes the RL agent more likely to avoid approaching hole positions, but if the ratio is too large, it may force the agent to move away from goal positions that are near holes.

○ Result (4x4 base)


○ Result (8x8 base)

The result is far better than that of the simple agent in the slippery environment, which supports the claim that we need a "good" solution rather than the "best" solution. The agent not only has to learn the best path but also the "technique" of avoiding getting stuck in an ice hole.
● Agent Comparison
○ Agent Performance for RL agents with different training episodes

From the graphs above, it is a little challenging to say which number of training episodes is better, but training for 200,000 episodes did not make the agent overfit. We may need more training time to evaluate this properly.

○ Agent Performance for RL agents with different max number of iterations per
episode
When the maximum number of steps per episode is set to 1,000 or 10,000, the agent performs better. From my perspective, this is because the exploration decay rate is not too large; with a larger number of steps per episode, we may want our agent to explore more states at the start.

○ Number of steps/actions required (average, worst case, best case)


From the graph above, we can see that in some cases, even though the agent moved back and forth because of the slippery surface, it finally reached the goal, which demonstrates its robustness.

Discussion

● Experiment time
The whole experiment took about two to three days of training time, using two virtual machines and my laptop. Training a reinforcement learning model is genuinely time-consuming, and I probably spent too much time training models with the same parameters. A clean coding style and a well-planned experiment design can save a great deal of time, which is the most important lesson I learned from this experiment. In most cases, testing and evaluating a model on a smaller dataset first is the right choice.

● Q-Learning vs A-star
Although A-star has all the information about the environment, with noisy actions its performance becomes much worse than that of the Q-Learning agent. This is an excellent advantage of the Q-Learning agent: the ability to cope with noisy actions and data. Moreover, changing the H function of the A-star algorithm to another choice, such as Manhattan distance, may improve its score.
