Abstract— Reinforcement learning is a popular machine learning technique whose inherent self-learning ability has made it the candidate of choice for game AI. In this work we propose an expert player by further enhancing our previously proposed basic strategies for Ludo. We then implement a TD(λ)-based Ludo player and use our expert player to train this player. We also implement a Q-learning based Ludo player using the knowledge obtained from building the expert player. Our results show that while our TD(λ) and Q-learning based Ludo players outperform the expert player, they do so only slightly, suggesting that our expert player is a tough opponent. Further improvements to our RL players may lead to the eventual development of a near-optimal player for Ludo.
I. INTRODUCTION
Ludo [1] is a board game played by 2-4 players. Each
player is assigned a specific color (usually Red, Green, Blue
and Yellow), and given four pieces. The game objective is
for players to race around the board by moving their pieces
from start to finish. The winner is the first player who moves
all his pieces to the finish area. Pieces are moved according to die rolls, and they share the same track with the opponents' pieces. Challenges arise when pieces knock out opponent pieces or form blockades. Figure 1 depicts the Ludo game board and describes its different areas and special squares.
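To make the setup concrete, a minimal sketch of a Ludo game state is shown below; the class and field names are illustrative assumptions, not the representation used in this paper.

from dataclasses import dataclass, field
from typing import List
import random

# Illustrative sketch only: the class and field names are assumptions for
# exposition, not the state representation used in this paper.
@dataclass
class LudoState:
    # positions[p][i] is the square index of player p's piece i:
    # -1 means the piece is still in its start area, 99 marks the finish area.
    positions: List[List[int]] = field(
        default_factory=lambda: [[-1] * 4 for _ in range(4)])
    current_player: int = 0

def roll_die() -> int:
    """Roll the standard six-sided die that drives every move."""
    return random.randint(1, 6)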
In our previous work [2], we evaluated the state-space
complexity of Ludo, and proposed and analyzed strategies
based on four basic moves: aggressive, defensive, fast and
random. We also provided an experimental comparison of
pure and mixed versions of these strategies.
Reinforcement learning (RL) is an unsupervised machine learning technique in which the agent has to learn a task through trial and error, in an environment which might be unknown. The environment provides feedback to the agent in terms of numerical rewards [3]. TD(λ) learning is an RL method that is used for prediction by estimating the value function of each state [3]. Q-Learning is an RL method that is more suited for control by estimating the action-value (quality) function, which is used to decide what action is best to take in each state [4]. Another method uses evolutionary algorithms (EA) to solve RL problems, as a policy-space search approach [5].
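As a rough illustration of the difference between the two methods, the tabular forms of their update rules are sketched below. This is a generic sketch for exposition; the players described in this paper use neural-network function approximation rather than lookup tables.

from collections import defaultdict

def td_lambda_update(V, trace, s, s_next, reward, alpha=0.1, gamma=0.95, lam=0.8):
    """TD(lambda) prediction: move V(s) toward the one-step target for every
    recently visited state, weighted by its (accumulating) eligibility trace."""
    delta = reward + gamma * V[s_next] - V[s]   # TD error
    trace[s] += 1.0                             # mark the current state
    for state in list(trace):
        V[state] += alpha * delta * trace[state]
        trace[state] *= gamma * lam             # decay all traces

def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.95):
    """Q-learning control: move Q(s, a) toward the reward plus the value of the
    greedy action in the next state."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

# Example containers: state-value table, eligibility traces, action-value table.
V, trace, Q = defaultdict(float), defaultdict(float), defaultdict(float)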
Utilizing reinforcement learning techniques in board game AI is a popular trend. More specifically, the TD learning method is preferred for the board game genre because it can predict the value of the next board positions. This has been exemplified by the great success of Tesauro's TD-Gammon player for Backgammon [6], which competes with master human players. However, RL is not a silver bullet for all board games, as there are many design considerations that need to be crafted carefully in order to obtain good results [7]. For example, RL applications to Chess [8] and Go [9] yielded modest results compared to skilled human players and other AI implementations.

Fig. 1. Ludo board (showing the red player's start area, the safe squares, and the blue player's start and home squares).

In a broader sense, RL has a growing popularity across video game genres, since it helps cut down development and testing costs while enabling more adaptive and challenging gameplay behaviors. Complexity in video games rises when AI players need to learn multiple skills or strategies (e.g. navigation, combat and item collection), which are usually learned separately before being combined in one hierarchy to help fine-tune the overall performance. RL has shown great success in learning some tasks, such as combat in first-person shooter (FPS) games [10] and overtaking an opponent in a car-racing game [11], while producing less than satisfactory results for other tasks [10]. Evolutionary algorithms approach to RL was used in [12] to

The authors are affiliated with the Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia. Email: {g201001880, alvif, moataz}@kfupm.edu.sa. The authors acknowledge the support of King Fahd University of Petroleum and Minerals for this research, and Hadhramout Establishment for Human Development for support and travel grant.
[Figure: Winning % vs. learning episodes; (a) winning rate against 3 random players, (b) winning rate against 3 expert players.]
Fig. 3. Performance graphs for TD-player trained using 2 TD-players and 2 expert players.
[Figure: Winning % vs. learning episodes; (a) winning rate against 3 random players, (b) winning rate against 3 expert players.]
Fig. 4. Performance graphs for TD-player trained using 2 TD-players and 2 random players.
Hence, the total number of inputs using this representation is 59 × 4 = 236.
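As an illustration of how such an input vector could be assembled, the sketch below assumes each piece is encoded as a one-hot vector over 59 possible positions; the paper's exact per-piece features may differ.

def encode_board(piece_positions, positions_per_piece=59):
    """Concatenate a one-hot vector per piece: 4 pieces x 59 positions = 236 inputs.
    `piece_positions` holds an index in [0, 58] for each of the agent's 4 pieces."""
    inputs = [0.0] * (len(piece_positions) * positions_per_piece)
    for i, pos in enumerate(piece_positions):
        inputs[i * positions_per_piece + pos] = 1.0
    return inputs

# Example: four pieces at positions 0, 12, 40 and 58 give a 236-dimensional vector.
x = encode_board([0, 12, 40, 58])
assert len(x) == 236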
C. Rewards
The rewards were designed to encourage the player to pick moves that achieve a combination of the following objectives (in descending order):
• Win the game.
• Release a piece.
• Defend a vulnerable piece.
• Knock an opponent piece.
• Move pieces closest to home.
• Form a blockade.
On the other hand, the agent is penalized in the following situations:
• Getting one of its pieces knocked in the next turn.
• Losing the game.
The rewards are cumulated in order to encourage the player to achieve more than one objective of higher values.
D. Learning process
We experimented with 3 different settings of player combinations, in order to test the effect of learning opponents:
1) 4 QL-based players (self-play).
2) 2 QL-based players and 2 expert players.
3) 2 QL-based players and 2 random players.
All QL-based players share and update the same neural network for faster learning. Each setting is trained 100000 times, using a neural network of 20 hidden layers. To boost the learning process, we set the learning parameter α = 0.5 and the discount factor γ = 0.95. We used an ε-greedy [18] policy, which selects a random action with probability ε (and the greedy action with probability 1 − ε) to encourage exploration. The value of ε is set to 0.9 and decayed linearly to reach 0 after 30000 learning episodes.
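A minimal sketch of this exploration schedule is given below, under the assumption that ε denotes the exploration probability annealed linearly over the first 30000 episodes; the helper names are illustrative.

import random

def epsilon_at(episode, eps_start=0.9, decay_episodes=30000):
    """Linearly decay epsilon from eps_start to 0 over `decay_episodes` episodes."""
    return max(0.0, eps_start * (1.0 - episode / decay_episodes))

def epsilon_greedy(q_values, episode):
    """Pick a random move with probability epsilon, otherwise the greedy move.
    `q_values` maps each legal move to its estimated action value."""
    eps = epsilon_at(episode)
    if random.random() < eps:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)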
The rewards and penalties are defined as follows:
• 0.25 for releasing a piece.
• 0.2 for defending a vulnerable piece.
• 0.15 for knocking an opponent piece.
• 0.1 for moving the piece closest to home.
• 0.05 for forming a blockade.
• 1.0 for winning the game.
• -0.25 for getting one of its pieces knocked.
• -1 for losing the game.
These values are directly influenced by the knowledge we obtained from building the expert player.
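To show how these shaped rewards might be accumulated for a single move, a sketch is given below; the boolean event flags (and how such events would be detected) are assumptions, not the paper's implementation.

def move_reward(released=False, defended=False, knocked_opponent=False,
                moved_closest_to_home=False, formed_blockade=False,
                got_knocked=False, won=False, lost=False):
    """Sum the shaped rewards and penalties listed above for one move;
    cumulating them rewards moves that achieve several objectives at once."""
    reward = 0.0
    if released:              reward += 0.25
    if defended:              reward += 0.20
    if knocked_opponent:      reward += 0.15
    if moved_closest_to_home: reward += 0.10
    if formed_blockade:       reward += 0.05
    if won:                   reward += 1.0
    if got_knocked:           reward -= 0.25
    if lost:                  reward -= 1.0
    return reward

# Example: releasing a piece while knocking an opponent earns 0.25 + 0.15 = 0.4.
assert abs(move_reward(released=True, knocked_opponent=True) - 0.4) < 1e-9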
[Figure: Winning % vs. learning episodes; (a) winning rate against 3 random players, (b) winning rate against 3 expert players.]
[Figure: Winning % vs. learning episodes; (a) winning rate against 3 random players, (b) winning rate against 3 expert players.]
Fig. 6. Performance graphs for QL-player trained using 2 QL-players and 2 expert players.
E. Experiment design
We used a testing setup similar to section 5.5 for each player combination. Since we are using a subjective representation, the QL players do not suffer from the effect of changing sides during game play, so we opted to record the mean performance only.

F. Results
1) 4 QL-players (self-play):
Figure 5 (a) shows the winning rate of the self-play trained QL-player against 3 random players. We observe a noisy learning curve due to the high value of the learning parameter α. However, the player still manages to learn good game play after 75000 episodes (63 ± 1% wins), and the performance relatively stabilizes afterwards.
Figure 5 (b) illustrates a direct comparison against 3 expert players. The player manages to slightly outperform the 3 expert players at 27 ± 1% wins after 75000 episodes.

2) 2 QL and 2 Expert Players:
Figures 6 (a) and (b) show the winning rate of this player against 3 random players and 3 expert players, respectively. The results did not indicate any advantage when learning against expert players, in terms of winning rate. They show, however, faster learning than self-play. The maximum performance against 3 expert players is still 27 ± 1%.

3) 2 QL-players and 2 Random Players:
Once again, the player managed to slightly outperform the expert player, as seen in Figures 7 (a) and (b). The results did not indicate any disadvantage when learning against random players, in terms of winning rate, but the negative performance spikes increased. The maximum winning rate is also 27 ± 1% against 3 expert players.

G. TD-players against QL-players
We performed one final test for the best 2 TD-players against the best 2 QL-players. The results are summarized in Table III below. We observe that the TD-player outperforms the QL-player by a margin of 5% (27.3% wins for TD vs. 22.3% wins for QL), which suggests that the TD-player is somewhat better than the QL-player.

TABLE III
PERFORMANCE STATISTICS OF TD-PLAYER VS. QL-PLAYER
Players 1 and 2 (both TD-players): 27.37 ± 0.22% wins
Players 3 and 4 (both QL-players): 22.37 ± 0.22% wins

VI. CONCLUSIONS
In this work, we built an expert Ludo player by enhancing the basic strategies we proposed earlier. We used this expert player for the training and evaluation of two RL-based players, which were based on TD-Learning and Q-Learning.
[Figure: Winning % vs. learning episodes; (a) winning rate against 3 random players, (b) winning rate against 3 expert players.]
Fig. 7. Performance graphs for QL-player trained using 2 QL-players and 2 random players.