
Graph-Based Skill Acquisition For Reinforcement Learning

MATHEUS R. F. MENDONÇA, ARTUR ZIVIANI, and ANDRÉ M. S. BARRETO, National Laboratory for Scientific Computing (LNCC), Brazil

In machine learning, Reinforcement Learning (RL) is an important tool for creating intelligent agents that
learn solely through experience. One particular subarea within the RL domain that has received great atten-
tion is how to define macro-actions, which are temporal abstractions composed of a sequence of primitive
actions. This subarea, loosely called skill acquisition, has been under development for several years and has
led to better results in a diversity of RL problems. Among the many skill acquisition approaches, graph-based
methods have received considerable attention. This survey presents an overview of graph-based skill acqui-
sition methods for RL. We cover a diversity of these approaches and discuss how they evolved throughout
the years. Finally, we also discuss the current challenges and open issues in the area of graph-based skill
acquisition for RL.
CCS Concepts: • Computing methodologies → Reinforcement learning; Cluster analysis; • Mathemat-
ics of computing → Spectra of graphs; Graph algorithms;
Additional Key Words and Phrases: Skill acquisition, reinforcement learning, graph analytics, centrality,
clustering
ACM Reference format:
Matheus R. F. Mendonça, Artur Ziviani, and André M. S. Barreto. 2019. Graph-Based Skill Acquisition For
Reinforcement Learning. ACM Comput. Surv. 52, 1, Article 6 (February 2019), 26 pages.
https://doi.org/10.1145/3291045

1 INTRODUCTION
Reinforcement Learning (RL) is a machine-learning approach in which an agent learns to accom-
plish a given task by simply interacting with the environment through a set of actions and receiving
reward signals. This process allows the RL agent to learn how to interact with the environment
such that it maximizes the expected discounted sum of rewards received. Different challenges arise
in this context, such as delayed rewards and problems with a very large or an infinite number of
possible states to analyze, just to name a few. The former challenge mentioned is related to scenar-
ios where the agent receives reward signals several steps after a key action is performed, making
it difficult for RL algorithms to associate such rewards with the corresponding action. The latter
challenge refers to problems with too many or even an infinite number of states (e.g., continuous
state spaces), making it infeasible to use exact representations.

M. R. F. Mendonça, A. Ziviani and A. M. S. Barreto thank CNPq for its support (grants 141.761/2016-4, 308.729/2015-3 and
461.739/2014-3, respectively) and also acknowledge the INCT in Data Sciences – INCT-CiD (CNPq no. 465.560/2014-8). A.
M. S. Barreto is currently with DeepMind, London, UK.
Authors’ addresses: M. R. F. Mendonça, A. Ziviani, and A. M. S. Barreto, National Laboratory for Scientific Computing
(LNCC), Petrópolis, RJ, Brazil; emails: {mrfm, ziviani, amsb}@lncc.br.
ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national
government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to
allow others to do so, for Government purposes only.
© 2019 Association for Computing Machinery.
0360-0300/2019/02-ART6 $15.00
https://doi.org/10.1145/3291045


Fig. 1. Skill chaining example based on a figure from Konidaris and Barto [32]. The goal is represented by the large circle pointed to by the arrows. In the left figure, the agent has not learned any skills yet. In the middle figure, the agent has learned two different skills that are able to reach the goal. Finally, the right figure shows several skills chained together that allow the agent to reach the goal.

It is believed that temporal abstraction may help tackle some of these issues [59, 60, 67]. Tem-
poral abstractions encapsulate a sequence of actions into a single macro-action, allowing the RL
agent to act at different time scales. A macro-action can also be interpreted as a high-level action
that achieves a subgoal within an RL problem. In this survey, we focus on temporal abstraction
approaches combined with RL and Deep Reinforcement Learning (DRL). More specifically, we fo-
cus on the automatic definition of macro-actions, also known as automatic temporal abstractions
or automatic skill acquisition.
The particular focus on automatic skill acquisition is justified given its increasing importance
in RL, especially considering some recent advances suggesting that it can indeed be beneficial in
practice [1, 37, 41, 71]. There are different formalisms for defining macro-actions [18, 25, 57, 67] and
several different methods that automatically define skills [1, 12, 37], although how to automatically
define such skills is still considered an open problem.
For example, Konidaris and Barto [32] present a method for automatically defining skills in
continuous RL domains based on the Options framework [67] by creating a chain of skills. The first
created skill is used to reach the goal of the RL problem. It is constructed using a pseudo-reward function that rewards the agent for reaching the goal, and this skill can only be initiated from the region of states from which triggering it leads the agent to the goal. The next skill constructed uses as its goal the state region where the previously created skill can be initiated. This process, illustrated in Figure 1, is repeated until a skill is assigned to the state region that contains the starting state.
Another recent skill acquisition method, presented by Vezhnevets et al. [70], uses an action-plan matrix P_t ∈ R^{|A|×T} to represent macro-actions, where A is the set of possible actions and T is the macro-action duration. Column i of the action-plan matrix P_t thus indicates the probability of taking each action a ∈ A at time step t + i. The action-plan matrix is updated through Attentive Planning [24], a technique originally created for image generation.
Some of the pioneering work related to automatic skill acquisition has adopted graph-based
metrics and clustering algorithms over the state transition graph (responsible for representing
the environment dynamics) of an RL problem in order to create high-level actions. These graph-
based approaches achieved good performance in the past [12, 13, 42] and continue to outperform
current state-of-the-art methods [37]. This survey covers graph-based skill acquisition methods for
reinforcement learning. We present how these graph-based methods evolved over time, discussing
different approaches for continuous or discrete state spaces.
Fig. 2. Graph-based skill acquisition methods covered in this survey.

Graph-based skill acquisition methods can be classified into three classes depending on the approach they adopt to identify important states or macro-actions in the transition graph (Figure 2): (1) centrality measures, (2) clustering algorithms, and (3) spectral analysis. The first approach uses
different centrality measures over the state transition graph in order to identify important states (or
state regions) and then create macro-actions that allow the agent to move to these states. Second,
clustering approaches use several different clustering algorithms over the state transition graph
in order to identify states that share a specific similarity. Macro-actions are then created in such a
way that they allow the agent to transition between state clusters. The third approach uses spec-
tral analysis in order to identify macro-actions with their corresponding intrapolicy. There are
also some mixed approaches that use spectral clustering for finding macro-actions, showing that
these three categories can actually intersect. These mixed approaches are discussed in more detail
in Section 6.
To the best of our knowledge, there is no recent survey that thoroughly explores skill acqui-
sition methods, in particular graph-based ones. The only previous survey we know of, by Chiu
and Soo [10], is now outdated. In recent years, graph-based methods applied to different domains
are also receiving ever-increasing attention, aligned with the rise and consolidation of the in-
terdisciplinary field of network science [3, 7]. Indeed, the relation between RL and graph-based
techniques derived from network science emerges as a timely and relevant topic in the current
machine-learning landscape. In this context, this survey aims to be a starting point for the inter-
ested reader who desires to get familiarized with recent graph-based skill acquisition methods for
reinforcement learning.
This survey is organized as follows. Section 2 introduces the basic concepts of Reinforcement
Learning, followed by a description of the skill acquisition problem. This section also describes how
the state transition graph is created in such skill acquisition methods, presenting two of the most
common approaches for creating these graphs. Some basic concepts of graph analysis methods
are then introduced in Section 3, where we describe some key centrality measures and clustering
algorithms that are often used in graph-based skill acquisition methods throughout the literature.
Section 4 presents the most common benchmarks used by graph-based skill acquisition methods.
We then go to the core of this survey: (1) Section 5 describes how to create macro-actions using
centrality measures over the state transition graph, (2) Section 6 presents approaches based on
clustering algorithms, and (3) Section 7 details approaches based on spectral analysis and other
graph-based approaches that substantially differ from the classes presented in Figure 2. Section 8
presents some of the challenges and open issues in the area of graph-based skill acquisition. Finally,
in Section 9, we summarize the article along with some final considerations.


2 REINFORCEMENT LEARNING (RL)


As previously stated, RL is a framework where an agent learns how to interact with a given en-
vironment through experience. The learning process takes place over discrete steps. At each time
step t, the agent (1) gets an observation ot from the environment and determines its state st , (2) in-
teracts with the environment by performing an action at , and (3) receives a reward signal R t . This
process is repeated until a termination state is reached, restarting the process. The discounted sum
of rewards received by the agent during an episode, also called the return, is given by

G_t = Σ_{i=0}^{∞} γ^i R_{t+i+1},    (1)
where γ is the discount factor, used to reduce the value of delayed rewards. The objective is to find a policy π that maps states to actions such that it maximizes the expected return E_π[G_t | s_t].
The RL process is often modeled using a Markov Decision Process (MDP), defined as (S, A, p, r, γ), where S is the set of states; A is the set of actions the agent can perform; p is the state transition distribution function p(s'|s, a) for a ∈ A, s' ∈ S, and s ∈ S; and r is the reward function given by r(s, a, s') that corresponds to the reward received by the agent when performing action a ∈ A while in state s ∈ S and transitioning to state s' ∈ S.
As previously mentioned, the objective is to learn a policy π that selects an action to perform
such that the expected return is maximized. The expected future rewards for performing action a
while in state s following a policy π is given by the action-value function, defined as
Q^π(s, a) = E_π[r(s, a, S') + γ Q^π(S', π(S'))],    (2)

where the expectation is over the next state S', which depends on π and on the dynamics of the
MDP. Instead of calculating the action-value function Q π (s, a), several RL methods use a value
function V π (s) that indicates the expected sum of future rewards for following policy π starting
from state s.
After learning the action-value function Q^π(s, a), we create a new policy π' that selects actions greedily according to Q^π(s, a), that is, π'(s) ∈ argmax_a Q^π(s, a). In the exact case, the policy is updated until it converges to an optimal policy π*, such that Q^{π*}(s, a) ≥ Q^π(s, a) for all π ≠ π*.
Q-Learning [73] is a very popular method to approximate the action-value function, which up-
dates Q^π(s, a) according to

Q^π(s, a) = Q^π(s, a) + α[r(s, a, s') + γ max_{a'∈A} Q^π(s', a') − Q^π(s, a)],    (3)
where α ∈ (0, 1] is the learning rate.
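To make the update in Equation (3) concrete, the following minimal sketch implements tabular Q-Learning with ε-greedy exploration. The environment interface (reset(), step(action), env.actions) and the parameter names are illustrative assumptions made for this example, not part of any specific library.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning following the update rule in Equation (3).

    `env` is assumed to expose a minimal interface: reset() -> state,
    step(action) -> (next_state, reward, done), and a discrete action
    list `env.actions`.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> current value estimate

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration over primitive actions
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Equation (3): move Q(s, a) toward the bootstrapped target
            target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```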
The interested reader is encouraged to refer to the book of Sutton and Barto [66] for a more
in-depth explanation of the basic RL framework and classical RL methods. In the following subsec-
tions, we introduce the required background for understanding the graph-based skill acquisition
methods that are surveyed in this article.

2.1 Deep Reinforcement Learning


The action-value function Q (s, a) must map every state-action pair (s, a) to a real value. For small
problems, Q (s, a) can be represented by a table with |S| rows and |A| columns. However, for
complex problems, the number of states |S| is usually too large to be efficiently manipulated and
analyzed using a lookup table. In this case, the exact representation of Q has to be replaced by an
approximation. In recent years, Neural Networks (NNs) have become a popular choice of approx-
imator in RL, in particular, the so-called Deep NNs.
Deep NNs are models with several layers and with different architectures. The most notable DL
models are Convolutional Neural Networks (CNN) [23, 34, 68] and Recurrent Neural Networks


(RNN) [29, 49, 63]. CNNs are typically used to learn image patterns and therefore are commonly
adopted for image classification or image segmentation problems. On the other hand, RNNs are
capable of identifying temporal patterns in the training data, allowing the network to predict the
next input after observing a sequence of inputs. These DL methods, such as CNNs and RNNs,
have recently replaced the simple NN models used to predict Q (s, a), thus giving rise to Deep
Reinforcement Learning methods [50, 51].
Deep Reinforcement Learning methods are capable of dealing with large state spaces as well
as learning complex behaviors. For instance, these methods have already been used for learning
how to play games from the Atari 2600 console at an expert level by only considering the current
game image (pixel vector) and the eventual reward offered by the game, a feat that could not be
accomplished by standard RL methods. This achievement opened many potential perspectives for
new applications based on machine learning. This is also the main reason for the ever-increasing
interest we observe nowadays in Deep Learning in general, and in Deep Reinforcement Learning
in particular.

2.2 Skill Acquisition


The action set of an RL problem may be composed of two types of actions: (1) primitive actions and
(2) macro-actions. The former type is composed of atomic actions that take only one time step to
be completed, while macro-actions are actions composed of a sequence of primitive actions, with
varied lengths. In other words, a macro-action represents a sequence of primitive actions that are
chosen based on a separate intrapolicy that aims at achieving a specific subgoal. Therefore, the
Skill Acquisition problem is concerned with the definition of methods capable of defining macro-
actions in RL problems using primitive actions.
There are several frameworks for constructing macro-actions, such as options [59, 67] and
HEXQ [25], among others. We will focus on the options framework, since most of the graph-based
skill acquisition approaches found during our survey use it to represent macro-actions.
Options are temporally extended actions composed of a sequence of primitive actions executed
according to a closed policy. The agent can only execute an option when it is within a set of states
called the initiation set, while it can only terminate an option in another set of states called the
termination set. Options are then considered as an action belonging to the action set and are ruled
by the policy-over-options π . An option o is formally defined by the tuple < I , πo , β >, where:

• I is the initiation set, which indicates the set of states from which the agent can initiate the
execution of option o. The definition of this set depends on the skill acquisition method and
will be explored further in this survey.
• πo is the policy for option o, also called intraoption. A common approach to define πo is
to use a special reward function (usually called pseudo-reward function) when executing
an option, such as rewarding or punishing the agent when it achieves certain states. This
reward function is only valid when an option is currently in execution.
• β is the termination condition, which is a probability function β (s) that indicates the prob-
ability of option o terminating in state s. A common approach is to use a termination prob-
ability of 1 for important states and 0 for states inside the scope of the given option. The
termination condition depends on the problem being solved and the skill acquisition method
used, which will be better explained further in this survey.

The update of the action-value function of options differs from the update used for primitive
actions. For example, while the update of primitive actions is done by the standard Q-Learning
algorithm, the update for options can be performed using the Macro-Q-Learning method [47], given


by the following rule:

Q(s_t, o_t) = Q(s_t, o_t) + α[r_o + γ^k max_{a'∈O} Q(S_{t+k}, a') − Q(s_t, o_t)],    (4)

where Q(s_t, o_t) is the action-value function for option o_t executed in state s_t at time step t, k is the number of steps that option o_t has executed so far, O is the set of all possible actions that can be carried out (primitive actions and options alike), and r_o is the cumulative discounted reward received while executing option o_t during the past k steps, given by r_o = Σ_{i=0}^{k} γ^i r_{t+i+1}.
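As a rough illustration of the Macro-Q-Learning update in Equation (4), the sketch below performs a single update after an option terminates. The dictionary-based value table and the function signatures are assumptions made for this example and do not come from the surveyed papers.

```python
def option_return(rewards, gamma=0.99):
    """Cumulative discounted reward r_o collected while the option executed:
    the discounted sum of the primitive rewards observed during its run."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def macro_q_update(Q, s_t, o_t, r_o, k, s_tk, all_actions, alpha=0.1, gamma=0.99):
    """One Macro-Q-Learning update (Equation (4)).

    Q           : dict mapping (state, primitive action or option) -> value
    s_t, o_t    : state and option at the moment the option was initiated
    r_o         : cumulative discounted reward collected while executing o_t
    k           : number of primitive steps the option ran for
    s_tk        : state observed when the option terminated (s_{t+k})
    all_actions : primitive actions and options available in s_tk
    """
    best_next = max(Q.get((s_tk, a), 0.0) for a in all_actions)
    q_old = Q.get((s_t, o_t), 0.0)
    Q[(s_t, o_t)] = q_old + alpha * (r_o + (gamma ** k) * best_next - q_old)
```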
When a new option is created (based on whichever method used to discover such options), it is
necessary to learn its intrapolicy πo before effectively using it in the RL problem. Otherwise, the
agent will not be able to properly execute the new option. A common approach used is to learn the
intraoption through the use of the Experience Replay technique [35], where a sequence of actions
and states previously experienced by the agent that is within the given option’s initiation set is
used to learn πo . Using a sequence of states previously observed, coupled with a pseudo-reward
function, allows the policy πo to be quickly learned, allowing the insertion of the newly formed
option into the action set of the agent.
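The sketch below illustrates one way the intrapolicy of a newly created option could be learned from replayed experience with a pseudo-reward function, in the spirit described above. The initiation set, pseudo-reward function, and action set are passed in explicitly; all of these names are illustrative assumptions, and in practice the stored transitions would typically be replayed several times rather than once.

```python
def train_intraoption_policy(replay_buffer, initiation_set, pseudo_reward, actions,
                             alpha=0.1, gamma=0.99):
    """Sketch of learning an option's intrapolicy from replayed experience.

    `replay_buffer` holds (s, a, s_next) transitions observed earlier; only
    transitions starting inside the option's initiation set are replayed, and
    the environment reward is replaced by the option's pseudo-reward function.
    """
    Q_o = {}  # option-specific action-value table
    for s, a, s_next in replay_buffer:
        if s not in initiation_set:
            continue
        r = pseudo_reward(s_next)
        q_old = Q_o.get((s, a), 0.0)
        best_next = max((Q_o.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
        # one-step Q-learning on the pseudo-reward
        Q_o[(s, a)] = q_old + alpha * (r + gamma * best_next - q_old)
    # greedy intrapolicy induced by the learned option values
    return lambda s: max(actions, key=lambda a: Q_o.get((s, a), 0.0))
```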

2.3 State Transition Graph (STG)


Graph-based skill acquisition methods work with a State Transition Graph (STG) that represents the
RL problem at hand. Each node in an STG represents one state, and edges indicate the transition
between states by executing an action. In this way, each edge is also associated with a signal
indicating the received reward and the action executed that led the agent to experience such a
transition.
Given this scenario, there are two main approaches used in graph-based skill acquisition meth-
ods for obtaining the STG, which are described as follows:
• Global STG: This approach explores all possible states and all state transitions that may
occur in a given RL problem. With this, a full STG that maps all possible transitions to all
states is built. This approach guarantees a reliable STG that fully represents the RL problem
at hand. The key problem here is that this approach is only viable for small and simple RL
problems, since exploring all possible state transitions may become inviable for large or
complex RL problems.
• Local STG: Instead of exploring all possible state transitions, it is possible to approximate the STG by using past state-action trajectories visited during previous episodes. A state-action trajectory is a list of states visited by the agent in past experiences, complemented with the action a_t the agent performed in state s_t at time t and the resulting next observed state s_{t+1}. For an RL problem with a deterministic interaction between the agent and the environment, this approach is capable of determining an STG that represents a good approximation of the environment dynamics. On the other hand, for stochastic environments, where the same action a performed while in state s during two distinct time steps can result in different next states, this approach may not obtain a good approximation. A small construction sketch is given after this list.
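As a concrete (and deliberately simple) illustration of the Local STG construction, the sketch below accumulates observed transitions into a directed graph using the networkx library; the trajectory format is an assumption made for the example.

```python
import networkx as nx

def build_local_stg(trajectories):
    """Build a Local STG from previously observed trajectories.

    `trajectories` is assumed to be an iterable of episodes, each a list of
    (state, action, reward, next_state) tuples. Nodes are states; each edge
    stores the first observed action and reward for that transition plus a
    visit count that can later serve as an edge weight. (A MultiDiGraph could
    instead keep one edge per distinct action.)
    """
    G = nx.DiGraph()
    for episode in trajectories:
        for s, a, r, s_next in episode:
            if G.has_edge(s, s_next):
                G[s][s_next]["count"] += 1
            else:
                G.add_edge(s, s_next, action=a, reward=r, count=1)
    return G
```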

3 GRAPH ANALYSIS METHODS


Since we will cover several graph-based methods for skill acquisition in this survey, it is important
to lay out the basic foundation of some commonly used network analysis metrics used in these
approaches. Therefore, in this section, we briefly describe some important centrality measures and
clustering algorithms that are widely used in the works covered here. How these methods are used
in skill acquisition problems is described later in the survey.


We consider here that a graph G is formally represented by G = (V , E), where V is the set of
vertices (nodes) and E is the set of edges in the graph. The number of vertices (nodes) in the graph
is n = |V|. The graph's connectivity is represented by an adjacency matrix A, where the entry a_ij is the weight of the edge that connects nodes i and j. For unweighted graphs, a_ij = 1 if there exists an edge between nodes i and j, and a_ij = 0 otherwise.

3.1 Centrality Measures


There are several ways to define the relative importance of a given node in a graph. This is done
using centrality measures: a score given to each node that measures its importance in a given
context, thus leading to a node ranking. There are several possible centrality measures, but we
will limit ourselves to describe only those that are more commonly used in the context of skill
acquisition.
3.1.1 Degree Centrality. The most trivial centrality measure is based on the degree of each node.

The degree di of node i is given by di = j ∈V ai j . The higher the degree of a node, the higher the
degree centrality. The degree centrality thus allows the identification of highly connected nodes.
The problem is that this information alone is sometimes insufficient, therefore requiring more
meaningful centrality measures.
3.1.2 Closeness Centrality. The closeness centrality reflects the mean distance from a vertex to
all other vertices. Therefore, a node with a high closeness centrality is on average relatively closer
to all other nodes in the network. Formally, the closeness centrality of a node i is thus defined as
the inverse of the mean distance from that node to all other n − 1 nodes in the network, i.e.,
c_i = (n − 1) / Σ_{j∈V, j≠i} δ(i, j),    (5)

where δ(i, j) is the distance between nodes i and j.


3.1.3 Betweenness Centrality. In some scenarios, it is desirable to identify nodes through which many shortest paths pass. This is achieved by the betweenness centrality, which measures the ratio between the number of shortest paths (between any two nodes) that pass through a specific node i and the total number of shortest paths between these same two nodes (there can be several shortest paths of the same length that differ in the route taken). The betweenness centrality of node i is
formally given by

b_i = Σ_{s≠i≠t ∈ V} σ_st(i) / σ_st,    (6)

where σ_st(i) is the number of shortest paths between nodes s and t that pass through i and σ_st is the total number of shortest paths between s and t.
3.1.4 Eigenvector Centrality. In the degree centrality, each node contributes equally to the cen-
trality value of each of its neighbor nodes. In the eigenvector centrality, on the other hand, central
nodes contribute more to the centrality of their neighbor nodes. The eigenvector centrality of a
node is proportional to the sum of the eigenvector centralities of its neighbor nodes. Therefore,
central nodes are usually connected to other central nodes or are connected to a relatively large
number of low-centrality nodes. The most central nodes typically have both. The eigenvector cen-
trality x_i of node i is formally defined as

x_i = (1/λ) Σ_{j∈V} a_ij x_j,    (7)


where λ is an eigenvalue of the adjacency matrix A associated with the eigenvector x. Therefore,
the eigenvector centrality of each node in a graph can be calculated by solving the eigenvector
problem
Ax = λx . (8)
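Assuming an STG represented as a networkx graph (for instance, built with the Local STG sketch in Section 2.3), the centrality measures above can be computed with standard networkx routines, as in the convenience sketch below. The surveyed methods often rely on their own, more specialized implementations; computing the eigenvector centrality on an undirected view of the graph is a simplification made here.

```python
import networkx as nx

def rank_states_by_centrality(G, measure="betweenness", top_k=5):
    """Rank STG nodes by one of the centrality measures of Section 3.1."""
    if measure == "degree":
        scores = nx.degree_centrality(G)
    elif measure == "closeness":
        scores = nx.closeness_centrality(G)
    elif measure == "betweenness":
        scores = nx.betweenness_centrality(G)
    elif measure == "eigenvector":
        # undirected view used here for robustness of the power iteration
        scores = nx.eigenvector_centrality(G.to_undirected(), max_iter=1000)
    else:
        raise ValueError(f"unknown measure: {measure}")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```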

3.2 Clustering Algorithms


There are two main groups of clustering algorithms: data clustering and graph clustering. Data
clustering is a process where a set of points in a given space Rm is clustered according to their
proximity; that is, it aims to define clusters such that it minimizes the distance between points in
the same cluster and maximizes the distance between points of different clusters. Graph clustering
consists of the process where nodes with similar connections are associated with a single cluster.
The objective is to define clusters that maximize the number of intracluster connections while
minimizing the number of connections to nodes of other clusters. We will present here two of the
most commonly used clustering methods in graph-based approaches for skill acquisition.
3.2.1 K-Means. K-Means is a clustering algorithm that takes as input a set of n points
p1 , . . . , pn ∈ Rm and groups them into k clusters based on their distance, where k is a parameter
of the algorithm. It starts by selecting k random points to represent the centroid of each cluster,
given by C 1 , . . . , Ck . It then enters an iterative stage that (1) assigns each point pi to the cluster
O_j with the nearest centroid C_j (usually using the Euclidean distance) and (2) recalculates the centroids C_1, . . . , C_k using

C_a = (1/|O_a|) Σ_{p_i ∈ O_a} p_i,  ∀a ∈ {1, . . . , k}.    (9)

This process continues until there is no change in the configuration of clusters between two consecutive iterations. The K-Means algorithm is summarized in Algorithm 1.
ALGORITHM 1: K-Means Clustering
Data: p1 , . . . , pn , k
Result: O 1 , . . . , O k
begin
Select k random points C 1 , . . . , Ck to represent the centroid of each cluster;
while clusters O 1 , . . . , O k change between consecutive iterations do
for i = 1, . . . , n do
Determine j such that C j is the closest centroid to pi ;
Assign point pi to cluster O j ;
end
for a = 1, . . . , k do
    C_a = (1/|O_a|) Σ_{p_i ∈ O_a} p_i;
end
end
end
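For reference, a minimal usage sketch of K-Means (Algorithm 1) with SciPy's implementation is shown below; the random 2-D points are placeholders for whatever state features or embeddings a given method actually clusters.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy input: 100 points p_1, ..., p_n in R^2 (placeholders for whatever
# features a given method clusters, e.g., rows of a spectral embedding).
points = np.random.rand(100, 2)

# kmeans2 returns the k centroids C_1, ..., C_k and a cluster label per point.
centroids, labels = kmeans2(points, k=5, minit="++")

# Recover the clusters O_1, ..., O_k as index sets over the input points.
clusters = [np.flatnonzero(labels == a) for a in range(5)]
```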

3.2.2 Spectral Clustering. Spectral Clustering is a clustering method that uses as its main tool
the eigenvectors of the Laplacian matrix of a graph. Before briefly describing the Spectral Cluster-
ing method, let us define the degree matrix and the Laplacian matrix:
• Degree Matrix: the degree matrix, represented by D, is a diagonal matrix where D_ij = 0 for i ≠ j and D_ii = d_i.


• Laplacian Matrix: the Laplacian matrix is a special matrix given by L = D − A that has
n real and nonnegative eigenvalues (where n = |V |, i.e., the number of nodes in G). The
number of eigenvalues of L that equals zero indicates the number of connected components
in the corresponding graph G = (V , E) [52].
The Spectral Clustering algorithm, presented in Algorithm 2, operates as follows: given a graph
G = (V , E) and a parameter k that indicates the number of desired clusters, it computes the Lapla-
cian matrix L and then determines the k eigenvectors u_1, . . . , u_k ∈ R^n (n being the number of nodes in the graph G) associated with the k smallest eigenvalues of L. It then constructs a matrix U ∈ R^{n×k}, where the ith column of U is given by the eigenvector u_i. It then considers a set of points y_i ∈ R^k for i = 1, . . . , n, where y_i is given by the ith row of U. The algorithm then uses a clustering method (K-Means, for example) over the set of points y_i, resulting in k clusters C_1, . . . , C_k. Finally, a new set of clusters O_1, . . . , O_k is created such that node j of graph G belongs to O_i if y_j ∈ C_i, that is, O_i = {j | y_j ∈ C_i}.

ALGORITHM 2: Spectral Clustering


Data: G, k
Result: O 1 , . . . , O k
begin
Compute L = D − A;
Compute k eigenvectors u_1, . . . , u_k associated with the k smallest eigenvalues of L;
Build U ∈ R^{n×k} where the ith column of U is given by the vector u_i ∈ R^n;
Extract dataset y_1, . . . , y_n ∈ R^k where y_i is the ith row of U;
Run a clustering method over y_1, . . . , y_n, resulting in clusters C_1, . . . , C_k;
Build k clusters O_1, . . . , O_k where O_i = {j | y_j ∈ C_i};
end
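A compact NumPy/networkx sketch of Algorithm 2 is given below, assuming the unnormalized Laplacian L = D − A and SciPy's K-Means for the final step; surveyed methods sometimes use normalized variants of the Laplacian instead.

```python
import numpy as np
import networkx as nx
from scipy.cluster.vq import kmeans2

def spectral_clustering(G, k):
    """Spectral clustering of graph G into k clusters (Algorithm 2),
    using the unnormalized Laplacian L = D - A."""
    nodes = list(G.nodes())
    A = nx.to_numpy_array(G.to_undirected(), nodelist=nodes)
    D = np.diag(A.sum(axis=1))
    L = D - A
    # eigh returns the eigenpairs of a symmetric matrix in ascending order
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                     # n x k matrix of eigenvectors
    _, labels = kmeans2(U, k, minit="++")  # K-Means on the rows y_1, ..., y_n
    clusters = [[] for _ in range(k)]
    for node, label in zip(nodes, labels):
        clusters[label].append(node)
    return clusters
```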

4 BENCHMARKS AND COMPARISON APPROACHES


Before delving into the different graph-based skill acquisition methods, we first describe in this sec-
tion the common benchmarks used by those methods. The benchmarks described here are usually
used to compare the efficiency of one method over a baseline agent with only primitive actions.
The first widely used test environment for skill acquisition methods is the Two-Room domain
(Figure 3), defined as follows: The agent starts at a random location in a room inside a grid-based
world and must reach a random destination in the adjacent room. These rooms are separated by a
wall with a small door connecting them. Other extended variations of this domain soon followed,
such as the Four-Room and Six-Room domains, used, for example, by Şimşek [11].
The taxi domain is another test environment adopted by many authors. It is another grid-world
environment where the agent controls a taxi. At the start, the taxi is positioned randomly in the
grid, while the passenger and destination are chosen randomly over a fixed set of possible locations.
The objective is to move to the passenger's location, pick the passenger up, take the passenger to the drop-off point, and leave the passenger there. There are six possible actions for the agent: move north, south, east, west, pick up, and put down. The pick-up action puts the passenger inside the taxi, and it only works when the taxi is in the same location as the passenger. Otherwise, this action has no effect. Similarly, the put-down action drops the passenger at the current location. It only works when the passenger is already inside the taxi and the taxi is at the destination.
Otherwise, this action does nothing. The agent receives a reward of +50 for correctly delivering a
passenger, −1 for each action, and −10 for each unsuccessful pick-up and put-down.


Fig. 3. Two-Room test environment. The RL agent starts at a random position inside the left room and its goal
is to reach the G position inside the right room. Many skill acquisition approaches use this test environment
to compare different approaches. Skills learned in this environment using different techniques tend to focus
on reaching the doorway, showing that this position represents a bottleneck.

The Arcade Learning Environment (ALE) [51] is one of the most recent benchmarks in the skill acquisition community, as well as in the Reinforcement Learning community in general. The ALE provides a collection of Atari games where the agent can only access the current state of the game, given by the current frame, and the rewards. This makes ALE an ideal benchmark since it provides an easy way to compare different methods against other approaches in the literature.
There are different ways to assess the quality of a given graph-based skill acquisition method
based on which benchmark is used. In general, there are two main desirable attributes: (1) perfor-
mance, which measures the total amount of rewards the agent can harness, and (2) training speed.
For the Two-Room environment and its variations, authors are more concerned with the training
speed and the exploration capacity, given that these environments can already be fully mastered
by simple RL methods. Therefore, the objective is to create a method capable of reaching this level
of mastery in fewer steps. The same is also valid for the taxi domain. For the ALE, however, the
performance is also very important: many Atari games are considered to be very difficult for a
simple RL method to master. Therefore, creating skill acquisition methods capable of improving
the total score on a specific game is considered a gain. The training speed is also very important
for this benchmark, given that the training time required for many of these games is high.

5 SKILL ACQUISITION THROUGH CENTRALITY MEASURES


Many skill acquisition methods using centrality measures that we encountered during our survey
adopt a similar overall algorithm for creating macro-actions. This generic algorithm, presented in
Algorithm 3 and illustrated in Figure 4, starts by defining a set of options O and initializing it as the empty set. The algorithm then enters an iterative process that is repeated after the execution of each new episode of the RL problem of interest. During each episode, the state-action trajectory is saved. This information is then used to build or update the Local STG at the start of each iteration. With the Local STG updated, the next step consists of finding important states according to a centrality measure, resulting in a list s_1, . . . , s_e of e important states. The algorithm then proceeds to create an option for reaching each of the important states s_i, i ∈ {1, . . . , e}, which is done by (1) defining the initiation set I based on the adopted centrality measure, (2) using a pseudo-reward function that gives a negative reward for each step and a positive reward for reaching the subgoal state s_i, and (3) using as the termination condition β a probability of 1 for the subgoal state or when
the agent leaves the initiation set, and 0 for the states in the initiation set. The newly created option o = <I, π_o, β> is then trained using experience replay [35] in order to find a good intrapolicy π_o that allows the agent to properly execute o.

Fig. 4. Illustration of the macro-action discovery process using centrality measures. The process starts with (a) an initial STG of a given problem. It then goes to (b) the identification of the most important states based on a given centrality (in this figure, we considered the betweenness centrality, with the size of a node proportional to its betweenness centrality). Finally, (c) the states with the highest centrality (the top x nodes with the highest centrality, or all nodes with a centrality value above a predefined threshold) are identified as subgoals, thus resulting in macro-actions capable of reaching such states.

ALGORITHM 3: Skill Acquisition Using Centrality Measures


Data: Trajectories T = (s 1 , a 1 , r 1 ), . . . , (sm , am , rm )
Result: Set of options O = o 1 , . . . , ok
begin
O = ∅;
for each new episode do
Build the Local STG using trajectories T ;
Determine important states s 1 , . . . , se by using a centrality measure;
for i ∈ 1, . . . , e do
Define an initiation set I for reaching si based on the centrality measure used;
Define a pseudo-reward function that rewards the agent for reaching state si and punishes for
each step taken;
Define termination probability function β as 0 for s ∈ I and 1 for s_i or s ∉ I;
Build new option o =< I , πo , β >;
Train option o using Experience Replay;
O = O +o
end
end
end

The main differences between distinct graph-based skill acquisition approaches based on cen-
trality measures, as we show in the remainder of this section, are (1) which centrality measure is
used, (2) how the STG is built, and (3) how the initiation set is defined. The remaining steps are
usually very similar to each other. Algorithm 3 presents the case in which a Local STG is used,
although a Global STG might also be used, as in some approaches presented in the remainder of
this section. Another key concept in many centrality-based methods is that the important states


s_1, . . . , s_e are defined as being those with a centrality measure higher than a predefined threshold t. Hence, the threshold t indirectly controls how many subgoals are identified within the STG.
As a consequence, choosing an appropriate value for t depends on the particular problem being
solved, as well as on which centrality measure is used. We now describe existing graph-based skill
acquisition approaches using centrality measures, divided into three categories: (1) methods that
use a Global STG, (2) methods that use a Local STG, and (3) those that do not use an STG.
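Combining the earlier sketches, the thresholding step shared by these centrality-based methods might look as follows. The helper for the initiation set (all states that can reach the subgoal within a fixed number of transitions in the directed STG) is a hypothetical choice used only for illustration; as discussed below, each surveyed method defines the initiation set differently.

```python
import networkx as nx

def find_subgoals(G, threshold, centrality=nx.betweenness_centrality):
    """States whose centrality exceeds the predefined threshold t
    (the generic identification step of Algorithm 3)."""
    scores = centrality(G)
    return [s for s, c in scores.items() if c > threshold]

def initiation_set(G, subgoal, radius=5):
    """Hypothetical initiation set: all states that can reach `subgoal`
    within `radius` transitions of the directed STG."""
    reachable = nx.single_source_shortest_path_length(
        G.reverse(copy=False), subgoal, cutoff=radius)
    return {s for s in reachable if s != subgoal}
```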

5.1 Centrality Measures with a Global STG


One of the pioneering works related to skill acquisition is presented by Digney [19]. In his ap-
proach, the importance of a state is measured by the gradient of reinforcement signals received
during state transitions and the number of visits. Therefore, each state is associated with two func-
tions: (1) a function m(s) that measures the gradient of reinforcement signals when transitioning
to state s and (2) a function c (s) that indicates how often the state s was visited during training.
Whenever a state achieves a value for m(s) or c (s) higher than predefined thresholds, this state is
considered to be a subgoal, which means that higher reinforcement signals are received when the
agent reaches this state. The skills are then considered to be any sequence of actions that leads the
agent from a given state to a subgoal. Therefore, Digney [19] introduced a concept analogous to
node centrality as seen in complex network problems, even though this was not explicitly declared
by the author. Digney [19] also presented the Two-Room environment (Figure 3), where the agent
learns that the area that contains the door is associated with important states, since these states are
visited often (resulting in high values of c (s)). Additionally, the destination states are also consid-
ered important states (and subgoals) given their high value of m(s). The main disadvantages of this
approach are the following: (1) using visitation frequencies to find important states may prove dif-
ficult and unfeasible for problems with a large or continuous state space, and (2) the reinforcement
gradient approach may not prove useful for problems with delayed rewards.
Şimşek and Barto [12], Şimşek [11], and Şimşek and Barto [14] present the first approach to
use the betweenness centrality to identify important subgoals. Instead of calculating the exact be-
tweenness centrality for each node, the authors only calculate the number of minimum paths c s (v)
starting from a given node s that passes through a node v. If c s (v) > t, where t is a threshold value,
then v is associated with a subgoal. A skill is created for each subgoal state д by setting the initi-
ation set as the states s where c s (д) > t. To reduce and filter redundant skills, the authors defined

a support measure for each subgoal д, given by p(д) = s ∈I (д) c s (д), where I (д) is the initiation
set of д. Then, all the subgoals previously identified are inserted into a new subgraph G. The only
subgoals considered are those with the highest support p(д) for each strongly connected compo-
nent of G, resulting in the identification of fewer skills. The authors compared their approach with
a basic Q-Learning agent using only primitive actions (called from here on the baseline agent) in
the Four-Room and the taxi domain. It was shown that, by using skills, the agent could reach the
destination in the Four-Room domain in fewer steps when compared with a baseline agent using
only primitive actions, while in the taxi domain, the skilled agent managed to achieve a higher
score while executing the same number of steps.
Menache et al. [48] propose an approach inspired by the Max Flow/Min Cut problem, where
the objective is to search for a path in a weighted graph that corresponds to the maximum flow
between a source node s and a sink node t. The skill acquisition problem can then be abstracted
to a Max Flow problem by considering the STG. The initial state corresponds to the source node s,
and the sink node t is represented by the ending state (the definition of the source and sink nodes
may differ depending on the application). The procedure, called Q-Cut, starts by updating the
weight of each edge based on relative frequencies, obtained after each new episode. This allows
the execution of the Max Flow algorithm, which in turn allows the detection of bottleneck states.


These states are defined as subgoals. As a result, this approach allows the detection of important
states without the need for several episodes, because if the agent crosses a bottleneck at least
once, the Q-Cut algorithm will already be able to identify it. After identifying bottleneck nodes,
the initiation set is defined as the source states of the identified cuts. Similarly to the previous
approach, the agent based on the Q-Cut method was compared with a baseline agent in the
Two-Room and Six-Room domains, where it was shown that the agent using skills managed to
reach the goal in fewer steps than the baseline agent.
In this subsection, we presented some centrality-based skill acquisition methods that rely on
a Global STG. The main difference between these methods is how the subgoal states are chosen:
betweenness centrality, visitation frequencies, and bottleneck states in a minimum cut problem.
Since these methods use a Global STG, their use is restricted to small problems, given that a Global
STG does not scale well for problems with a large state space.

5.2 Centrality Measures with a Local STG


The Relative Novelty algorithm, proposed by Şimşek and Barto [13], defines subgoal states as
those that connect many clusters. Each state is associated with a novelty value, which measures
how often this state is visited, where values near 1 indicate that a state is rarely visited and
values near 0 indicate the opposite. Another key concept is the relative novelty of a state that
measures the relation between the novelty of states visited prior to the current state and the nov-
elty of states visited afterward. High values of relative novelty indicate that a state, once visited,
leads the agent to undiscovered states. Therefore, subgoals are defined as states with relative nov-
elty values above a predefined threshold. The initiation set of a given option (each one associated
to a different subgoal) is defined as the N previous states visited by the agent before reaching the
subgoal (N being a parameter of the algorithm). A drawback of this approach is that maintaining
past histories as well as the novelty values for each visited state may prove unfeasible for problems
with a large, complex, or continuous state space. The results achieved for the proposed method by
the skilled agent in the taxi, Two-Room, and Six-Room domains also showed better performance
when compared with a baseline agent.
A new node centrality measure also used to identify bottleneck states is proposed by Rad
et al. [61]. The idea is to create a centrality measure capable of identifying bottleneck nodes with
greater precision than other centralities. The proposed centrality, called Connection Graph Stability
(CGS) centrality, is given by
 δu (s, t )
Su = , (10)
sut ∈V
σst
where s, u, and t are nodes of the built STG, V is the set of nodes of the graph, δu (s, t ) is the length
of the shortest path between s and t that passes through u, and σst is the number of shortest paths
linking nodes s and t. The CGS centrality, therefore, measures the relation between the length
of the shortest paths passing through a given node and the number of total shortest paths. High
values of the CGS indicate that a node is used to connect several pairs of nodes, including node
pairs that are distant from the measured node. The authors then define a second scoring method
for each node that is based on the centrality measure of the node and the centrality of its neighbors,
given by

SX_u = X_u · (X_u / max_{n∈N(u)} X_n)²,    (11)

where X_u is the centrality measure of node u (any centrality can be used for this scoring function: betweenness, closeness, connection graph stability, and so on) and N(u) is the set of neighbors of node u. The bottleneck states are then defined as the k states with the highest SX_u score, where k


is a parameter of the algorithm. A baseline agent was compared with an agent using skills with dif-
ferent scoring functions: betweenness, closeness, and the proposed CGS. The adopted benchmarks
were the Two-Room and taxi domains, where the agent using skills based on the CGS centrality
managed to reach the goal in fewer steps, followed by betweenness and closeness centralities. As
expected, the baseline agent required a greater number of steps to reach the goal.
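To make Equation (10) concrete, the following naive sketch computes the CGS score on a small STG with networkx. It interprets δ_u(s, t) as d(s, u) + d(u, t) and sums over ordered node pairs; this is only an illustrative reading of the definition and is far less efficient than, and may differ in details from, the implementation of Rad et al. [61].

```python
import itertools
import networkx as nx

def cgs_centrality(G):
    """Naive sketch of the Connection Graph Stability score (Equation (10))."""
    dist = dict(nx.all_pairs_shortest_path_length(G))
    scores = {u: 0.0 for u in G}
    for s, t in itertools.permutations(G.nodes(), 2):
        if t not in dist.get(s, {}):
            continue  # t is unreachable from s
        # sigma_st: number of shortest s-t paths
        sigma_st = sum(1 for _ in nx.all_shortest_paths(G, s, t))
        for u in G:
            if u in (s, t):
                continue
            if u in dist[s] and t in dist.get(u, {}):
                # delta_u(s, t): shortest route from s to t constrained through u
                delta_u = dist[s][u] + dist[u][t]
                scores[u] += delta_u / sigma_st
    return scores
```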
Concept Bridge Centrality [53] extends the Connection Graph Stability centrality by considering both local and global information of a given node. Like the CGS, it identifies bottleneck nodes, but the added information allows it to better filter redundant ones. The local information is given by the bridge coefficient, which measures the degree of difference between a given node and its
neighbors, as follows:

BR(u) = e^{α |d_u − (1/d_u) Σ_{u_j ∈ N(u)} d_{u_j}|},    (12)
where α > 0 is a constant value. High values of the bridge coefficient indicate that a node has a
considerably lower degree with respect to its neighbors, suggesting that it must be a bottleneck
node. The global information is measured by any centrality measure; Moradi et al. [53] used the
Connection Graph Stability in their experiments. As a result, this approach finds a reduced num-
ber of bottleneck states that are then used to create macro-actions through the options framework.
The graph and skill construction used in that work is identical to the approach used in the authors'
previous work [61]. Both previous centrality measures proposed by the authors, namely, the Con-
nection Graph Stability and Concept Bridge centralities, were compared with similar approaches
using betweenness [12] and closeness centralities. Results show that these two new centralities
are capable of better filtering the important nodes, thus achieving a better performance.

5.3 Centrality Measures without an STG


Some centrality-based methods for skill acquisition use only the information contained in past
trajectories in order to define the importance of a state. Hence, these methods do not require an
STG. In the sequence, we will describe how each of these methods defines its centrality measure.
McGovern and Barto [46] propose an approach based on the idea of diverse density [43, 44].
Diverse density is a supervised learning technique used for finding important features in a state
space by assigning a score to each feature based on how often it appears in positive instances and how rarely it appears in negative ones. Each episode of the RL problem is then transformed into a se-
quence of states visited during this episode and is later assigned to one of two classes: (1) successful
and (2) unsuccessful episodes. States with a high diverse density can therefore be interpreted as
bottlenecks or subgoals, as they appear regularly in successful episodes and rarely in unsuccessful
ones. The authors also suggest that this approach may be used for abstract state spaces, useful
for problems with a large or continuous state space, although this is not tested in that work. The
option's initiation set is defined as the states visited during the n steps before reaching the subgoal at step t during successful episodes. After each new episode, the diverse density procedure is re-
executed and new subgoals are found. Only subgoals found several times are considered in order
to prune unwanted subgoals. The method was tested in the Two-Room and Four-Room [59] domains, where, as with previous approaches, the agent using the proposed skills outperformed a baseline agent using only primitive actions.
Connect-based subgoal discovery [9] follows the idea of the previous work [46]. Instead of only
using the visitation frequencies to determine important states, the authors propose an approach
capable of identifying nonimportant states with high visitation frequencies. This is accomplished
by analyzing the visitation frequency of neighboring states and considering only states whose

visitation frequency is higher than that of their neighbors. This helps to disregard highly visited states that are not important. The authors used the Two-Room environment to show that their approach is capable of identifying only doorways as subgoals, effectively filtering nonimportant states.

Fig. 5. Illustration of the macro-action discovery process using clustering algorithms. The process starts with an STG of a given problem (a). It then identifies the clusters (b) of the initial graph (a). Each cluster is associated with an abstract state (c), and the connections between abstract states are associated with macro-actions.
Another approach that uses bottleneck states is presented by Goel and Huber [22], where the
authors propose to identify bottlenecks by checking the number of predecessor and successor
states of a given state. Whenever a state has many predecessors and few successor states, the
given state is a bottleneck. The proposed method is tested in a variation of the Four-Room domain,
where it only identifies doorways as subgoals. The skilled agent also learned to reach the goal
with considerably fewer training steps when compared with a baseline agent using only primitive
actions.
Even though the methods presented in the current subsection do not use an STG, they all rely
on statistical information of each node. However, in large or continuous problems, it is difficult to
associate a set of information with each node, given that there are many or even infinitely many nodes.
Since these approaches were only tested in simple problems, it is difficult to know how well they
perform for more complex problems.

6 SKILL ACQUISITION THROUGH CLUSTERING ALGORITHMS


Skill acquisition approaches based on clustering algorithms also present a general algorithm that
guides them. Once again, we will present the general algorithm, followed by many of the encoun-
tered approaches in the literature. We will consider the use of a Local STG given its wider usage,
although a Global STG can also be used. The general algorithm, presented in Algorithm 4 and
illustrated in Figure 5, operates as follows: After the execution of each new episode of the RL
problem, the STG is built based on the observed trajectories (in case a Local STG is used). With
this graph, states are separated into clusters through a clustering algorithm. A new option is then
created for each pair of clusters Ci and C j such that a state in C j is reachable from a state in Ci
(represented here by Ci ⇒ C j ). The initiation set I is defined as being the states that belong to
cluster C_i. The adopted pseudo-reward function rewards the agent for reaching a state in the destination cluster C_j and punishes it for each step taken or for entering a state that belongs to neither I nor C_j (s ∉ C_j ∪ I). The termination probability function β is set to 0 for states s ∈ I and 1 for any state that does not belong to the initiation set (s ∈ S − I). A new option o = <I, π_o, β> is


then built or updated (in case it already existed from previous episodes) and is then trained using
experience replay.
The main concept of clustering-based skill acquisition methods is to build an option capable
of transitioning the agent from one cluster Ci to another C j (these cluster transitions are repre-
sented by the edges in Figure 5(c)). Since these cluster transitions are controlled by an option, such
transitions may take several primitive actions to occur. These primitive actions are chosen based
on the intrapolicy πo , learned through experience replay using the pseudo-reward function pre-
viously described. Given that each cluster represents a collection of similar states, an option that
transitions the agent between clusters can be seen as an abstract behavior.
In contrast to the centrality-based methods, no skill acquisition approach using clustering algorithms adopts a Global STG; all methods described here use a Local STG. Hence, the main differences between the clustering-based methods are the clustering algorithm used and how the importance of a state is defined. Several new clustering techniques were developed in this area
to ensure that the STG is well separated. The quality of the resulting clusters directly impacts the
quality of the skill acquisition step. In the remainder of this section, we will present the clustering
methods proposed for graph-based skill acquisition using Local STGs.

ALGORITHM 4: Skill Acquisition Using Clustering Algorithms


Data: Trajectories T = (s 1 , a 1 , r 1 ), . . . , (sm , am , rm )
Result: Set of options O = o 1 , . . . , ok
begin
O = ∅;
for each new episode do
Build the Local STG G using trajectories T ;
Separate nodes in G into clusters C 1 , . . . , Ck ;
for i = 1, . . . , k do
for j = 1, . . . , k and i ≠ j and Ci ⇒ Cj do
Define an initiation set I as the states of cluster Ci ;
Define a pseudo-reward function that rewards the agent for reaching any state s ∈ Cj and punishes for each step taken or for reaching a state s ∉ Cj ∪ I;
Define termination probability function β as 0 for s ∈ I and 1 for s ∈ S − I ;
Build new option o =< I , πo , β >;
Train option o using Experience Replay;
O = O +o
end
end
end
end
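The sketch below illustrates the two structural steps specific to Algorithm 4: enumerating the cluster pairs that admit an observed transition, and assembling the corresponding option skeleton. The dictionary-based option representation and the particular pseudo-reward values are illustrative assumptions, not the design of any specific surveyed method.

```python
def cluster_transitions(G, clusters):
    """Cluster pairs (C_i -> C_j) that share at least one observed transition
    in the STG; each such pair gives rise to one option in Algorithm 4.

    `clusters` is a list of node collections, e.g., the output of the
    spectral clustering sketch in Section 3.2.2.
    """
    membership = {node: idx for idx, cluster in enumerate(clusters) for node in cluster}
    pairs = set()
    for u, v in G.edges():
        ci, cj = membership.get(u), membership.get(v)
        if ci is not None and cj is not None and ci != cj:
            pairs.add((ci, cj))
    return sorted(pairs)

def make_cluster_option(clusters, ci, cj):
    """Minimal option skeleton <I, pi_o, beta> for moving from C_i to C_j:
    initiation set I = C_i, termination probability 1 outside C_i, and a
    pseudo-reward for the intrapolicy (still to be learned, e.g., by replay)."""
    initiation = set(clusters[ci])
    destination = set(clusters[cj])
    def beta(s):
        return 0.0 if s in initiation else 1.0
    def pseudo_reward(s_next):
        if s_next in destination:
            return 1.0    # reached the destination cluster C_j
        if s_next not in initiation:
            return -1.0   # left C_i without reaching C_j
        return -0.01      # small per-step penalty while inside C_i
    return {"initiation": initiation, "beta": beta, "pseudo_reward": pseudo_reward}
```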

One of the first approaches to actually use clustering algorithms to find skills in an RL problem is
presented by Mannor et al. [42]. The adopted clustering function focuses on creating clusters with
roughly the same size and with similar reinforcement signal transitions. This approach allows the
agent to identify promising clusters by preferring options that lead the agent to clusters with high
reinforcement signals. However, a key question arises: when is the best moment to execute the
clustering algorithm? This is important since it is infeasible to perform this operation after each
episode due to execution time constraints. The authors state that this issue is problem dependent. Their suggested approach is to perform the clustering algorithm when no new states are visited after a certain number of episodes (this number is a parameter). The authors compared


an agent using their method with a baseline agent adopting only primitive actions in a 19-Room
domain, showing that their approach manages to reach the destination in fewer steps.
Following the idea presented by Şimşek and Barto [13] of defining important states as those that
connect different regions of the state space, Şimşek et al. [15] propose a new metric to determine
the states that have such a characteristic. The authors propose the L-Cut algorithm, where the idea
is to determine a cut of the state trajectory graph that separates different regions. An edge’s weight
is associated with the number of times a given transition was experienced. The graph cut metric
used is the Normalized Cut (NCut) [64], which separates the STG into two distinct clusters and
measures the relation between the probability of transitioning between clusters and the probability
of transitioning within the same cluster. Therefore, the goal here is to find a minimum NCut, which
is a cut that effectively identifies the edges that separate the two clusters. Since finding a minimal
NCut is an NP-hard problem, the authors first use the spectral clustering method proposed in [64],
which outputs an approximation of the minimum NCut and an approximate cluster separation, as
we have already shown in Section 3.2.2. The L-Cut algorithm then proceeds to test the possible cuts
suggested by the spectral clustering method by analyzing the second eigenvector and choosing
the cut with the minimal NCut. This procedure is performed once for each newly obtained state
trajectory. Options are designed to take the agent from a given state to a state output by the L-Cut
algorithm. The initiation set is defined based on the cuts that identified the given state as a subgoal
and consists of the states from the opposite cluster. An agent using the L-Cut algorithm was
compared with an agent using the Relative Novelty (RN) method [13] and with a baseline Q-Learning
agent with no skills in the Two-Room and taxi domains, where the L-Cut method outperforms both,
although the gain over RN is considerably smaller.
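
As a rough illustration of the cut-selection step, the sketch below (our own, not the original L-Cut implementation) computes the second eigenvector of the normalized Laplacian of a symmetric weight matrix W built from transition counts and sweeps over its sorted entries, keeping the bipartition with the smallest NCut value, in the spirit of the spectral relaxation of [64].

import numpy as np

def ncut(W, mask):
    """NCut value of the bipartition (mask, ~mask) of the symmetric weight matrix W."""
    cut = W[mask][:, ~mask].sum()
    assoc_a, assoc_b = W[mask].sum(), W[~mask].sum()
    if assoc_a == 0 or assoc_b == 0:
        return np.inf
    return cut / assoc_a + cut / assoc_b

def sweep_cut(W):
    """Approximate the minimum NCut via the second eigenvector of the normalized Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    l_sym = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    _, eigvecs = np.linalg.eigh(l_sym)
    order = np.argsort(eigvecs[:, 1])            # sort states by the second eigenvector
    best_mask, best_val = None, np.inf
    for k in range(1, len(W)):                   # evaluate every prefix of the ordering
        mask = np.zeros(len(W), dtype=bool)
        mask[order[:k]] = True
        val = ncut(W, mask)
        if val < best_val:
            best_mask, best_val = mask, val
    return best_mask, best_val                   # states with mask=True form one cluster
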
Kheradmandian and Rahmati [31] propose an approach based on two clustering algorithms over
the STG G in order to find subgoal states. The first algorithm is a topology clustering algorithm
that iteratively removes specific edges from G in order to identify clusters, resulting in a new
graph G′ with nonconnected clusters. The second is a clustering algorithm based on the value
function, performed over G′. It starts by defining the distance between two states as being the
difference between their value functions. It then creates clusters such that the distances between
states within the same clusters are minimized, while maximizing the distance between states of
different clusters. An option is then built in one of two ways: (1) from action sequences that transition
the agent between clusters or (2) from frequent action sequences. The former method creates one option
for each connection between clusters, similar to the approach presented in Algorithm 4, where the
shortest path for each transition is learned separately through Dynamic Programming [66]. The
latter method uses statistical analysis to identify frequent action sequences used to reach a given
subgoal; each sequence identified as frequent is assigned to an option whose initiation set is the
starting state of the sequence, whose termination condition is set to 1 for the targeted subgoal state
of that sequence, and whose subpolicy is determined by the sequence
itself. A drawback of this approach is that the adopted statistical analysis method may not prove
useful in large or continuous state spaces, given that the same sequence of actions may hardly
occur a second time. The authors used the 19-Room and the taxi domains to test their method,
showing that it outperforms the basic Q-Learning algorithm.
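
The second option-construction route (frequent action sequences) can be illustrated with a simple fixed-length subsequence count over recorded action trajectories; the window length and frequency threshold below are hypothetical parameters, not values taken from [31].

from collections import Counter

def frequent_action_sequences(action_trajectories, length=4, min_count=10):
    """Return the action subsequences of a given length observed at least min_count times."""
    counts = Counter()
    for actions in action_trajectories:
        for start in range(len(actions) - length + 1):
            counts[tuple(actions[start:start + length])] += 1
    return [seq for seq, c in counts.items() if c >= min_count]
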
Kazemitabar and Beigy [30] present a novel approach where subgoals are identified through
Strongly Connected Components (SCCs). Other skill acquisition approaches have already used
SCCs in some way [12, 25], but the method in [30] relies solely on their identification. First,
Q-Learning is run for a number of iterations in order to obtain a viable Local STG. Second, SCCs
are identified using a linear-time algorithm proposed by the authors. Third, subgoal states are
identified as the states that connect two SCCs, similar to previous approaches that consider
important states to be those that connect different regions of the state
space [13, 46, 48]. The final step consists of creating options for reaching such subgoals, following
a similar procedure to Algorithm 4. This approach is tested in the Two-Room domain, in which
the skilled agent using the proposed approach outperforms the baseline agent.
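
A minimal sketch of the SCC-based subgoal identification follows, using the strongly connected components routine available in NetworkX as a stand-in for the authors' own linear-time procedure [30].

import networkx as nx

def scc_subgoals(g):
    """Subgoal states: sources of edges that connect two different SCCs of the directed Local STG g."""
    sccs = list(nx.strongly_connected_components(g))
    component_of = {v: i for i, component in enumerate(sccs) for v in component}
    return {u for u, v in g.edges if component_of[u] != component_of[v]}
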
A new clustering algorithm created to identify important nodes is presented by Entezari et al.
[21]. The authors propose a new metric that measures the quality of a cluster as the ratio between
(1) the number of connections between the border nodes and the other nodes of the given cluster
and (2) the number of connections of these same border nodes with their neighbors in different
clusters. The metric is given by

R(X) = \frac{\sum_{i \in X,\, j \in b(X)} a_{ij}}{\sum_{i \in n(X),\, j \in b(X)} a_{ij}},    (13)

where b(X) is the set of border nodes of cluster X, n(X) is the set of neighbors of those border
nodes that lie in a different cluster, and a_{ij} is the element (i, j) of the adjacency matrix. The
objective is therefore to maximize R(X). This is accomplished greedily by adding nodes that increase
the value of R(X); that is, node i is included in cluster X if R(X + i) > R(X). The agent based on the
proposed skill acquisition method is tested in the Four-Room domain, once again outperforming
the baseline agent.
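
A minimal sketch of this greedy cluster growth, assuming a symmetric adjacency matrix A stored as a NumPy array and integer node indices; the helper names are ours.

import numpy as np

def border_nodes(A, X):
    """Nodes of cluster X with at least one neighbor outside X."""
    return {i for i in X if any(A[i, j] > 0 and j not in X for j in range(len(A)))}

def r_metric(A, X):
    """R(X) from Eq. (13): internal vs. external connectivity of the border nodes b(X)."""
    b = border_nodes(A, X)
    internal = sum(A[i, j] for i in X for j in b)
    # Summing over all i outside X equals summing over n(X), since a_ij = 0 for non-neighbors.
    external = sum(A[i, j] for i in range(len(A)) if i not in X for j in b)
    return internal / external if external > 0 else np.inf

def grow_cluster(A, seed):
    """Greedily add neighboring nodes while they increase R(X)."""
    X = {seed}
    improved = True
    while improved:
        improved = False
        for i in set(range(len(A))) - X:
            if any(A[i, j] > 0 for j in X) and r_metric(A, X | {i}) > r_metric(A, X):
                X.add(i)
                improved = True
    return X
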
Davoodabadi and Beigy [16] propose a clustering algorithm that blends two different ap-
proaches: Edge Betweenness [54] and Label Propagation [62] algorithms. The former algorithm
iteratively removes the edge with the highest betweenness and analyzes the resulting clusters.
At each iteration, it calculates a measure that indicates the quality of the resulting cluster, called
the modularity measure, given by

Q = \sum_{i} \left( e_{ii} - a_i^{2} \right),    (14)

where e_{ij} is an entry of a k × k matrix (k being the number of existing clusters) that corresponds
to the number of edges connecting clusters i and j, while a_i is the total number of edges between
cluster i and other clusters. The value of Q varies between 0 and 1, where values near 1 indicate a
strong community structure, while 0 indicates a random structure. The latter algorithm starts by
giving a unique label to each node; at each iteration, every node adopts the most common label
among its neighbors, and this is repeated until a stopping criterion is satisfied. The algorithm
proposed in [16] performs the following steps: (1) run the label propagation algorithm and (2) for
each pair of clusters (i, j) resulting from the first step, calculate the modularity variation ΔQ(i, j)
of joining clusters i and j, and then join the pair with the highest ΔQ(i, j). This is repeated
until the modularity stops rising. Subgoals are then identified as being the border nodes
of each cluster. This approach was then compared with the Edge Betweenness and the Label Prop-
agation algorithms for skill acquisition in RL problems, showing that the proposed new algorithm
performed better, since it was capable of achieving better clustering formations. The experimental
results achieved follow the same idea as previous skill acquisition methods, where an agent based
on the proposed method outperforms the baseline agent in the taxi domain.
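
The modularity-guided merging can be sketched as follows, using NetworkX's label propagation and modularity routines in place of Eq. (14) and of the exact procedure in [16]; g is assumed to be an undirected STG.

from itertools import combinations
import networkx as nx
from networkx.algorithms.community import label_propagation_communities, modularity

def merge_by_modularity(g):
    """Start from label-propagation communities and greedily merge pairs while modularity increases."""
    communities = [set(c) for c in label_propagation_communities(g)]
    while len(communities) > 1:
        q_now = modularity(g, communities)
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(communities)), 2):
            merged = (communities[:i] + communities[i + 1:j] + communities[j + 1:]
                      + [communities[i] | communities[j]])
            gain = modularity(g, merged) - q_now
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:                    # modularity stopped rising
            break
        i, j = best_pair
        communities = (communities[:i] + communities[i + 1:j] + communities[j + 1:]
                       + [communities[i] | communities[j]])
    return communities                           # subgoal candidates: border nodes of each community
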
Following an approach similar to the previous work [16], Bacon and Precup [2] use a variation of
the Label Propagation Algorithm (LPA) [62] to separate similar states into clusters and then identify
subgoal states as the border states of these clusters. The novelty lies in the LPA variant considering
a weighted Local STG, where the weight of an edge is equal to the number of times a given
transition was observed. Therefore, the label assigned to a node is not the label occurring most
often among its neighbors, but the label with the highest cumulative edge weight over the node's
neighbors. After clustering the Local STG, the subgoals are defined as being the
border nodes, similar to previous approaches [12, 16]. The proposed skill acquisition method using
LPA is compared with the basic Q-Learning method and with other skill acquisition methods using
betweenness and closeness centralities in the Four-Room domain. The LPA method outperforms
the Q-Learning method but is outperformed by the betweenness and closeness methods. Despite
this, the authors show how to extend their LPA approach to continuous environments.
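
A sketch of the weighted label-propagation variant described above is given below, assuming an undirected NetworkX graph whose edge attribute 'weight' counts observed transitions; this is our rendering of the idea, not the authors' code [2].

import random
from collections import defaultdict

def weighted_label_propagation(g, max_iters=100):
    """Each node adopts the label with the largest cumulative edge weight among its neighbors."""
    labels = {v: v for v in g.nodes}             # every node starts with a unique label
    for _ in range(max_iters):
        changed = False
        nodes = list(g.nodes)
        random.shuffle(nodes)                    # asynchronous updates in random order
        for v in nodes:
            weight_per_label = defaultdict(float)
            for u in g.neighbors(v):
                weight_per_label[labels[u]] += g[v][u].get('weight', 1.0)
            if weight_per_label:
                best = max(weight_per_label, key=weight_per_label.get)
                if best != labels[v]:
                    labels[v] = best
                    changed = True
        if not changed:                          # stopping criterion: labels are stable
            break
    return labels                                # nodes sharing a label form one cluster
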
Taghizadeh and Beigy [69] propose two clustering approaches for skill acquisition: the first is
based on spectral clustering and a variation of the k-means algorithm, called k′-means [72], while
the second approach uses Eigenvector Centrality (EVC) to separate clusters. The presented spectral-
based subgoal discovery performs a spectral clustering over the Local STG in order to obtain a set
of k clusters, with k being a parameter informed beforehand. Since it is difficult to know a priori the
number of clusters in the STG of an RL problem, the authors propose to use the k′-means algorithm
in order to identify clusters in the eigenvector matrix U (Section 3.2.2). The k′-means algorithm
receives as input the maximum number of clusters, given by k, and returns k′ clusters, where k′ < k
and k′ is an approximation to the correct number of clusters. Therefore, the number of clusters
does not need to be informed beforehand. The main drawback of this approach, as pointed out by
the authors, is the high computational cost. Hence, the authors propose a second approach, called
EVC-based subgoal discovery, where the eigenvector centrality of each node is used in order to
separate states into different clusters. To do this, the authors consider the EVC variation over the
graph and categorize states into three possible classes: (1) cluster centers, states with the highest
EVC among their neighbors; (2) cluster members, states with a lower EVC than at least one of their
neighbors and with neighbors only from the same cluster; and (3) cluster borders, states with an
EVC lower than at least one of their neighbors and with neighbors from different clusters. Subgoals
are then identified as being cluster border states, and skills are formed in a similar way as in [13, 14].
Finally, both approaches adopt a skill-pruning technique in order to avoid learning inefficient or
redundant skills: whenever a newly created option has an initiation set whose value function is
higher than that of the subgoal state, the skill is removed. The EVC-
based approach is then compared with other graph-based skill acquisition approaches, namely,
node betweenness [14], strongly connected components [30], edge betweenness, and reform label
propagation [62]; it was shown that the EVC-based approach outperforms the other mentioned
approaches in the Six-Room and taxi domains.
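
The EVC-based state categorization can be sketched as follows: each state climbs the EVC gradient to a local maximum (its cluster center), and non-center states with neighbors attached to a different center are reported as border states (subgoal candidates). This is a simplified reading of the method, using NetworkX's eigenvector centrality.

import networkx as nx

def evc_subgoals(g):
    """Classify states by eigenvector centrality and return (cluster assignment, border states)."""
    evc = nx.eigenvector_centrality(g, max_iter=1000)

    def climb(v):
        # Follow strictly increasing EVC to a local maximum: the cluster center of v.
        while True:
            best = max(g.neighbors(v), key=lambda u: evc[u], default=v)
            if evc[best] <= evc[v]:
                return v
            v = best

    cluster_of = {v: climb(v) for v in g.nodes}
    subgoals = [v for v in g.nodes
                if cluster_of[v] != v                                      # not a cluster center
                and any(cluster_of[u] != cluster_of[v] for u in g.neighbors(v))]
    return cluster_of, subgoals
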
The approach proposed by Mathew et al. [45] and later expanded by Krishnamurthy et al. [33]
uses a spectral clustering algorithm called PCCA+ [74] in order to identify abstract states that
represent a set of states with similar characteristics. The PCCA+ method receives the transition
matrix of a given graph as input and returns a membership function X_{i,j} that quantifies the
membership of a given state s_i in an abstract state S_j. Each connection between two abstract states
S_i and S_j is used to create an option. The transition between abstract states is accomplished by
visiting neighboring states following the positive gradient of the membership function toward the
destination abstract state. The option ends when a state s_y ∈ S_j is reached, S_j being the destination
abstract state. This method is then coupled with any reinforcement learning algorithm in order to learn
the optimal policy over actions and options. Another novel idea presented in [33] is how to
extend the PCCA+ spectral clustering method to define options for Atari games using the Arcade
Learning Environment [6]. Since running the PCCA+ method for high-dimensional problems is
very expensive, the authors adapted the Deep Learning network architecture used for predicting
future frames in Atari games [55] in order to predict future frames using a high-level represen-
tation, reducing the problem dimension and thus making the execution of the PCCA+ algorithm
more feasible. This method is then coupled with the Deep Q-Network (DQN) [51] RL model to
learn when to use each macro-action identified by the proposed algorithm. PCCA+ is compared
with the L-Cut algorithm [15] in the Two-Room and taxi domains, where it manages to identify
fewer subgoals, resulting in a better overall performance. One of the main novelties of this work is
that it is one of the first graph-based skill acquisition approaches to be tested in the ALE domain,
where the PCCA+ method is shown to enhance the overall performance of the learning agent in
the Seaquest Atari game when compared with the results obtained using DQN.
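
Given a precomputed PCCA+ membership matrix, the intra-option behavior described above (hill climbing on the membership of the destination abstract state) can be sketched as follows; chi is assumed to be a NumPy array with chi[s, j] the membership of state s in abstract state j, and states are assumed to be integer node identifiers.

import numpy as np

def membership_step(g, chi, s, target):
    """One option step: move to the neighbor with the largest membership in the target abstract state."""
    neighbors = list(g.neighbors(s))
    return max(neighbors, key=lambda v: chi[v, target]) if neighbors else s

def run_membership_option(g, chi, s, target, max_steps=100):
    """Follow the positive membership gradient until the target abstract state dominates."""
    for _ in range(max_steps):
        if chi[s].argmax() == target:            # s now belongs to the destination abstract state
            return s
        s = membership_step(g, chi, s, target)
    return s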

7 SKILL ACQUISITION THROUGH SPECTRAL ANALYSIS AND OTHER APPROACHES


In this section, we will describe another macro-action formalism that uses graph techniques. We
will also present methods based on spectral analysis of the STG in order to automatically define
skills.
The HEXQ framework, presented by Hengst [25, 26], is a different approach for skill acquisition.
The main characteristic of this approach is that it uses a hierarchy of state variables created by an
expert, where the less frequently changed a state variable is, the higher its position in the hierarchy.
Each level of this hierarchy is associated with a sub-MDP that disregards the variable associated
with its level. The idea is to learn an optimal policy for each sub-MDP, which is simpler than
the original MDP. The states of a sub-MDP are transformed to abstract states that correspond to
strongly connected components. Whenever a state transition inside a sub-MDP changes the value
of the variable of the corresponding level, this transition is called an exit and it also serves as a
subgoal. The macro-actions are therefore a sequence of actions that lead the agent to an exit state.
This allows the creation of several macro-actions, each one associated with a level in the hierarchy,
facilitating the learning procedure. This approach, however, presents a drawback similar to the one
discussed in [48]: the sub-MDPs consider all transitions beforehand, making it infeasible for large,
complex problems. Furthermore, this framework requires a state hierarchy developed beforehand
by an expert, which may also prove infeasible for large problems. The authors later improve the
HEXQ framework by allowing the state hierarchy to be constructed automatically [27, 58]. To do
so, the agent automatically identifies which state variables are more frequently changed by simply
interacting with the environment. It then assigns the most frequently changed state variables to
lower levels in the hierarchy, until all variables are accounted for. This framework is then tested
in a simple grid-world environment.
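
The automatic construction of the variable hierarchy essentially ranks state variables by how often they change along observed trajectories, which can be sketched as below; the representation of states as tuples of variable values is our assumption.

from collections import Counter

def rank_state_variables(trajectories):
    """Order state-variable indices from most to least frequently changed (lowest HEXQ level first)."""
    change_counts = Counter()
    for traj in trajectories:                    # each trajectory: a list of states (tuples of variables)
        for s, s_next in zip(traj, traj[1:]):
            for var, (a, b) in enumerate(zip(s, s_next)):
                if a != b:
                    change_counts[var] += 1
    n_vars = len(trajectories[0][0]) if trajectories and trajectories[0] else 0
    return sorted(range(n_vars), key=lambda v: -change_counts[v])
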
A novel approach for approximating value functions is presented in [39], where value functions
are approximated by a sum of proto-value functions. Proto-value functions are a set of functions
f : S → R that map each state to a real value, in a similar way to value functions (Section 2),
and are represented by the eigenvectors e_1, . . . , e_{|S|} of the Laplacian matrix L derived from the
adjacency matrix A (see Section 3.2.2 for the background), which is associated with the STG G of
an RL problem. Therefore, it is shown that the value function V^π can be approximated according to

V^{\pi} = \alpha_1 e_1 + \cdots + \alpha_{|S|} e_{|S|},    (15)

where α is a vector of size |S| that must be learned. This idea was then used by Osentoski
and Mahadevan [56]. Using the MAXQ framework [18] to construct macro-actions, the authors
propose a method for creating different state abstractions for each subtask found. The idea here
is that not all state variables are useful for all macro-actions. In some cases, in order to learn the
optimal policy for a macro-action, it is desirable to remove some nonimportant state variables,
although these same state variables may be useful for other macro-actions. The approach works
as follows: First, it is necessary to build an STG based on a state transition history of the agent
for the past n episodes. The authors then propose a graph reduction method capable of joining
states (represented by nodes) with similar connections and with only one different state variable
into a single abstract state that is represented without this state variable. This results in a graph
with abstract states represented by different sets of state variables. Proto-value functions are then
used over each graph G_i representing a subtask i in order to obtain the basis function ϕ_i(s, a). The
basis function is given by the first k eigenvectors of the Laplacian of graph G_i associated with the
smallest eigenvalues, where k is the number of state variables associated with subtask i.
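
For a tabular STG, the proto-value-function approximation of Eq. (15) amounts to projecting the value function onto the low-order eigenvectors of the Laplacian, as in the least-squares sketch below; v_samples is assumed to hold value estimates for all states, and the truncation to k eigenvectors is a common simplification of the full expansion.

import numpy as np

def proto_value_basis(A, k):
    """First k eigenvectors (smallest eigenvalues) of the combinatorial Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :k]                        # columns e_1, ..., e_k

def fit_value_function(A, v_samples, k=10):
    """Least-squares fit of alpha in V ≈ alpha_1 e_1 + ... + alpha_k e_k (cf. Eq. (15))."""
    E = proto_value_basis(A, k)
    alpha, *_ = np.linalg.lstsq(E, v_samples, rcond=None)
    return E @ alpha                             # approximate value function over all states
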
Value function approximation through proto-value functions was also recently used by Machado
et al. [37] in order to automatically identify options. Each proto-value function, represented by an
eigenvector e_i of L, is used to create a reward function for a given option, called an eigenpurpose
by the authors. After training, each eigenpurpose gives rise to a policy, called an eigenbehavior, that
maps states to actions. Instead of adding the learned options to the agent's
action set and learning a policy over options, the authors take a different approach, where options
are executed randomly (until completion) in order to better explore the state space, allowing the
agent to reach otherwise inaccessible state regions. It is important to note that options are only
defined after an initial exploratory step where the agent builds the STG, allowing the construction
of the Laplacian matrix L. The authors show through a series of experiments that this approach
achieves good results and that the options found are not always centered on the idea of reaching
bottleneck states, as done by many previous works [22, 46, 48, 61]. It is also shown that the options
found by that method perform better than bottleneck options. The authors conclude their work
by presenting how to use their method with function approximation for large state spaces: instead
of using an adjacency matrix (from which the Laplacian matrix is derived), the authors use a matrix
T ∈ R^{t×d}, where t is the number of transitions observed during the sampling process and d is the
number of state features. Each row of T stores an observed transition as the difference between the
feature vectors of the previous and current states. The eigenvectors associated with T are then used
to create a set of eigenpurposes and eigenbehaviors. This approach is then coupled with a
Deep Reinforcement Learning method (Section 2.1) and tested in the Arcade Learning Environment,
showing promising results.
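
The feature-based construction described above can be sketched by stacking the observed feature differences into T and taking its principal right singular vectors as eigenpurposes (the intrinsic reward then scores how much a transition moves along each such direction); the reward shaping and the subsequent option training are omitted here.

import numpy as np

def eigenpurposes(transitions, num_options):
    """transitions: list of (phi_s, phi_s_next) feature-vector pairs observed while sampling."""
    T = np.stack([phi_next - phi for phi, phi_next in transitions])   # T in R^{t x d}
    _, _, vt = np.linalg.svd(T, full_matrices=False)                  # right singular vectors of T
    return vt[:num_options]                                           # one eigenpurpose per row

def eigenpurpose_reward(e, phi, phi_next):
    """Intrinsic reward of eigenpurpose e for a transition with feature change phi_next - phi."""
    return float(e @ (phi_next - phi))
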
The previous work was later improved by using the Successor Representation (SR) [17]. In brief,
the SR indicates the expected future state occupancy for a given state: it is a function that maps a
state pair (s, s′) to a real number. The value function can then be decomposed as a product between
the SR of (s, s′) and the reward received during this transition. The SR framework was then extended by
Barreto et al. [4] to problems with nonenumerated states that use function approximation, called
the Successor Features (SF). Stachenfeld et al. [65] showed that there is a strong relationship be-
tween the eigenvectors of the matrix representation of the SR and the eigenvectors of the Laplacian
associated with the STG. Using this relationship, Machado et al. [38] improved their Laplacian Frame-
work [37] by using SR (or SF when using function approximation) as a feature representation of
a state, instead of using raw pixel data. The authors augmented the SF representation by defining
SF over a lower-dimensional representation of a state given by the state representation learned by
the NN responsible for predicting future images of a given Atari state [55], as previously used by
Krishnamurthy et al. [33]. Finally, each row of the transition matrix T is now given by the SF rep-
resentation of a given state. Options are then defined similarly to the Laplacian Framework [37].
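
In the tabular case, the SR of a fixed policy has a closed form in terms of the policy's state-transition matrix, as the brief sketch below illustrates; this is the standard construction from [17] rather than anything specific to the option-discovery methods above.

import numpy as np

def successor_representation(P, gamma=0.99):
    """Closed-form SR: M = (I - gamma * P)^{-1}, with P[s, s'] the policy's transition probability.

    M[s, s'] is the expected discounted number of future visits to s' when starting from s."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

def value_from_sr(M, r):
    """Decompose the value function as V = M r, with r the expected one-step reward per state."""
    return M @ r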

8 CHALLENGES AND OPEN ISSUES


RL is a large and very active research area with several different approaches. Skill acquisition
emerges as an important subarea within the RL domain, allowing RL agents to restructure their
action set using high-level macro-actions (skills), potentially resulting in a better performance.
One way to automatically discover skills in the RL setting is to use graph-based techniques over
the STG in order to identify important subgoal states and transitions. This survey discussed several
different graph-based skill acquisition methods, including centrality-based, clustering-based, and
spectral analysis approaches. We showed how these graph-based methods evolved through time,
adapting to more complex problems and current advances in the area, such as Deep Reinforcement
Learning (Section 2.1).

Initially, many graph-based skill acquisition methods worked directly with the Global STG of an
RL problem [12, 48], since these methods were only applied to relatively simple RL problems. As
more complex problems appeared, researchers started focusing on graph-based methods capable
of working with a Local STG, therefore requiring only the agent's transition history [13, 42, 61]. Many
other methods started to use an abstract STG, derived from the Local STG, in order to work with
smaller graphs [33, 56].
Currently, with the appearance of even more complex problems (Atari games, general-purpose
AI [5], autonomous vehicles, etc.), RL methods have become more demanding [50, 51], requiring much
more data in order to efficiently learn how to solve a given problem. This results in increasingly
large Local STGs, to the point where it becomes infeasible to create a graph using the whole transi-
tion history. This is a major challenge for graph-based skill acquisition methods that must be
properly addressed; otherwise, graph-based methods will fail to properly define skills in current
complex problems. Therefore, it is necessary to search for new approaches for creating an abstract
STG that effectively represents the environment dynamics without requiring the construction of a
fully detailed Local STG. These approaches must, for example, merge the state an RL agent is
currently visiting with a previously visited state that has similar characteristics (e.g., via an
approximate state representation), resulting in the construction of an abstract STG while the agent
is transitioning between states.
Spectral analysis allows the identification of not only skills but also their respective intrapoli-
cies, achieved through the approximation of the value function for a given reinforcement learning
problem [37]. Other interesting spectral analysis methods are also available in the literature; al-
though they focus on different domains, it may be possible to adapt them to the context of skill
acquisition. For example, the approach presented in [75] to identify critical nodes for network ro-
bustness can be easily adapted to the skill acquisition problem. This method is used for finding
critical nodes in a complex network by using a local scoring function for each node that measures
how critical the corresponding node is to the network robustness. The score given to a node v is
associated with the spectral gap λ_2 [52] of the subnetwork formed by the nodes reachable within
h hops from node v. This score lies in the interval [0, 1], where a score S_v = 1 indicates that node v is
the most locally critical node in its neighborhood. This also allows navigating from any node to
a critical node by simply moving to the neighbor with the highest score. Therefore, this approach
can be easily adopted for skill acquisition, where critical nodes represent subgoals and the
navigation process corresponds to skills whose intrapolicies are already defined.
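
A rough sketch of this adaptation is shown below, scoring each node by the algebraic connectivity (λ_2) of its h-hop neighborhood as computed by NetworkX; the normalization of the score into [0, 1] and the distributed computation of the original method [75] are omitted.

import networkx as nx

def local_spectral_scores(g, h=2):
    """Score each node of an undirected graph by lambda_2 of its h-hop neighborhood (requires SciPy)."""
    scores = {}
    for v in g.nodes:
        ball = nx.ego_graph(g, v, radius=h)      # nodes reachable within h hops of v
        scores[v] = nx.algebraic_connectivity(ball) if ball.number_of_nodes() > 1 else 0.0
    return scores

def step_toward_critical(g, scores, v):
    """Intra-option step: move to the neighbor with the highest score (toward a critical node/subgoal)."""
    return max(g.neighbors(v), key=lambda u: scores[u], default=v)
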
Graph-based skill acquisition applied to complex RL problems results in potentially very large
Local STGs, as we previously mentioned. To deal with such graphs, one can
adopt solutions for processing large-scale graphs that recently flourished in the literature [28, 36,
40]. This allows the graph-based skill acquisition methods to use a more complete STG, helping the
learning agent to achieve a better performance. However, this approach would only be justifiable in
cases where the agent’s learning curve is heavily impacted by skill acquisition methods. Otherwise,
the extra processing time might not be worth it. Another approach is to conceive other graph
processing methods capable of reducing the Local STG, or even new approaches to directly create
reduced STGs, as mentioned earlier in this section. The idea is to allow the agent to
use smaller STGs that efficiently approximate the Global STG. These contrasting approaches are
still underdeveloped, with plenty of room for improvement.
As a final remark, a common problem not only in the machine-learning area but also in
many research areas is how to effectively compare one proposed method with other previous
alternatives. This becomes an especially complicated challenge when there is no established frame-
work in the area capable of providing an efficient test bench for researchers. This problem is also
present in the skill acquisition area, where (1) each researcher uses different problems to test his or
her proposed methods or, in other cases, (2) uses different implementations of the same problem
to compare their method with others. As a consequence of the former problem, authors have difficul-
ties in effectively comparing their methods with other methods in the area, given that there is no
established benchmark or baseline problem. With each method running for different problems, a
direct comparison between concurrent methods becomes a challenge. The latter problem refers to
implementation differences of the same method, given that some papers in the area do not fully de-
scribe their experimental setup, making it difficult to replicate their results. However, this scenario
is slowly changing at least in the RL area with the proposal of different benchmarks that imple-
ment several complex problems, such as the Arcade Learning Environment [6], the benchmark for
continuous control [20], OpenAI Gym [8], and, more recently, the DeepMind Lab [5]. Therefore, it
is important that future work related to skill acquisition adopts such benchmarks in order to allow
a fair and easy comparison of any proposal against previous methods.

9 SUMMARY AND OUTLOOK


Skill acquisition allows the RL agent to better understand the environment it is immersed in, con-
tributing to its learning process and impacting its performance. This survey showed several graph-
based skill acquisition approaches for RL, detailing their evolution and adaptation to current com-
plex RL problems, as well as their main differences and categorization. We can see through this
survey that the first graph-based skill acquisition approaches were focused on identifying impor-
tant states through centrality measures (Section 5). These first approaches proved useful in
simple grid-world problems, but as new and more complex applications emerged, the focus in the
area shifted to clustering-based approaches (Section 6). Clustering-based methods proved
more useful, but it was only with the introduction of methods based on spectral analysis that the
graph-based skill acquisition methods thrived in recent Deep Reinforcement Learning applica-
tions [33, 37].
Indeed, with RL problems becoming increasingly more complex, graph-based skill acquisition
methods must adapt either to deal with very large graphs or to create graph abstractions able
to generate simpler yet representative STGs. As shown in recent work [33, 37], spectral analysis
corresponds to an attractive approach for graph-based skill acquisition for RL. This stems from
the ability of spectral analysis to uncover important information that helps RL methods by simply
analyzing the underlying STG of an RL problem. If enough abstractions are made while preserving
the representativeness of the STG, these methods can yield promising results even for complex problems.

REFERENCES
[1] Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The option-critic architecture. In Proceedings of the AAAI
Conference on Artificial Intelligence. 1726–1734.
[2] Pierre-Luc Bacon and Doina Precup. 2013. Using label propagation for learning temporally abstract actions in rein-
forcement learning. In Proceedings of the Workshop on Multiagent Interaction Networks. 1–7.
[3] Albert-László Barabási. 2016. Network Science. Cambridge University Press.
[4] Andre Barreto, Will Dabney, Remi Munos, Jonathan J. Hunt, Tom Schaul, David Silver, and Hado P. van Hasselt.
2017. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems
(NIPS’17). Curran Associates, 4058–4068.
[5] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq,
Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain,
Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. 2016. DeepMind lab.
arXiv preprint arXiv:1612.03801 (2016).
[6] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. 2013. The arcade learning environment: An
evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 1 (May 2013), 253–279.
[7] Katy Börner, Soma Sanyal, and Alessandro Vespignani. 2007. Network science. Annual Review of Information Science
and Technology 41, 1 (2007), 537–607.

[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba.
2016. OpenAI gym. arXiv preprint arXiv:1606.01540 (2016).
[9] Fei Chen, Yang Gao, Shifu Chen, and Zhenduo Ma. 2007. Connect-based subgoal discovery for options in hierarchical
reinforcement learning. In Proceedings of the International Conference on Natural Computation (ICNC’07), Vol. 4. 698–
702. DOI:https://doi.org/10.1109/ICNC.2007.312
[10] Cc Chiu and Von-Wun Soo. 2011. Subgoal identifications in reinforcement learning: A survey. Advances in Reinforce-
ment Learning (2011), 181–188. DOI:https://doi.org/10.5772/13214
[11] Ozgür Şimşek. 2008. Behavioral Building Blocks for Autonomous Agents: Description, Identification, and Learning. Ph.D.
Dissertation. University of Massachusetts.
[12] Ozgür Şimşek and A. Barto. 2007. Betweenness centrality as a basis for forming skills. Technical Report, University of
Massachusetts, Department of Computer Science.
[13] Özgür Şimşek and Andrew G. Barto. 2004. Using relative novelty to identify useful temporal abstractions in rein-
forcement learning. In Proceedings of the International Conference on Machine Learning (ICML’04). ACM, New York,
95–103. DOI:https://doi.org/10.1145/1015330.1015353
[14] Ozgür Şimşek and Andrew G. Barto. 2008. Skill characterization based on betweenness. In Advances in Neural Infor-
mation Processing Systems (NIPS’08), 1497–1504.
[15] Özgür Şimşek, Alicia P. Wolfe, and Andrew G. Barto. 2005. Identifying useful subgoals in reinforcement learning by
local graph partitioning. In Proceedings of the International Conference on Machine Learning (ICML’05). ACM, New
York, 816–823. DOI:https://doi.org/10.1145/1102351.1102454
[16] Marzieh Davoodabadi and Hamid Beigy. 2011. A new method for discovering subgoals and constructing options in
reinforcement learning. In Indian International Conference on Artificial Intelligence (IICAI’11). 441–450.
[17] Peter Dayan. 1993. Improving generalization for temporal difference learning: The successor representation. Neural
Computation 5, 4 (1993), 613–624.
[18] Thomas G. Dietterich. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition.
Journal of Artificial Intelligence Research 13, 1 (Nov. 2000), 227–303.
[19] Bruce L. Digney. 1998. Learning hierarchical control structures for multiple tasks and changing environments. From
Animals to Animats 5: Proceedings of the International Conference on the Simulation of Adaptive Behavior. 321–330.
[20] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. 2016. Benchmarking deep reinforcement
learning for continuous control. In Proceedings of the International Conference on International Conference on Machine
Learning (ICML’16), Vol. 48. 1329–1338.
[21] Negin Entezari, Mohammad Ebrahim Shiri, and Parham Moradi. 2010. A local graph clustering algorithm for discov-
ering subgoals in reinforcement learning. In Communications in Computer and Information Science. Springer, Berlin,
41–50. DOI:https://doi.org/10.1007/978-3-642-17604-3_5
[22] Sandeep Goel and Manfred Huber. 2003. Subgoal discovery for hierarchical reinforcement learning using learned
policies. In Proceedings of the International Florida Artificial Intelligence Research Society Conference, Ingrid Russell
and Susan M. Haller (Eds.). AAAI Press, 346–350.
[23] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS’14).
Curran Associates, 2672–2680.
[24] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural
network for image generation. In Proceedings of the International Conference on Machine Learning (Proceedings of
Machine Learning Research), Francis Bach and David Blei (Eds.), Vol. 37. PMLR, Lille, France, 1462–1471.
[25] Bernhard Hengst. 2002. Discovering hierarchy in reinforcement learning with HEXQ. In Proceedings of the Interna-
tional Conference on Machine Learning (ICML’02). Morgan Kaufmann Publishers, San Francisco, 243–250.
[26] Bernhard Hengst. 2003. Discovering Hierarchy in Reinforcement Learning. Ph.D. Dissertation. University of New South
Wales.
[27] Bernhard Hengst. 2004. Model approximation for HEXQ hierarchical reinforcement learning. In Proceedings of
the European Conference on Machine Learning (ECML’04). Springer, Berlin,144–155. DOI:https://doi.org/10.1007/
978-3-540-30115-8_16
[28] Alexandru Iosup, Tim Hegeman, Wing Lung Ngai, Stijn Heldens, Arnau Prat-Pérez, Thomas Manhardto, Hassan
Chafio, Mihai Capotă, Narayanan Sundaram, Michael Anderson, Ilie Gabriel Tănase, Yinglong Xia, Lifeng Nai, and
Peter Boncz. 2016. LDBC graphalytics: A benchmark for large-scale graph analysis on parallel and distributed plat-
forms. Proceedings of the VLDB Endowment 9, 13 (2016), 1317–1328. DOI:https://doi.org/10.14778/3007263.3007270
[29] Andrej Karpathy and Li Fei-Fei. 2017. Deep visual-semantic alignments for generating image descriptions. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 39, 4 (2017), 664–676.
[30] Seyed Jalal Kazemitabar and Hamid Beigy. 2009. Using strongly connected components as a basis for autonomous skill
acquisition in reinforcement learning. In Proceedings of the International Symposium on Neural Networks (ISNN’09).
Springer, Berlin, 794–803. DOI:https://doi.org/10.1007/978-3-642-01507-6_89

[31] Ghorban Kheradmandian and Mohammad Rahmati. 2009. Automatic abstraction in reinforcement learning using data
mining techniques. Robotics and Autonomous Systems 57, 11 (2009), 1119–1128. DOI:https://doi.org/10.1016/j.robot.
2009.07.002
[32] George Konidaris and Andrew Barto. 2009. Skill discovery in continuous reinforcement learning domains using skill
chaining. In Advances in Neural Information Processing Systems (NIPS’09), Y. Bengio, D. Schuurmans, J. Lafferty, C.
Williams, and A. Culotta (Eds.). 1015–1023.
[33] Ramnandan Krishnamurthy, Aravind S. Lakshminarayanan, Peeyush Kumar, and Balaraman Ravindran. 2016. Hier-
archical reinforcement learning using spatio-temporal abstractions and deep neural networks. In Proceedings of the
International Conference on Machine Learning.
[34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural
networks. In Advances in Neural Information Processing Systems (NIPS’12). Curran Associates, 1097–1105.
[35] Long-Ji Lin. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine
Learning 8, 3 (1992), 293–321. DOI:https://doi.org/10.1007/BF00992699
[36] Yi Lu, James Cheng, Da Yan, and Huanhuan Wu. 2014. Large-scale distributed graph computing systems: An experi-
mental evaluation. Proceedings of the VLDB Endowment 8, 3, 281–292. DOI:https://doi.org/10.14778/2735508.2735517
[37] Marlos C. Machado, Marc G. Bellemare, and Michael H. Bowling. 2017. A Laplacian framework for option discovery
in reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML’17). 2295–2304.
[38] Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. 2017.
Eigenoption discovery through the deep successor representation. CoRR abs/1710.11089 (2017). Retrieved from
http://arxiv.org/abs/1710.11089.
[39] Sridhar Mahadevan. 2005. Proto-value functions: Developmental reinforcement learning. In Proceedings of the Inter-
national Conference on Machine Learning (ICML’05). ACM, New York, 553–560. DOI:https://doi.org/10.1145/1102351.
1102421
[40] Grzegorz Malewicz, Matthew H. Austern, and Aart J. C. Bik. 2010. Pregel: A system for large-scale graph processing.
In Proceedings of the ACM SIGMOD International Conference on Management of Data. 135–145.
[41] Daniel J. Mankowitz, Timothy A. Mann, and Shie Mannor. 2016. Adaptive skills adaptive partitions (ASAP). In Ad-
vances in Neural Information Processing Systems (NIPS’16), D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett (Eds.). Curran Associates, 1588–1596.
[42] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. 2004. Dynamic abstraction in reinforcement learning via
clustering. In Proceedings of the International Conference on Machine Learning (ICML’04). ACM, New York, 71–78.
DOI:https://doi.org/10.1145/1015330.1015355
[43] Oded Maron. 1998. Learning from Ambiguity. Ph.D. Dissertation. Massachusetts Institute of Technology.
[44] Oded Maron and Tomás Lozano-Pérez. 1998. A framework for multiple-instance learning. In Proceedings of the Con-
ference on Advances in Neural Information Processing Systems (NIPS’98). MIT Press, Cambridge, MA, 570–576.
[45] Vimal Mathew, Peeyush Kumar, and Balaraman Ravindran. 2012. Abstraction in reinforcement learning in terms of
metastability. In Proceedings of the European Workshop on Reinforcement Learning (EWRL’12).1–14.
[46] Amy McGovern and Andrew G. Barto. 2001. Automatic discovery of subgoals in reinforcement learning using diverse
density. In Proceedings of the International Conference on Machine Learning, 361–368.
[47] Amy McGovern, Richard S. Sutton, and Andrew H. Fagg. 1997. Roles of macro-actions in accelerating reinforcement
learning. In Grace Hopper Celebration of Women in Computing. 13–18.
[48] Ishai Menache, Shie Mannor, and Nahum Shimkin. 2002. Q-cut - dynamic discovery of sub-goals in reinforce-
ment learning. Proceedings of the European Conference on Machine Learning, 295–306. DOI:https://doi.org/10.1007/
3-540-36755-1_25
[49] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. Extensions of recur-
rent neural network language model. In IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP’11). 5528–5531.
[50] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Sil-
ver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the
International Conference on Machine Learning (Proceedings of Machine Learning Research), Maria Florina Balcan and
Kilian Q. Weinberger (Eds.), Vol. 48. PMLR, New York, 1928–1937.
[51] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves,
Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis
Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level
control through deep reinforcement learning. Nature 518, 7540 (Feb. 2015), 529–533.
[52] Bojan Mohar. 1997. Some Applications of Laplace Eigenvalues of Graphs. Springer Netherlands, Dordrecht, 225–275.
DOI:https://doi.org/10.1007/978-94-015-8937-6_6

[53] Parham Moradi, Mohammad Ebrahim Shiri, and Negin Entezari. 2010. Automatic skill acquisition in reinforcement
learning agents using connection bridge centrality. In Communications in Computer and Information Science. Springer,
Berlin, 51–62. DOI:https://doi.org/10.1007/978-3-642-17604-3_6
[54] Mark E. J. Newman and Michelle Girvan. 2004. Finding and evaluating community structure in networks. Physical
Review E 69, 2 (2004), 15. DOI:https://doi.org/10.1103/PhysRevE.69.026113
[55] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder Singh. 2015. Action-conditional video pre-
diction using deep networks in atari games. In Advances in Neural Information Processing Systems (NIPS’15), C. Cortes,
N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, 2863–2871.
[56] Sarah Osentoski and Sridhar Mahadevan. 2010. Basis function construction for hierarchical reinforcement learning. In
Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS’10). International
Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 747–754.
[57] Ronald Edward Parr. 1998. Hierarchical Control and Learning for Markov Decision Processes. Ph.D. Dissertation. Uni-
versity of California, Berkeley.
[58] Duncan Potts and Bernhard Hengst. 2004. Concurrent discovery of task hierarchies. In AAAI Spring Symposium on
Knowledge Representation and Ontology for Autonomous Systems. 1–8.
[59] Doina Precup. 2000. Temporal Abstraction in Reinforcement Learning. Ph.D. Dissertation. University of Massachusetts.
[60] Martin L. Puterman. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
[61] Ali Ajdari Rad, Martin Hasler, and Parham Moradi. 2010. Automatic skill acquisition in reinforcement learning using
connection graph stability centrality. In Proceedings of IEEE International Symposium on Circuits and Systems. 697–700.
DOI:https://doi.org/10.1109/ISCAS.2010.5537485
[62] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. 2007. Near linear time algorithm to detect community
structures in large-scale networks. Physical Review E 76, 3 (2007), 11. DOI:https://doi.org/10.1103/PhysRevE.76.036106
[63] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with
region proposal networks. In Advances in Neural Information Processing Systems (NIPS’15). Curran Associates, 91–99.
[64] Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence 22, 8 (Aug. 2000), 888–905. DOI:https://doi.org/10.1109/34.868688
[65] Kimberly L. Stachenfeld, Matthew Botvinick, and Samuel J. Gershman. 2014. Design principles of the hippocampal
cognitive map. In Advances in Neural Information Processing Systems (NIPS’14). Curran Associates, 2528–2536.
[66] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.
[67] Richard S. Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal
abstraction in reinforcement learning. Artificial Intelligence 112 (1999), 181–211.
[68] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vin-
cent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Computer Vision and Pattern
Recognition (CVPR’15). 1–12.
[69] Nasrin Taghizadeh and Hamid Beigy. 2013. A novel graphical approach to automatic abstraction in reinforcement
learning. Robotics and Autonomous Systems 61, 8 (2013), 821–835. DOI:https://doi.org/10.1016/j.robot.2013.04.010
[70] Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, and Koray
Kavukcuoglu. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing
Systems (NIPS’16), D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, 3486–
3494.
[71] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray
Kavukcuoglu. 2017. FeUdal networks for hierarchical reinforcement learning. In Proceedings of the International Con-
ference on Machine Learning (ICML’17). 1–12.
[72] Krista Rizman Žalik. 2008. An efficient K’-Means clustering algorithm. Pattern Recognition Letters 29, 9 (2008), 1385–
1391. DOI:https://doi.org/10.1016/j.patrec.2008.02.014
[73] Christopher J. C. H. Watkins. 1989. Learning from Delayed Rewards. Ph.D. Dissertation. King’s College, Cambridge,
UK.
[74] Marcus Weber, Wasinee Rungsarityotin, and Alexander Schliep. 2004. Perron cluster analysis and its connection to
graph partitioning for noisy data. Technical Report, Zuse Institute Berlin (ZIB).
[75] Klaus Wehmuth and Artur Ziviani. 2011. Distributed location of the critical nodes to network robustness based
on spectral analysis. In Latin American Network Operations and Management Symposium (LANOMS’11). 1–8. DOI:
https://doi.org/10.1109/LANOMS.2011.6102259

Received January 2018; revised August 2018; accepted October 2018
