
Intelligent Service Robotics (2021) 14:773–805

https://doi.org/10.1007/s11370-021-00398-z

REVIEW ARTICLE

A survey on deep learning and deep reinforcement learning in robotics with a tutorial on deep reinforcement learning
Eduardo F. Morales1,2 · Rafael Murrieta-Cid1 · Israel Becerra1,3 · Marco A. Esquivel-Basaldua1

Received: 30 August 2021 / Accepted: 18 October 2021 / Published online: 16 November 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
This article is about deep learning (DL) and deep reinforcement learning (DRL) works applied to robotics. Both tools have
been shown to be successful in delivering data-driven solutions for robotics tasks, as well as providing a natural way to develop
an end-to-end pipeline from the robot’s sensing to its actuation, passing through the generation of a policy to perform the given
task. These frameworks have been proven to be able to deal with real-world complications such as noise in sensing, imprecise
actuation, variability in the scenarios where the robot is being deployed, among others. Following that vein, and given the
growing interest in DL and DRL, the present work starts by providing a brief tutorial on deep reinforcement learning, where
the goal is to understand the main concepts and approaches followed in the field. Later, the article describes the main, recent,
and most promising approaches of DL and DRL in robotics, with sufficient technical detail to understand the core of the
works and to motivate interested readers to initiate their own research in the area. Then, to provide a comparative analysis, we
present several taxonomies in which the references can be classified, according to high-level features, the task that the work
addresses, the type of system, and the learning techniques used in the work. We conclude by presenting promising research
directions in both DL and DRL.

Keywords Deep learning · Deep reinforcement learning · Mobile robotics · Mobile robot navigation · Motion planning ·
Mobile manipulation

Israel Becerra
israelb@cimat.mx

Eduardo F. Morales
eduardo.morales@cimat.mx; emorales@inaoep.mx

Rafael Murrieta-Cid
murrieta@cimat.mx

Marco A. Esquivel-Basaldua
marco.esquivel@cimat.mx

1 Centro de Investigación en Matemáticas (CIMAT), Guanajuato, Mexico
2 Instituto Nacional de Astrofísica Óptica y Electrónica (INAOE), Tonantzintla, Mexico
3 Consejo Nacional de Ciencia y Tecnología (CONACyT), Mexico City, Mexico

1 Introduction

In robotics, there have been two main approaches to achieve autonomy. The first one is the so-called model-based approach, in which a model is built; the goal of the model is to predict the behavior of the system as time elapses. This type of approach is classic; an example is Newton's laws. The other one is learning, which is more recent; this is typically a model-free approach, whose performance is based on data. In general, learning is the ability to improve performance in a given task through experience, and it is an inherent capability of artificial intelligence. According to [161], model-based approaches are significantly more data-efficient, owing to their smaller number of parameters. The optimization in model-based approaches can be done efficiently, but the basin of convergence is small. In contrast, solutions obtained with learning can have large basins of convergence. However, a drawback is that they often do not perform well if applied in a regime outside the training data. Robotics is an area in which learning is particularly challenging since errors can lead to potentially catastrophic results; for instance, in a self-driving car an accident may result in fatalities. Hence, room for a trial-and-error approach is very often inadmissible. Nonetheless, learning can also be done through simulations; the difficulty is then to transfer the learning to physical-world scenarios. Another difficulty of learning in robotics is the large number of possible indoor and outdoor scenarios in which the task could be done.
This paper is about deep learning (DL) and deep reinforcement learning (DRL) works applied to robotics; the included works have achieved important results in recent years. The objective of this paper is threefold. The first objective is to present a summary of the primary, general, recent, and most promising algorithms in the field, at the level of detail of a brief tutorial, in the sense of providing a short and shallow course that teaches the main fundamentals needed to carry out a reinforcement learning solution for a robotics task. The second objective is to present recent works that contain interesting results and promising ideas for the field. Those works are presented with a sufficient level of technical mathematical detail so that readers should be able to understand the core of the works and, if they wish, initiate their own research in the area. The third and final objective is to propose promising research directions in this area.

This paper is organized as follows: In Sect. 2, we describe previous surveys on machine learning and robotics. In Sect. 3, a brief tutorial on deep reinforcement learning is given to understand the current approaches proposed in the area. Section 4 describes some relevant work on deep learning and deep reinforcement learning applied to robotics. Section 5 provides different taxonomies in which the works are classified to offer the reader a comparative standpoint. Promising research directions are presented in Sect. 6, and conclusions are given in Sect. 7.

2 Previous surveys

Several surveys related to this paper have been presented in the past. The work in [87] presents a survey of reinforcement learning in robotics. The authors say that "reinforcement learning offers robotics a framework and a set of tools for the design of sophisticated and hard-to-engineer behaviors". With that work, the authors strengthen the links between the learning and robotics research communities. The work is very educational since it presents techniques with enough mathematical detail and provides a concise introduction to reinforcement learning. The work also presents a specific case study called the ball in a cup, in which the goal is to put a ball (which is attached to a string) inside a cup by pulling the string. The main drawback of [87] is that it does not include recent techniques such as deep learning and deep reinforcement learning, or methods that combine predictive control with deep learning. The present work seeks to fill those gaps.

The work in [161] presents a recent approach to deep learning in robotics. It focuses on distinguishing the types of problems that arise in computer vision and robotics. The authors of [161] also explain the need for better evaluation metrics, and they stress the importance and unique challenges of deep robotic learning in simulation. The work in [161] mainly focuses on proposing new research directions in the field and establishes the limits of deep learning in robotics. However, [161] does not present mathematical descriptions. As we have said before, two of the objectives of this work are to provide a sufficient level of technical mathematical detail about the techniques used in this area—in the form of a brief tutorial—and a more technically detailed description of the summarized works.

In [40], the authors analyze the current state of machine learning applied to robotic behaviors. They present a broad overview of robots' behaviors; however, [40] mainly focuses on studying humanoid robots, legged robots, and robotic arms, while the present paper tries to cover a broader family of mobile robots. The authors of [40] also propose a classification of behaviors and draw conclusions about what can be learned and what should be learned. Finally, the work in [40] presents an outlook on five challenging problems in the field. Again, the work in [40] does not present a technical mathematical description of the employed techniques either.

More recently, the work in [83] presents a survey on model structures and training strategies used in deep learning in robotics. The survey provides a categorization of major challenges in robotics that leverage DL technologies, while here we present a categorization of relevant works based on their domain of application. Moreover, despite the fact that [83] provides deep reinforcement learning content, [83] is mainly focused on deep learning approaches, while in the present survey deep reinforcement learning plays a major role.

Finally, it is worth mentioning that deep learning and deep reinforcement learning in robotics are fast-paced fields; thus, there is a constant need to compile the most recent advances in the area.

3 A brief tutorial on deep reinforcement learning

In this section, the basic elements of a Markov decision process (MDP) are first introduced, followed by the description of some basic RL algorithms that can be used to solve MDPs (Sect. 3.1). These initial algorithms assume that the domain can be described with a finite, and relatively small, set of states and actions. However, more realistic scenarios may involve a large number of states and actions or even continuous domains, where we need to rely on function approximation. Section 3.2 describes how to use function approximation within RL to solve MDPs. The rest of this section (Sect. 3.3) is devoted to showing how deep learning algorithms can be used, in conjunction with RL, to solve complex sequential decision problems commonly encountered in robotics.
3.1 Preliminaries

An important area of robotics can be framed as a sequential decision process, where the robot needs to decide which action to take at each stage while solving a task. Sequential decision processes can be formalized as Markov decision processes (MDPs) [127], where an agent in a state (s) takes an action (a), possibly changes its state (s'), and receives a reward (r). In the new state, it has to take a new decision over which new action to take, until some termination criterion is met.

An MDP is represented by a tuple M = <S, A, P, R> where

– S is a finite set of states (s_i ∈ S, i = {1, ..., n}),
– A is a finite set of actions, which can depend on each state (a_j(s_i) ∈ A, j = {1, ..., m}),
– R(s, a) is a reward function, which defines the goals and maps each state-action pair to a number (reward) indicating how desirable the state is (R : S × A → R),
– P(s'|s, a) is a state transition function that gives us the probability of reaching state s' ∈ S when taking action a ∈ A in state s ∈ S (P : S × A × S → [0, 1]).

Additional elements include:

– Policy (π): Defines how the agent behaves. It is a (possibly stochastic) mapping from states to actions (π(s) → a) which dictates which action to take in each state.
– Value function (V^π / Q^π): It is the expected accumulated reward that an agent anticipates to receive in a state s (V^π(s)),

  V^π(s) = E_π{ G_t | s_t = s }                                      (1)
         = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s },               (2)

  or in a state s taking an action a (Q^π(s, a)),

  Q^π(s, a) = E_π{ G_t | s_t = s, a_t = a }                                (3)
            = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a },         (4)

  and following the policy π onward, where γ is a discount factor and G_t is the total discounted reward.

Solving an MDP means obtaining an optimal policy, value function, or both. There are different approaches for solving MDPs [162]: (i) those based on dynamic programming, when a complete and accurate model of the environment is known (i.e., the reward function and the state transition function), (ii) Monte Carlo methods, which estimate statistics from simulations and do not need a model of the environment, and (iii) temporal difference methods, which also do not need a model and proceed in a step-by-step incremental way.

Fig. 1 Example of a simple grid world where RL techniques can be used to optimally reach a goal state from any starting position

For instance, consider a robot in a grid world (see Fig. 1), where each square represents a state, which could be represented by its coordinates, and where the robot can perform at most four possible actions per state (i.e., up, down, right, left), but may have fewer in some states (e.g., the corners). The square with the treasure represents the goal state. The robot receives a positive reward when reaching the goal (e.g., 10) and a negative reward for each intermediate movement (e.g., -1). The robot has to learn a policy, which tells the robot which action to perform in each state in order to reach the goal and accumulate the maximum expected reward (in this case, the policy with the shortest path). The value function assigns each state or state-action pair the expected accumulated reward when following a particular policy. Hence, if the current policy moves the robot, from its position shown in the figure, down until reaching a wall and then follows the wall to its left until reaching the goal, the undiscounted total reward (its value function) will be 3. However, if the current policy tells the robot to go up until reaching a wall and then follow the wall to its right, its value function, in its current state, will be 5. The RL algorithms are used to learn the optimal policy and/or value function for each state.
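To make the MDP elements above concrete, the following minimal Python sketch encodes a small grid world like the one in Fig. 1 and evaluates the discounted return G_t of a fixed action sequence. It is an illustration, not code from the paper: the grid layout, goal location, and helper names are assumptions, while the rewards (+10 at the goal, -1 per move) follow the text.

```python
# A 5x5 grid world: states are (row, col) cells, the goal cell holds the treasure.
N = 5
GOAL = (0, 4)                      # assumed goal cell, for illustration only
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition P(s'|s,a): move one cell, staying inside the grid."""
    dr, dc = ACTIONS[action]
    nr = min(max(state[0] + dr, 0), N - 1)
    nc = min(max(state[1] + dc, 0), N - 1)
    next_state = (nr, nc)
    reward = 10.0 if next_state == GOAL else -1.0   # R(s, a) as in the text
    done = next_state == GOAL
    return next_state, reward, done

def discounted_return(state, actions, gamma=0.95):
    """G_t = sum_k gamma^k r_{t+k+1} for a fixed sequence of actions (Eqs. 1-2)."""
    g, discount = 0.0, 1.0
    for a in actions:
        state, r, done = step(state, a)
        g += discount * r
        discount *= gamma
        if done:
            break
    return g

print(discounted_return((4, 0), ["up"] * 4 + ["right"] * 4))
```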
In robotics, it is often the case that the robot does not have a model of the environment and has to incrementally learn, through interaction with the environment, how to perform a task. When we are trying to learn a function f in an incremental way, it is common to use the difference between the target value and the estimated value of the function that we are trying to learn:

f_new ← f_old + α( f_target − f_old ),                                      (5)

where α is commonly known as the learning rate. The difference between f_target and f_old, in temporal difference methods, is known as the TD-error. One of the main differences between RL and classic classification methods is that f_target is not known and has to be estimated. Let us suppose that we want to learn the Q value function; then, we can represent our learning step as follows:

Q(s, a) ← Q(s, a) + α( Q(s, a)_target − Q(s, a) );                           (6)

however, since we do not know Q(s, a)_target, we have to estimate it. We can substitute Q(s, a)_target in different ways, which have produced different algorithms, for instance:

– SARSA¹: r_{t+1} + γ Q(s_{t+1}, a_{t+1}).
– Q-learning: r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}).
– Monte Carlo: G_t.
– n-step RL: r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^n Q(s_{t+n}, a_{t+n}).

¹ State, Action, Reward, State, Action.

For instance, the Q-learning algorithm [182] (also referred to as an off-policy algorithm) is described in Algorithm 1 (taken from [162]), where an episode is a sequence of state-action pairs followed by the agent until reaching a terminal state (or termination condition), ε-greedy means that normally the action with the best Q-value will be selected, but sometimes a random action will be selected with a small probability (ε), r is the reward obtained after taking action a in state s when moving to the next state s', and a' is an action selected in s'. This promotes keeping some exploration during the learning process.

Algorithm 1 The Q-learning algorithm.
  Initialize Q(s, a) arbitrarily
  repeat {for each episode}
    Initialize s
    repeat {for each step in the episode}
      Select an action a in s using a policy derived from Q (e.g., ε-greedy)
      Take action a, observe r, s'
      Q(s, a) ← Q(s, a) + α[ r + γ max_{a'} Q(s', a') − Q(s, a) ]
      s ← s'
    until s is terminal
  until convergence

In the example of Fig. 1, there are 25 states with at most four actions per state, a reward of 10 in the goal state and -1 for the rest of the states. The agent will initially behave randomly but will gradually improve its performance as the Q-values are updated.

An attractive feature of this type of algorithm is that, for discrete state and action spaces, provided that the agent visits every state-action pair several times, it converges to the optimal value function (from which an optimal policy can be obtained) [153].
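The following minimal sketch implements the tabular Q-learning update of Algorithm 1 with an ε-greedy policy on a grid world like the one above. It is an illustration under assumed hyperparameters and grid layout, not code from the paper.

```python
import random
from collections import defaultdict

N, GOAL = 5, (0, 4)                            # illustrative 5x5 grid, assumed goal cell
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    nr = min(max(s[0] + a[0], 0), N - 1)
    nc = min(max(s[1] + a[1], 0), N - 1)
    s2 = (nr, nc)
    return s2, (10.0 if s2 == GOAL else -1.0), s2 == GOAL

Q = defaultdict(float)                         # tabular Q(s, a), initialized to 0
alpha, gamma, eps = 0.1, 0.95, 0.1

def epsilon_greedy(s):
    if random.random() < eps:                  # explore with small probability epsilon
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])   # otherwise act greedily

for episode in range(2000):
    s, done = (N - 1, 0), False                # start in the bottom-left corner
    while not done:
        a = epsilon_greedy(s)
        s2, r, done = step(s, a)
        # Q-learning update: the target bootstraps with max over next actions (off-policy)
        td_target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
        s = s2

print("greedy action at start:", max(ACTIONS, key=lambda a: Q[((N - 1, 0), a)]))
```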
3.2 Function approximation

Learning from interacting with the environment is the subject area of reinforcement learning (RL), so it is not surprising that many researchers in robotics have tried to use it. However, in this domain there is normally a large search space, and implicit forms (i.e., parametric functions) of the value function or the policy function need to be adopted. In this case, we want to estimate a value or policy function with a parameterized function that is close to the real function; e.g., for the Q(s, a) value function, we want to learn a parameterized function Q_θ(s, a) ≈ Q(s, a) with parameters θ. There are several possible function approximators that can be used; here, we focus on neural networks. In general, the objective is to find the parameters θ of a function that minimize the loss between the estimated Q_θ(s, a) and the real Q(s, a). So θ can be updated by taking the gradient of the loss function J(θ), i.e., θ ← θ − ½ α ∇_θ J(θ), where the loss function can be the mean square error: J(θ) = E[(Q(s, a) − Q_θ(s, a))²].

For instance, if we want to approximate the Q-value function (Q_θ(s, a)), θ can be updated as:

θ ← θ + α( Q(s, a) − Q_θ(s, a) ) ∇_θ Q_θ(s, a).                              (7)

As before, the value of Q(s, a) is unknown, so we have to approximate it. In this case, we are estimating the error with an approximation of the true function, which is known as a semi-gradient.

Again, there are different ways in which Q(s, a) can be approximated, giving rise to different algorithms, e.g.,

– One-step RL: r + γ Q_θ(s', a').
– Monte Carlo: G_t.
– n-step RL: G_{t:t+n}, where G_{t:t+n} = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^n Q_θ(s_{t+n}, a_{t+n}).

For instance, a one-step TD function approximation algorithm is shown in Algorithm 2. Consider the example of Fig. 1, but now with a very large number of states and/or actions, where a tabular representation of the Q-function is no longer feasible. What this type of algorithm does is approximate a function by adjusting the value of its parameters.

Algorithm 2 A one-step TD function approximation algorithm.
  Initialize arbitrarily the parameters (θ) of the value function
  repeat {for each episode}
    Initialize s
    repeat {for each step in the episode}
      Select an action a in s using a policy derived from Q (e.g., ε-greedy)
      Take action a, observe r, s'
      θ ← θ + α[ r + γ Q_θ(s', a') − Q_θ(s, a) ] ∇_θ Q_θ(s, a)
      s ← s'
    until s is terminal
  until convergence

This formulation may pose some challenges, as the data distribution changes during the learning process, i.e., we are updating a function using the same function on a different state-action pair, which changes its values while interacting with the environment, and this can produce divergence in the function approximation [41]. In addition, subsequent examples are correlated, which also breaks the assumption of independent samples required for convergence of gradient descent.

Furthermore, in robotics, as well as in many other domains, it is not clear how to represent the states and actions so that the learning process can produce satisfactory results, as learning directly from high-dimensional inputs (e.g., images, videos, etc.), which is a natural option for robotics, has been one of the big challenges of RL.

3.3 Deep reinforcement learning

Recent developments in deep neural networks have shown that raw data can be used as input for learning. This has produced a new impetus in the area, as the representation issue is left to these networks. There are still, however, several challenges: DL requires a large set of labeled data, while in RL there are sparse and delayed rewards. Additionally, DL assumes that the data are independent and identically distributed (i.i.d.)—i.e., it assumes a fixed data distribution—whereas in RL the data distribution changes over time during the learning process. DL techniques have shown that they are more effective when using mini-batches; however, RL data are sequential and there are certain dependencies between successive examples.

Deep reinforcement learning has been used to approximate value functions, policy functions, or both (actor-critic methods). There are other approaches, like model-based RL or offline RL, which are not covered in this paper. A general overview of the DRL methods covered in this paper is shown in Fig. 2.

Fig. 2 An overview of some deep reinforcement learning algorithms (Q-value: DQN, DDQN, C51, Rainbow, QT-Opt; Policy: REINFORCE, Maximum Entropy, PPO, TRPO; Actor-Critic: DDPG, TD3, SAC, A2C/A3C, DPG)

The breakthrough came with the DQN (deep Q-network) algorithm [115], which managed a successful combination between Q-learning and deep convolutional networks. It was initially applied to Atari games and introduced two main techniques to mitigate some of the existing problems. One of them was to use experience replay, a.k.a. mini-batches, where experiences of the agent in the environment (e_t = (s_t, a_t, r_{t+1}, s_{t+1})) are stored in a database D = e_1, ..., e_N and randomly sampled for updating the Q-function, as will be shown below. One advantage of experience replay is that it can use a sample of experiences for updating the function parameters, using its average to smooth the learning process (reducing variance) and removing the correlations between successive samples. However, it stores the last N samples and does not distinguish between relevant transitions. The other relevant design decision was to have one network with its weights fixed, serving as reference for another network which updates its weights during learning. After a fixed number of steps, the fixed network is replaced by the most recently updated network and the learning process continues. Although we still have a moving target, having a fixed network for some time helps to improve convergence. If θ⁻ refers to the weights of the fixed network and θ refers to the weights that are updated, the gradient of the loss function with respect to θ is:

∇_θ L(θ) = E_{s,a,r,s'}[ ( r + γ max_{a'} Q_{θ⁻}(s', a') − Q_θ(s, a) ) ∇_θ Q_θ(s, a) ],        (8)

where E_{s,a,r,s'} is estimated with samples from the experience replay (see Algorithm 3).

Algorithm 3 The DQN algorithm (taken from [115]).
  Initialize replay memory D with capacity N
  Initialize action-value function Q with random weights θ
  Initialize target action-value function Q̂ with weights θ⁻ = θ
  for episode = 1, M do
    Initialize sequence s₁ = {x₁} and preprocessed sequence φ₁ = φ(s₁)
    for t = 1, T do
      With probability ε select a random action a_t
      Otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
      Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
      Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
      Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
      Sample a random mini-batch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
      Set y_j = r_j if the episode terminates at step j + 1, and y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻) otherwise
      Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
      Every C steps reset Q̂ = Q
    end for
  end for

Imagine that in the example of Fig. 1 the inputs to the learning agent are now images taken from the ceiling, and in this case the actions are still discrete (although we will later see algorithms without this restriction). The algorithm would learn a Q-value function that takes images as input, processes them with a CNN, and returns a value for each action.
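As a hedged illustration of these two ingredients, the following PyTorch sketch shows an experience replay buffer and the semi-gradient update of Eq. (8) against a frozen target network. This is a generic sketch, not the DQN implementation of [115]; network sizes and hyperparameters are arbitrary, and states are assumed to be stored as float tensors.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # keeps only the last N transitions
    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
                torch.stack(s2), torch.tensor(d, dtype=torch.float32))

n_inputs, n_actions = 16, 4                     # illustrative sizes
q_net = nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # theta_minus <- theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(buffer, batch_size=32, gamma=0.99):
    s, a, r, s2, done = buffer.sample(batch_size)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q_theta(s, a)
    with torch.no_grad():                                       # frozen target network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)                 # (y - Q_theta(s, a))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C updates, the target network is synchronized with the online network:
# target_net.load_state_dict(q_net.state_dict())
```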
After the DQN publication, several extensions have been proposed. Here, we briefly review the main ideas of some of them:

– DDQN (Double DQN) [176]: It was shown that Q-learning can sometimes over-estimate its Q values [169]. To avoid this, DDQN decomposes the updating step of Q-learning into two steps. The Q-learning updating equation is:

  θ ← θ + α( Y^Q − Q_θ(s, a) ) ∇_θ Q_θ(s, a),                                 (9)

  but now Y^Q (= r + γ Q_θ(s', argmax_a Q_θ(s', a))) is changed to Y^DoubleQ, that is,

  Y^DoubleQ ← r + γ Q_{θ⁻}(s', argmax_a Q_θ(s', a)),                          (10)

  where the action with the largest value is selected with the online network and evaluated with the fixed network (a code sketch of this target appears after this list).
– Prioritized experience replay [143]: Instead of randomly sampling experiences from D, the sampling distribution favors samples that produced larger TD errors.
– Dueling network [181]: Runs two networks, one for the value function and one for the advantage function (defined below), and at the end they are joined with a particular aggregator.
– Multi-step learning [58]: Updates the TD error using information from several steps (n-step RL).
– Distributional RL [9]: Uses a distribution of the return value, rather than the expected value.
– Noisy DQN [44]: Introduces noise that is reduced over time, improving the exploration process.
– Rainbow [59]: Makes a combination of several of these approaches.
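A minimal sketch of the double-Q target of Eq. (10); this is illustrative only and assumes the `q_net` and `target_net` modules from the previous DQN sketch are available.

```python
import torch

def double_dqn_target(r, s2, done, gamma=0.99):
    """Y^DoubleQ: select the action with the online network,
    evaluate it with the frozen target network."""
    with torch.no_grad():
        best_action = q_net(s2).argmax(dim=1, keepdim=True)           # argmax_a Q_theta(s', a)
        evaluated = target_net(s2).gather(1, best_action).squeeze(1)  # Q_theta_minus(s', best_action)
        return r + gamma * (1 - done) * evaluated

# Compare with the standard DQN target, which both selects and evaluates with the
# target network and can over-estimate:
#     r + gamma * (1 - done) * target_net(s2).max(dim=1).values
```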
3.3.1 Policy function

Deep learning can be applied not only to learn value functions but also to directly learn the policy function, or both of them. Learning a policy function may be easier than learning the value function and has better convergence properties, although it normally converges to a local rather than a global optimum; stochastic policies can also be learned in this way. We would like to learn a policy that produces the optimal value function under that policy. In episodic environments, we can use the start value of the policy, J₁(θ) = V^{π_θ}(s₁), i.e., run the policy until reaching a terminal state. To optimize, we can take the derivative with respect to θ. For convenience, we will denote V^{π_θ}(s) as V(θ) to indicate the value function obtained using the parameterized policy function π_θ. Let τ = (s₀, a₀, r₀, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T) denote a trajectory, where s_T is the terminal state. Therefore,

R(τ) = Σ_{t=0}^{T} R(s_t, a_t),                                              (11)

and

V(θ) = E_{π_θ}[ Σ_{t=0}^{T} R(s_t, a_t); π_θ ]                               (12)
     = Σ_τ P_θ(τ) R(τ),                                                      (13)

where P_θ(τ) denotes the probability over trajectories when executing policy π_θ and R(τ) is the return we obtain on that trajectory. Thus, our goal is to find the policy parameters θ such that:

argmax_θ V(θ) = argmax_θ Σ_τ P_θ(τ) R(τ).                                    (14)

The policy parameters only appear in the distribution over trajectories. Therefore, the gradient with respect to θ is

∇_θ V(θ) = ∇_θ Σ_τ P_θ(τ) R(τ),                                              (15)

which can be rewritten as:

∇_θ V(θ) = Σ_τ R(τ) ∇_θ P_θ(τ)                                               (16)
         = Σ_τ R(τ) (P_θ(τ) / P_θ(τ)) ∇_θ P_θ(τ)                             (17)
         = Σ_τ R(τ) P_θ(τ) ∇_θ log P_θ(τ),                                   (18)

since ∇ log x = ∇x / x. We are summing over the probabilities of all trajectories, which we can approximate by sampling some m trajectories and averaging uniformly:

∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^{(i)}) ∇_θ log P_θ(τ^{(i)}).            (19)

Therefore, we need to evaluate, for trajectory i:

∇_θ log P_θ(τ^{(i)}) = ∇_θ log ( μ(s₀) Π_{j=0}^{T−1} p(s_{j+1}|s_j, a_j) π_θ(a_j|s_j) ),   (20)

where μ(s₀) is the probability of the initial state s₀ and π_θ(a|s) is the policy that decides which action to take in each state. We can expand the previous expression as:

∇_θ log P_θ(τ^{(i)}) = ∇_θ log μ(s₀) + Σ_{j=0}^{T−1} ∇_θ log p(s_{j+1}|s_j, a_j)
                       + Σ_{j=0}^{T−1} ∇_θ log π_θ(a_j|s_j).                 (21)

Taking the derivative, and since only the last term depends on θ:

∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^{(i)}) Σ_{j=0}^{T−1} ∇_θ log π_θ(a_j|s_j),   (22)

which means that we do not need to know the transition probability, although we still need to evaluate the gradient of the log of the policy. In the last expression, (1/m) Σ_{i=1}^{m} R(τ^{(i)}) can be seen as a Monte Carlo estimate. This is noisy and has a large variance; however, we can improve it in several ways. We have seen that:

∇_θ V(θ) = ∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T−1} R(s_t, a_t) Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) ].   (23)

We can rearrange the summations:

∇_θ V^{π_θ} = E_τ[ Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) Σ_{i=t}^{T−1} R(s_i, a_i) ],   (24)

where the last summation corresponds to the return G_t, which is the sum of the rewards from a particular state until a terminal state.

REINFORCE² uses the policy gradient theorem [184] to approximate a policy function. In this case, the algorithm follows a Monte Carlo approach (i.e., it updates the parameters after completing a whole episode). From the previous expression, and considering a discounted reward, we can update the parameters of the policy function as follows:

θ_{t+1} = θ_t + α γ^t G_t ∇_θ log π_θ(a_t|s_t),                              (25)

where γ^t is the discount factor raised to the power of the time step at which the state is reached, G_t is the return (total accumulated reward) obtained from that state, and a_t is the action selected by the policy.

² REward Increment = Non-negative Factor × Offset Reinforcement × Characteristic Eligibility.

The policy gradient theorem can be generalized by including a comparison between the value function and a baseline, which helps to reduce the variance:

θ_{t+1} = θ_t + α ( G_t − b(s_t) ) ∇_θ log π_θ(a_t|s_t).                     (26)

A natural candidate for b(s) is the estimated value function V̂(s_t; φ), where φ are the parameters of the value function. We can incorporate this into the REINFORCE algorithm, where β is the learning rate for the value function (see Algorithm 4).

Algorithm 4 The REINFORCE algorithm (taken from [162]).
  Initialize the weights of the policy (θ) and of the value function (φ)
  repeat
    Generate an episode s₀, a₀, r₁, s₁, ..., s_{T−1}, a_{T−1}, r_T following π
    for each step in the episode t = 0, 1, ..., T − 1 do
      G_t ← return since time t
      φ ← φ + β( G_t − V̂(s_t; φ) ) ∇_φ V̂(s_t; φ)
      θ ← θ + α γ^t ( G_t − V̂(s_t; φ) ) ∇_θ log π_θ(a_t|s_t)
    end for
  until convergence

Each update has to generate new complete trajectories, which can be inefficient. An alternative is to change it into a temporal difference method where updates are performed at every single step, and not only at the end of an episode. In this case, G_t of a single step can be replaced by: (i) Q̂(s, a; φ) or (ii) r_{t+1} + γ V̂(s_{t+1}; φ). In the former case, it results in what is known as the advantage function:

Â(s, a) = Q̂(s, a) − V̂(s),                                                   (27)

which tells us the amount of improvement over the average in that state (the extra reward obtained with that action):

θ_{t+1} = θ_t + α ( Q̂(s_t, a_t) − V̂(s_t) ) ∇_θ log π_θ(a_t|s_t)              (28)
        = θ_t + α Â(s_t, a_t) ∇_θ log π_θ(a_t|s_t).                          (29)

In the latter case, it results in the TD error, which can be seen as an unbiased estimate of the advantage function:

θ_{t+1} = θ_t + α ( r_t + γ V̂(s_{t+1}; φ) − V̂(s_t; φ) ) ∇_θ log π_θ(a_t|s_t).   (30)

In the case of the robot of Fig. 1, the robot would need to generate a complete episode (reach the goal) and store the accumulated reward from each state in the episode to the goal. In this case, we can use either discrete or continuous actions.
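A compact PyTorch sketch of the REINFORCE update with a learned value baseline (Algorithm 4). This is a generic illustration under assumed network sizes, not code from the paper; the episode data (visited states, chosen actions, rewards) are assumed to be collected elsewhere.

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 16, 4, 0.99          # illustrative sizes
policy = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
value = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)    # alpha
opt_v = torch.optim.Adam(value.parameters(), lr=1e-3)      # beta

def reinforce_update(states, actions, rewards):
    """states: [T, n_states] float tensor; actions: [T] long tensor; rewards: list of floats."""
    # Monte Carlo returns G_t, computed backward over the whole episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    baseline = value(states).squeeze(1)                     # V_hat(s_t; phi)
    advantage = returns - baseline.detach()                 # (G_t - b(s_t)) in Eq. (26)

    log_probs = torch.log_softmax(policy(states), dim=1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi_theta(a_t|s_t)

    discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
    policy_loss = -(discounts * advantage * log_pi).mean()  # gradient ascent on the policy
    value_loss = nn.functional.mse_loss(baseline, returns)  # fit the baseline to G_t

    opt_pi.zero_grad(); policy_loss.backward(); opt_pi.step()
    opt_v.zero_grad(); value_loss.backward(); opt_v.step()
```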
After REINFORCE, several approaches have been suggested:

– Maximum entropy [149]: The random selection of actions in RL defines its exploration mechanism. This randomness can be expressed as a probability distribution p(a|s) and measured with the entropy of this distribution. This entropy is expected to decrease as the policy function converges. The maximum entropy approach adds an entropy term called the entropy bonus:

  ∇_θ log π_θ(a_t|s_t) ( G_t − V̂(s_t; w) ) + η ∇_θ H(π_θ(a_t|s_t)),           (31)

  which prevents the agent from converging too fast to an optimum and promotes the agent to take less predictable actions, where η is a weighting factor. The entropy of the policy is defined as:

  H(π(a|s)) = − Σ_a π(a|s) log π(a|s)                                         (32)
            = E_{a∼π(·|s)}[ − log π(a|s) ].                                   (33)

  Some researchers suggest giving the entropy bonus not only locally but in the long term, trying to optimize:

  π* = argmax_π E_π[ Σ_{t=0}^{∞} γ^t ( r_t + α H_t(π(a|s)) ) ].               (34)

– Trust-region methods [144]: Remember that we want to optimize the parameters θ of the objective function J(θ) = E[ Σ_{t=0}^{∞} r(s_t, a_t) ]. Using what has been developed in this section, the policy gradient update can be described as:

  ∇_θ J(π_θ) = E[ Σ_{t=0}^{∞} γ^t ∇_θ log π_θ(a_t|s_t) A^{π_θ}(s_t, a_t) ].   (35)

  Some limitations of policy gradients are: (i) using old data to estimate the policy gradients of the new policy function and (ii) the distance in parameter space is not the same as the distance in policy space. In terms of the data, we can run several sample trajectories or we can use importance sampling. With importance sampling, we can estimate the policy function using a gradient from trajectories sampled from a different behavior policy. In essence, importance sampling can be formulated as follows:

  E_{x∼p(x)}[ f(x) ] = ∫ p(x) f(x) dx                                         (36)
                     = ∫ p(x) (q(x)/q(x)) f(x) dx                             (37)
                     = ∫ q(x) (p(x)/q(x)) f(x) dx                             (38)
                     = E_{x∼q(x)}[ (p(x)/q(x)) f(x) ].                        (39)

  We can approximate the objective function with importance sampling:

  J(θ') = E_{π_θ}[ (π_{θ'}(a|s) / π_θ(a|s)) A^{π_θ}(s, a) ].                  (40)

  One problem with importance sampling is that small differences between the distributions can become large values in the gradient, so this is a good approximation when π_{θ'} is similar to π_θ. This can be enforced using a bound that depends on the Kullback–Leibler (KL) distance between the two policies:

  J(θ') = E_{π_θ}[ (π_{θ'}(a|s) / π_θ(a|s)) A^{π_θ}(s, a) ],                  (41)
  s.t.  E_{π_θ}[ KL(π_θ || π_{θ'}) ] ≤ δ.                                     (42)

  As an alternative, we can constrain the values to be within certain bounds (PPO or Proximal Policy Optimization [145]):

  J(θ') = E[ min( r_t(θ') Â_t, clip(r_t(θ'), 1 − ε, 1 + ε) Â_t ) ],           (43)

  where r_t(θ') = π_{θ'}(a_t|s_t) / π_θ(a_t|s_t) and clip constrains the value of r_t(θ') to be within the [1 − ε, 1 + ε] limits (a sketch of this clipped objective appears after this list).

– Deterministic policy [150]: Another possibility is to learn a deterministic policy (μ(s)). The deterministic policy gradient is expressed as an expected gradient of the action-value function, which means that it can be estimated more efficiently than the stochastic policy gradient. In the stochastic policy gradient, the policy gradient integrates over the state and action spaces, whereas in the deterministic policy it is done only over the state space; however, it is still necessary to explore over the entire space.

  As previously seen, the policy gradient theorem, on which the REINFORCE algorithm is based, says that:

  ∇_θ J(π_θ) = E[ Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) Σ_{i=t}^{T−1} R(s_i, a_i) ],   (44)

  where the last sum can be replaced by the action-value function Q, which is expressed in a simplified form:

  ∇_θ J(π_θ) = E[ ∇_θ log π_θ(a|s) Q^π(s, a) ].                               (45)

  In the actor-critic scheme, we have two components: the actor, which adjusts the policy π_θ(s) using stochastic gradient ascent on the previous equation, and the critic, which uses an action-value function Q_φ(s, a) ≈ Q^π(s, a) with parameters φ that we need to estimate.

  When we want to improve the policy, in general, a greedy maximization strategy is used on the action-value function: μ_{t+1}(s) = argmax_a Q^{μ_t}(s, a). In continuous action spaces, however, this is problematic because we would need to do a global optimization at each step. Instead, we can move the policy in the direction of the gradient of Q. Thus, the parameters θ of the policy are updated in proportion to the gradient ∇_θ Q^{μ_t}(s, μ_θ(s)). Each state may suggest a different direction, so we can average them by taking the expected value with respect to the state distribution ρ^μ(s):

  θ^{t+1} = θ^t + α E_{s∼ρ^{μ_t}}[ ∇_θ Q^{μ_t}(s, μ_θ(s)) ].                  (46)

  Applying the chain rule:

  θ^{t+1} = θ^t + α E_{s∼ρ^{μ_t}}[ ∇_θ μ_θ(s) ∇_a Q^{μ_t}(s, a)|_{a=μ_θ(s)} ].   (47)

  The deterministic policy gradient theorem, which follows a similar demonstration as the policy gradient theorem, says that:

  ∇_θ J(μ_θ) = E_{s∼ρ^μ}[ ∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)} ],             (48)

  where J(μ_θ) = E[ r_1^γ | μ ] and r_1^γ is the total discounted reward r_1^γ = Σ_{t=1}^{∞} γ^{t−1} r(s_t, a_t).

  We can use the previous theorem for different actor-critic algorithms. For instance, an on-policy deterministic actor-critic algorithm, using SARSA as the critic, involves the following steps:

  δ_t = r_t + γ Q^ρ(s_{t+1}, a_{t+1}) − Q^ρ(s_t, a_t)   (TD-error),           (49)
  ρ_{t+1} = ρ_t + α_ρ δ_t ∇_ρ Q^ρ(s_t, a_t),                                  (50)
  θ^{t+1} = θ^t + α_θ ∇_θ μ_θ(s_t) ∇_a Q^μ(s_t, a_t)|_{a=μ_θ(s)}.             (51)

  Similarly, we can have an off-policy deterministic actor-critic algorithm using Q-learning as the critic:

  δ_t = r_t + γ Q^ρ(s_{t+1}, μ_θ(s_{t+1})) − Q^ρ(s_t, a_t),                   (52)
  ρ_{t+1} = ρ_t + α_ρ δ_t ∇_ρ Q^ρ(s_t, a_t),                                  (53)
  θ^{t+1} = θ^t + α_θ ∇_θ μ_θ(s_t) ∇_a Q^μ(s_t, a_t)|_{a=μ_θ(s)}.             (54)
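As referenced in the trust-region item above, a minimal sketch of the PPO clipped surrogate objective of Eq. (43). It is illustrative only, with assumed tensor shapes, and is not the implementation of [145].

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """new_log_probs, old_log_probs, advantages: 1-D tensors over sampled (s_t, a_t).
    Returns the negative clipped surrogate objective of Eq. (43), to be minimized."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # r_t(theta') = pi_theta' / pi_theta
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()

# Typical usage: old_log_probs are recorded (and detached) when the data are collected,
# new_log_probs come from the current policy network, and advantages estimate A_hat_t.
```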
DQN works with discrete actions, but in the robotics domain it is more common to have continuous actions. Deep deterministic policy gradient (DDPG) is a model-free RL algorithm for continuous actions that combines DPG (deterministic policy gradient) with DQN [100].

As previously seen, DQN uses two main strategies to stabilize the learning process, namely an experience replay and a frozen target network. DDPG uses the same strategies within an actor-critic scheme, applying, in this case, four networks: two networks for the critic as before (one with the weights frozen, or target network, and the main Q-network, which is updated at every step) and two for the actor, using a similar process. To increase the exploration of the policy, it adds Gaussian noise to the actions: μ'(s) = μ_θ(s) + N(0, 1). Rather than replacing the networks every M steps, DDPG uses soft updates on the parameters of both the actor and the critic networks, with τ << 1: θ' ← τθ + (1 − τ)θ' and φ' ← τφ + (1 − τ)φ'.

Something in the paper that is particularly relevant for robotics, where state variables can take different units and magnitudes, is that it uses batch normalization [73], which normalizes each dimension across the samples in a mini-batch to have unit mean and variance. The DDPG algorithm (taken from [100]) is shown in Algorithm 5.

Algorithm 5 The DDPG algorithm.
  Randomly initialize critic Q_θ(s, a) and actor μ_φ(s) networks with weights θ and φ
  Initialize target networks Q' and μ' with weights θ' ← θ and φ' ← φ
  Initialize replay buffer B
  for episode = 1 to M do
    Initialize a random process N for action exploration
    Receive initial observation state s_t
    for t = 1 to T do
      Select action a_t = μ_φ(s_t) + N_t according to the current policy and exploration noise
      Execute action a_t and observe reward r_t and new state s_{t+1}
      Store transition (s_t, a_t, r_t, s_{t+1}) in B
      Sample a random mini-batch of N transitions (s_i, a_i, r_i, s_{i+1}) from B
      Set y_i = r_i + γ Q_{θ'}(s_{i+1}, μ_{φ'}(s_{i+1}))
      Update the critic by minimizing the loss: L = (1/N) Σ_i ( y_i − Q_θ(s_i, a_i) )²
      Update the actor policy using the sampled policy gradient:
        ∇_φ J(μ_φ) ≈ (1/N) Σ_i ∇_φ μ_φ(s) ∇_a Q_θ(s, a)|_{a=μ_φ(s)}
      Update the target networks:
        θ' ← τθ + (1 − τ)θ'
        φ' ← τφ + (1 − τ)φ'
    end for
  end for
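A hedged PyTorch sketch of one DDPG update (Algorithm 5) with soft target updates. Network sizes, noise scale, and hyperparameters are illustrative assumptions, and mini-batches of tensors (s, a, r, s', done) are assumed to come from a replay buffer like the one sketched earlier.

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim, gamma, tau = 16, 2, 0.99, 0.005     # illustrative values
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    # Critic: regress Q_theta(s, a) toward y = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_target(torch.cat([s2, actor_target(s2)], dim=1)).squeeze(1)
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, y)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: ascend the sampled deterministic policy gradient (minimize -Q(s, mu(s)))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # Soft target updates: theta' <- tau*theta + (1 - tau)*theta'
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

def select_action(s):
    # Exploration: deterministic action plus Gaussian noise, mu'(s) = mu(s) + N(0, sigma)
    with torch.no_grad():
        return actor(s) + 0.1 * torch.randn(action_dim)
```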
After DDPG, several extensions have been suggested, like distributed distributional DDPG (D4PG) [8] (to make it run in a distributed fashion, using N-step returns and prioritized experience replay), multi-agent DDPG (MADDPG) [106] (where multiple agents are coordinated to complete tasks with only local information), and twin delayed deep deterministic policy gradient (TD3) [46] (which uses clipped double Q-learning, a delayed update of the target and policy networks, and smoothing over the target policy to avoid overestimation of the value function).

3.3.2 Distributed approaches

DRL algorithms take a long time to converge, so several distributed schemes have been proposed to accelerate the learning process. Some examples include:

– Learning from multiple agents in parallel, assuming that each agent will have different experiences. Running different policies in parallel, each agent has its own copy of the environment, evaluates its gradient, and shares the network that evaluates the loss function (A3C or asynchronous advantage actor-critic [114]).
– Generating data in parallel with a single agent that learns (Ape-X) [67]. It uses a centralized memory for experience replay, combining data from actors that have different exploration policies, which increases the diversity of the samples. In particular, Ape-X also uses double Q-learning with eligibility traces, a dueling architecture, and prioritized experience replay. The loss function is:

  l_t(θ) = ½ ( G_t − Q_θ(s_t, a_t) )²,                                        (55)

  with

  G_t = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n Q_{θ⁻}(s_{t+n}, argmax_a Q_θ(s_{t+n}, a)),   (56)

  where θ⁻ are the parameters of the target network. If the episode ends before n steps, the return is truncated. The actors use ε-greedy exploration with different values of ε. The authors found that increasing the number of actors improves the performance (more experience).
– Recently, LSTMs (long short-term memory networks) have been used to deal with partial information. Recurrent replay distributed DQN (R2D2) trains recurrent networks with distributed experience replay [80]. It is similar to Ape-X (prioritized distributed replay) with n-step double Q-learning, generating the experience with 256 actors and including an LSTM layer after the convolution layers. Instead of storing transition tuples (s, a, r, s'), they store fixed-size sequences of (s, a, r), with adjacent sequences overlapping.

  When training, the online and the target networks are used over the same sequences. The n-step targets for the Q functions are:

  Ŷ_t = h( Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n h^{−1}( Q_{θ⁻}(s_{t+n}, a*) ) ),    (57)

  where θ⁻ are the parameters of the target network,

  a* = argmax_a Q_θ(s_{t+n}, a)                                               (58)

  and

  h(x) = sign(x)( √(|x| + 1) − 1 ) + εx.                                      (59)

  Contrary to Ape-X, it uses a mixture of the max and the mean of the n-step TD-errors δ_i over the sequence to set the priority: p = η max_i δ_i + (1 − η) δ̄, with η = 0.9 (the n-step target with value rescaling is sketched below).

  To deal with partial information, it requires a representation of states that codifies information about the state-action trajectory, in addition to the current state. For that, they use recurrent networks (an LSTM in this case), and so the trajectories are stored in the experience replay.
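A small numeric sketch (illustrative only) of the n-step target with the value rescaling function h of Eqs. (57)–(59). The value of the small ε in h, the closed-form inverse h⁻¹, and the `q_online`/`q_target` callables are assumptions for illustration, not details taken from the surveyed papers.

```python
import numpy as np

EPS = 1e-2   # assumed value of the small epsilon in the value rescaling function

def h(x):
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x

def h_inv(x):
    # closed-form inverse of h, solving the quadratic introduced by the sqrt term
    return np.sign(x) * ((((np.sqrt(1.0 + 4.0 * EPS * (np.abs(x) + 1.0 + EPS)) - 1.0)
                           / (2.0 * EPS)) ** 2) - 1.0)

def n_step_target(rewards, s_n, q_online, q_target, gamma=0.997):
    """rewards: r_t ... r_{t+n-1}; s_n: the state n steps ahead (Eq. 57)."""
    n = len(rewards)
    n_step_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    a_star = int(np.argmax(q_online(s_n)))              # Eq. (58): action chosen by the online net
    bootstrap = h_inv(q_target(s_n)[a_star])            # un-rescale the target network's estimate
    return h(n_step_return + (gamma ** n) * bootstrap)  # re-apply the rescaling

# Example with dummy Q functions returning fixed values:
print(n_step_target([1.0, 0.0, 1.0], None,
                    q_online=lambda s: np.array([0.3, 0.7]),
                    q_target=lambda s: np.array([0.2, 0.9])))
```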
3.3.3 Sub-goals, self-play, and transfer

Several other approaches involve dealing with sparse rewards, several goals, self-play, and transfer learning. Here, we mention some of them:

– Universal value function approximators (UVFA) are an extension to DQN used when there can be more than one goal [142]. If G is the space of possible goals, each g ∈ G has a reward function r_g : S × A → R. Each episode starts by sampling a state-goal pair from some distribution. The goal remains fixed during the whole episode. At each step, the agent gets information on the current state and the current goal, π : S × G → A, and obtains a reward r_t = r_g(s_t, a_t). The function Q now depends on the state-action pair and the goal: Q^π(s_t, a_t, g) = E[ R_t | s_t, a_t, g ].
– Hindsight Experience Replay (HER) trains similarly to UVFA; however, the goals are automatically defined during learning [4]. After a sequence in an episode (s₀, ..., s_T), the transitions are stored not only with the original goal but with a set of other goals. A replay is done with the goal m(s_T) (i.e., the goal achieved at the end of the episode). Several strategies were tried to include sub-goals; the best results were obtained by Future, which replays with k random states that are within the same episode and that were observed after the state.
– Self-play has been used since the beginning of RL to learn how to play checkers, backgammon, and, more recently, Go. The idea is to place two agents to compete with each other, learning from their experiences. An alternative approach is to use a self-play scheme for learning a complex task [157]; it is, in a way, a curriculum learning [10] approach where the tasks to learn are dynamically defined. This is achieved in a nonsymmetric self-play with two agents, Alice and Bob. Alice starts at an initial state (s₀) and after a sequence of actions arrives at a state s_t. In "reversible" environments, the goal of Bob is to start at s_t and return to s₀; in "resettable" environments, the goal for Bob is to start at s₀ and reach s_t.

  The interest is in a reward definition that promotes Alice to set more challenging tasks for Bob, but not too difficult, yielding

  R_B = −γ t_B,                                                               (60)

  where R_B is the total reward of the episode for Bob and t_B is the time it took Bob to complete it. Additionally,

  R_A = γ max(0, t_B − t_A),                                                  (61)

  where R_A is the total reward of the episode for Alice and t_A is the time it took Alice to complete it. The total time of each episode is limited to t_max; if Bob does not finish its task, t_B = t_max − t_A.

  Alice wants Bob to take a long time, but not more than t_max, so Alice limits its steps so that the task is easier for Bob. The best for Alice is to find the easiest task (small t_A) that Bob is unable to solve (large t_B).

The area of deep reinforcement learning has grown significantly in recent years, and it is not possible to cover all the recent developments in one article. The objective here is to provide some of the main algorithms and ideas so that the interested reader can follow up and understand some of the recent developments.

4 Recent and relevant works in the field

Deep learning algorithms [95] have been particularly successful in applications to computer vision. This fact can be verified by the number of citations that some works on that subject have achieved in recent times. For instance, the work in [90] and the work in [151] each have tens of thousands of citations according to Google Scholar.

Notwithstanding, as is mentioned in [161], the problems in computer vision and robotics are inherently different, because computer vision takes images and translates them into information, while robotics translates sensing into actions. Nevertheless, useful ideas such as adversarial nets [53] have emerged from the deep learning community, which can be applied to robotics, for instance, in noncooperative games with robots.

Elaborating on the aforementioned example, in [53] two networks are used: a generator G that wants to mimic elements from the distribution p_data over data x, and a discriminator D that wants to be able to discriminate between samples from p_data and samples from the generator's distribution p_g. One defines a prior distribution on the input noise p_z(z), followed by a map G(z; θ_g) to data space, which is a differentiable function in the form of a multilayer perceptron with parameters θ_g. D(x; θ_d) is also modeled as a multilayer perceptron with a single scalar as its output. One trains D to maximize the probability of assigning the correct label to samples from p_data and from G. G is simultaneously trained to minimize log(1 − D(G(z))), which can be interpreted as minimizing the capability of the discriminator to differentiate between data from p_data and data from the generator. Then, the training process is modeled as G and D playing a minimax game on the value function V(G, D), that is:

min_G max_D V(G, D) = E_{x∼p_data(x)}[ log D(x) ] + E_{z∼p_z(z)}[ log(1 − D(G(z))) ].   (62)
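A brief PyTorch sketch of one step of this minimax game (a generic GAN step under assumed network shapes, not the implementation of [53]; the generator loss is written in the commonly used non-saturating form).

```python
import torch
import torch.nn as nn

noise_dim, data_dim = 8, 16                      # illustrative dimensions
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real_batch):
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: maximize log D(x) + log(1 - D(G(z)))  (written as a BCE minimization)
    z = torch.randn(batch, noise_dim)
    fake = G(z).detach()
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: push D(G(z)) toward 1 (equivalent in spirit to minimizing log(1 - D(G(z))))
    z = torch.randn(batch, noise_dim)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```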
There are other examples of deep learning advances that robotics can benefit from. For instance, one of the main challenges in robotic manipulation is to estimate the pose of the object to be grasped. In [174], the authors claim the first deep network trained exclusively on synthetic data with state-of-the-art performance on 6-DoF object pose estimation, generalizing well even in extreme lighting conditions. A challenge when learning with synthetic data is the reality gap, which refers to the phenomenon of networks trained on synthetic data not performing well on real data without additional fine-tuning. To overcome this issue, the authors of [173] propose to train with photorealistic data along with domain randomization [171], in which the training data are randomly modified in nonrealistic ways such that the real data appear simply as another variation. The authors place Yale–CMU–Berkeley (YCB) objects [17] in different virtual environments to create the randomized photorealistic data. All data were generated by means of a plugin called NDDS [170], which was developed using Unreal Engine 4 (UE4). After training, the resulting poses present sufficient accuracy for robotic manipulation. Results are compared to PoseCNN [187], surpassing its performance.

Summarizing, there are many works [6,55,72,76,97,102,131,186] whose main application corresponds to computer vision problems; nonetheless, they provide important tools to solve robotics tasks. In what follows, we will focus on presenting work on deep learning and deep reinforcement learning applied exclusively to robotics. Section 4.1 presents the deep learning content and Sect. 4.2 the deep reinforcement learning approaches. Each section groups the works into topics to provide a taxonomy for the reader's future reference.

Section 4.1 has the following subsections: Direct Control (4.1.1), in which deep learning is used to directly control a mobile robot; Hybrid Models (4.1.2), which combine deep neural networks with planning techniques; Deep Learning for Learning Representations (4.1.3), in which deep learning is used to learn abstract representations that can be used for different robotics tasks; and Deep Learning with Sampling-based Motion Planners (4.1.4), which combines deep learning and classical sampling-based motion planners.

Section 4.2 has five subsections: Navigation and Tracking (4.2.1), in which deep learning is applied to the tasks of robot navigation and tracking of moving objects; Massive Amounts of Data and Additional Supervision (4.2.2), which uses massive amounts of data for learning a policy; Simulation-based Learning (4.2.3), in which simulators are used to learn the policies; Dynamic Environments (4.2.4), which presents systems able to adapt to new conditions; and Risk-averse Systems (4.2.5), whose techniques have the goal of avoiding catastrophic events during training and deployment.

4.1 Deep learning algorithms

Since the introduction of deep learning, several authors have applied its techniques to robotics. Here we review just some of them: techniques applied to direct control for navigation, hybrid models combining dynamic systems with neural networks, the combination of deep learning with sampling-based motion planning, and research work that includes learning abstract representations.

4.1.1 Direct control

One possible application of deep learning is to use it directly to control a mobile robot.

The work in [11] is one of the first approaches, in which a convolutional neural network (CNN) is used to map raw pixels from a single camera directly to steering wheel commands. The authors of [11] were able to demonstrate that CNNs are capable of learning the entire task of lane and road following. Less than a hundred hours of driving data sufficed to train the car to drive in a large variety of conditions.

The work in [3] proposes an approach for robot navigation and localization based on learning with variational neural networks. Regarding the navigation task, a variational neural network is used whose inputs are raw camera images and an image of an unrouted roadmap. The output to be learned is a probability distribution over the inverse curvature to navigate at each time instant. A Gaussian mixture model with K = 3 modes is used. It describes the steering wheel command, and the authors apply regularization with an L1/2 norm to prevent extra components. If a routed version of the map is available, the model also outputs a deterministic command. The network is written separately as two functions: one representing the stochastic/unrouted part and the second representing the deterministic/routed part.

The weights of the model are learned using back-propagation with a defined cost function, which contains a term related to likelihood and another to regularization.

Regarding the localization part, the conditional structure of the network updates a posterior belief on the pose of the car, P(θ_p | I, M), based on the relation between the road topology observed from the vehicle and the map. Parameter θ_p is the pose in the map (position and orientation), I is the visual input, and M is the map. The network only computes P(θ_s | θ_p, I, M), but due to the conditioning structure of this model, it is possible to estimate the pose from the visual input by double marginalization over θ_p and θ_s. The posterior belief updating algorithm uses all steering angle samples in order to compute the probability that a given pose and images/maps explain the steering angle, considering the added loop needed to estimate the partition function and normalize the distribution.

As the main results, the authors of [3] first demonstrate how to drive using a picture of the road map and the estimation of a steering angle on both kinds of maps: routed and unrouted. Second, they show how the approach permits reducing pose uncertainty from the agreement between the map and the camera inputs.

Other authors that learned direct control strategies using deep learning algorithms include [92,96,105,126,154,188].
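As a minimal illustration of the kind of end-to-end mapping described for [11] (raw pixels to a steering command), the following sketch defines a small CNN regressor trained by imitation. The architecture, image size, and training details are assumptions for illustration; this is not the network of [11] or [3].

```python
import torch
import torch.nn as nn

class SteeringCNN(nn.Module):
    """Maps a camera image directly to a steering command (regression)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 4 * 4, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, image):
        return self.head(self.features(image))      # predicted steering angle

model = SteeringCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, steering_targets):
    """images: [B, 3, H, W] camera frames; steering_targets: [B, 1] recorded commands."""
    pred = model(images)
    loss = nn.functional.mse_loss(pred, steering_targets)   # imitate the recorded driver
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# e.g., train_step(torch.randn(8, 3, 120, 160), torch.randn(8, 1))
```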
4.1.2 Hybrid models

Another possible approach combines deep neural networks with planning techniques.

In the paper [47], the authors address a navigation problem with minimal information. The proposed work is hierarchical at two levels, integrating model-based motion planning and model-free DL. At the low level, the intention-net (a neural network motion controller) is trained end-to-end in order to allow robust local navigation. At the high level, to go from the current location to the goal, a 2-D floor path planner is used. The authors say that "there are two key elements to humans' performance: the capability of path planning using a 3-D model of the world in an abstract and simplified way, and principally, the ability to use local visual information to perform the planned path". The path planner determines a collision-free global path according to a crude input map, and, in order to follow this path, the motion controller makes use of images taken from a single monocular camera onboard the robot. The motion controller deals with all local environment dynamics (e.g., obstacle obstructions that are not present in the input map, such as people and furniture) and social protocols for navigation. The motion controller is a DNN (deep neural network) trained in an end-to-end fashion via imitation learning. This assumes that there is a direct mapping from perception (inputs) to controls. However, with such a mapping alone the robot cannot be steered according to high-level goals; for this reason, the two-level hierarchical approach is a convenient method. The authors propose a local path and environment (LPE) representation. It is expressed as a 224x224 image which, in a local window around the robot position, contains the path and the environment. The intention-net F, which is a deep multivariate regression network, takes as inputs a camera image and an LPE and gives as outputs a motion control (v, θ), where v is the speed of the robot and θ its steering angle. The path planner re-plans a path at every time step. It uses an occupancy grid and performs adaptive Monte Carlo localization (AMCL) [45] to establish the robot's current position inside the map. The planner uses the hybrid A* algorithm [35], which is able to accommodate both kinds of constraints: kinematic and dynamic. The approach is compared with several alternative methods, and it is tested both in simulations and in real-world experimentation.

In [78], the authors explore how to go beyond purely geometric approaches by applying a procedure that learns navigational capabilities from experience. The proposed navigation system learns in an end-to-end way, collecting self-supervised data in real-world environments. The work in [78] makes two main suppositions: (1) only the events in the robot's experience, gathered with the onboard sensors, can be used to learn, and (2) undesirable events, for example collisions, are acceptable during learning, as long as these events are taken into account.

There is a predictive model whose inputs are the current observations taken from the sensors and a series of future intended actions, and which predicts the next events in the navigation. In symbols, f(o, a_{t:t+H}) → e^k_{t:t+H}, where o are observations, a are actions, t is the time, H is a time window, e are events, and k indexes the types of events. The main event types are position, collision, and quality of the terrain, i.e., how bumpy it is. This mapping is learned using a neural network. The training phase—which involves observations, actions, and event labels—aims at minimizing a loss function penalizing the distance between the ground truth and the predictions. The utilized loss function is the following one:

L(θ, D) = Σ_{(o_t, a_{t:t+H}) ∈ D} Σ_{k}^{K−1} L_k( ê^k_{t:t+H}, e^k_{t:t+H} ),   with   ê^{0:K}_{t:t+H} = f_θ(o_t, a_{t:t+H}).   (63)

If the event is discrete, the individual losses L_k are cross-entropy losses; otherwise, in the continuous case, they are the mean squared error.

The reward function R(ê^k_{t:t+H}) in the planning approach encodes the actions the robot is supposed to take, according to the subsequent events predicted by the model. Employing this function (in conjunction with the learned predictive model), the authors solve at each time step the following planning problem:

a* = argmax_a R( f_θ ).                                                       (64)

A stochastic optimizer proposed in [119] is used to solve the above equation. The approach executes the first action and continues planning and executing actions by means of a model-predictive control technique.

The authors demonstrated that the proposed approach is able to learn to navigate in scenarios where challenging obstacles were present, e.g., tall grass, and is able to include terrain preferences, e.g., avoiding bumpy ground. The results were obtained after training on 42 hours of experimental data.

In [56], the authors propose a hybrid simulation approach that mixes mathematical models of robotic systems with neural networks, where the latter learn aspects of the dynamics that are not considered in the analytical models. The authors propose the use of residual models, in which data-driven models are utilized only in some parts of the simulation process. Using gradient-based optimization, the work seeks to efficiently identify proper simulation parameters and network weights. Given a trajectory {s*_t}_{t=1}^{T} in the state space, to perform the system identification the authors optimize the following loss function:

min_{θ = [θ_AM, θ_NN]}  L = || f_θ(s_{t−1}) − s*_t ||² + R ||θ_NN||²,          (65)

where the parameters θ_NN correspond to the neural network weights in the simulation. The authors regularize the network weights by a factor R to minimize the residual dynamics. The authors solve the nonlinear least-squares problem in (65) using the Levenberg–Marquardt algorithm (LMA). To escape local minima, the authors employ a random search scheme, namely parallel basin hopping (PBH) [111]. The implementation runs in parallel, using several simulation and LMA solver instances, while the initial values of the parameters are continuously randomized. The local solvers restart after a convergence criterion, a time limit, or a maximum iteration count is met. The authors apply their method to a physical system. Utilizing the trajectories of joint positions found in the double-pendulum dataset in [5], they optimize the inertia, mass, and link lengths of the simulated model and achieve a minimal difference between the simulation results and the results obtained in the real experimentation.
786 Intelligent Service Robotics (2021) 14:773–805

collisions with objects not seen during navigation. The sys- modeled as:
tem is trained to minimize an objective within the model
1 
predictive control framework. It back-propagates utilizing N

a visual dynamic model, and a network that estimates J trav = (1 − p̂t+i


trav
), (68)
N
traversability. Both the model and the network are differ- i=0

entiable; they are inspired by the works in [61,63].


trav is a traversable probability estimated from RGB
where p̂t+i
The trajectory is modeled as a sequence of images (provid-
ing a series of sub-goals) between starting and goal locations. images (see [61]).
The images are sampled considering a fixed time interval. The reference loss generates more smooth and realistic
360o images are represented as two 180o fisheye images. The velocities. It minimizes the difference between the gener-
trajectory is represented in the form of a chain of K sub-goal ated velocities (vi , wi )i=0...N −1 and the reference velocities
ref ref
f f
image pairs (I0 , I0b ), . . . , (I K , I Kb ), where the super-index f (vt+i , wt+i )i=0...N −1 . Thus, it is defined as:
denotes front and b denotes back. The control policy aims at
minimizing the difference between the current 360o camera N −1
1  ref
N −1
1  ref
f
image at time t, (It , Itb ), and the next sub-goal image in the J ref = (vt+i − vi )2 + (wt+i − wi )2 . (69)
f N N
trajectory, (I j , I bj ), while avoiding collisions with obstacles. i=0 i=0

The absolute pixel difference between current and sub-goal


f f
images, that is, |I j − It | + |I bj − Itb | < dth , is the condition Training is done offline to solve the optimization prob-
used to switch to the next sub-goal. Variable j refers to the lem. Since the computations are performed in the computer
current image index, and dth denotes a threshold computed onboard the mobile robot, the eight-convolutional layers net-
experimentally. work was designed to allow fast online computation. The last
The method is trained in a way such that it minimizes a layer employs a tanh(·) activation to keep the linear and angu-
cost function J that incorporates two objectives: (1) to move lar velocities within limits ±vmax and ±wmax , respectively.
across traversable safe areas and (2) to follow the trajectory. Other authors that used hybrid strategies include [14,81,
These two objectives are achieved considering a weighted 82,112,185].
sum of the three following losses: an image loss, a traversabil-
ity loss, and a reference loss. 4.1.3 Deep learning for learning representations

J = J img + K 1 J trav + K 2 J r e f . (66) Deep learning can be used to learn abstract representations,
which in turn can be used for different robotics tasks.
The image loss, J img , is modeled as the mean absolute In [43], the authors propose a data-driven approach to learn
f
pixel difference among the sub-goal image (I j , I bj ) and an manifolds, denoted as M, in order to integrate them in a
arrangement of N predicted images ( Iˆt+i , Iˆt+i framework of sequential manifold planning. A manifold is
f b )
i=1...N , that
is, a collection of local neighborhoods that resemble Euclidean
spaces. Thus, a global representation of M can be built by
1 
N obtaining characterizations for its Euclidean local neighbor-
wi (|I j − Iˆt+1 | + (|I j − Iˆt+1 |),
f f f f
J img = hoods. The work in [43] proposes to utilize a neural network
2N · N pi x
i=0 to learn/model a single global representation of the constraint
(67) manifold.
Two different methods are analyzed: variational autoen-
where N pi x denotes the amount of pixels in the image, and coders (VAE) [125], and a new method called equality
( Iˆt+1 , − Iˆt+1
f b ) represents the predicted images obtained using constraint manifold neural network (ECoMaNN). The two
a model based on the work in [63]. approaches are evaluated to determine the path quality gen-
The network structure in [63] employs an encoder– erated during the planning stage and their capability to learn
decoder arrangement, where the robot velocities are concate- representations of constraints from datasets. Sequential man-
nated to the latent vector. There is also a blending module ifold planning introduces the planning problem as an arrange-
that delivers front and back virtual images by performing an ment of (n + 1) such manifolds M = {M1 , M2 , . . . , Mn+1 }
information fusion based on the input front and back images. and a starting configuration qstar t ∈ C M1 , which lays in
The authors utilize bilinear sampling for mapping pixels in the first manifold. The objective is to obtain a path from
input images to predicted images, which is similar to the one qstar t that passes through the manifold arrangement M and
in [195]. ends at a configuration that lays in the goal manifold Mn+1 .
The traversability loss J trav has the objective of penal- The problem is formulated as finding the path within a set
izing sections that present a risk for the robot. That loss is b f τ = {τ1 , . . . , τn }, such that the path minimizes the sum

123
Intelligent Service Robotics (2021) 14:773–805 787

of subpath costs subject to the constraints established by M prediction is compared to the next encoded state zt+1 =
and of being free of collisions; for more details, see [43]. φ (xt+1 ) in the latent space. To ensure that this second
h enc
Six different datasets are evaluated, the constraint of being loss is dynamically consistent, the dynamics around zt and
on a manifold corresponds, for instance, to a 3D point which ut are linearized and the weighted controllability Gramian
must be on the surface of a sphere or to a 6 DOFs robot G t is computed as
manipulator that seeks to keep its orientation in a verti-
dyn dyn
cal direction, e.g., to transport a cup. The performances of δh ψ δh ψ
the models are compared utilizing an implicit function. The At = (zt , ut ), Bt = (zt , ut ), G t = At Bt BtT AtT ,
δz δu
results with both methods are similar. Finally, the learned
manifolds are employed in a sequential motion planning resulting in the loss function
problem. The authors conclude the paper by saying that ECo-
= (ẑt+1 − zt )T G −1
dyn
MaNN method was successful in learning equality constraint Lt t (ẑt+1 − zt ).
manifolds, which can be utilized by a sequential planning
algorithm. However, before the dynamics network has been trained, the
In the context of feasible motion planning, the authors of aforementioned Gramian may be ill-conditioned; thus, the
[70] present a methodology, latent sampling-based motion authors begin the training with the l2 norm and gradually
planning (L-SBMP), aiming at computing motion plans shift to the G −1
t -based norm.
for high-dimensional robotic systems by learning a plan- The collision checking network h cc ζ (zt , zt+1 , Xobs ), which
able latent representation. The main objective is to model is a binary classifier, is trained with a dataset of states and
the underlying manifold of the task by learning a low- a label denoting if the connecting trajectory is in colli-
dimensional latent representation of it, in which sampling- sion. The collision checking network h cc ζ (zt , zt+1 , Xobs ) is
based algorithms can be utilized. Such manifold corresponds learned after the encoder and dynamics have been learned.
to the robot’s operational regime modeled as a latent space, Given all the individual modules described above, the authors
where a global implicit representation is built through only propose the learned latent rapidly exploring random tree
local connections. (L2RRT), which is basically an RRT capable of planning in
The latent space is learned by means of an autoencoder the latent space through the encoding of the planning problem
along with latent dynamics. The latter allows to perform sam- employing the network h enc φ , and whose resulting trajectory
pling in the latent space and connect local states respecting is decoded back to the state space using the network h dec θ .
the system dynamics. More precisely, the proposed approach The work in [54] utilizes a traversability representation
is comprised of three networks: first, an autoencoder of the of the robot’s workspace to enable the robot to perform off-
latent space, second, a latent local dynamics model, and third, road driving. The used robot is the LAGR platform equipped
a collision checker. Therefore, denote X as the state space, with a stereo-based obstacle detection module. The authors
Xobs as the obstacle space, Z the learned latent space and present a self-supervised learning process for long-range
zt ∈ Z as a latent state at time instant t. Then, the authors vision capable to perform terrain classification at distances
seek to learn a map m that goes from X to Z, a map n from Z up to the horizon. The learning process has four main com-
to X , latent dynamics f Z that are executed on Z, in addition ponents: (1) prepossessing and normalization of the acquired
to a collision checker gZ in the latent space. Summarizing, images such that the height of objects is independent of their
the authors propose to learn the next elements: distance to the camera. (2) A stereo vision supervisor that
employs several techniques, e.g., plane estimation, to assign
m : X → Z, n : Z → X , zt+1 = f Z (zt , ut ), class labels (super-ground, ground, foot-line, obstacle, and
gZ (zt , zt+1 , Xobs ) = {1 if xt , xt+1 ∩ Xobs = ∅, super-obstacle) to the terrain, resulting in a database for fur-
ther training a classifier to extend the label assignment to
0 otherwise}.
terrain beyond the stereo vision range. (3) Image feature
extraction using deep learning to gain a more discriminative,
The latent space has the customary treatment for autoen-
but also more concise representation with reduced dimen-
coders and is trained from a set of trajectories obtained
sionality. (4) A classifier that has fast adaptability as a result
operating the robot (sequences of states xt and actions ut ).
of being trained on every frame.
φ (xt ) and a decoding
Thus, there are an encoding network h enc
Regarding the image feature extraction, the authors tested
dyn
θ (zt ). A third network, h ψ (zt , ut ), is the one
network h dec several approaches. However, the best results were obtained
that models the system dynamics over the latent space Z utilizing a fine-tuned convolutional autoencoder. An autoen-
dyn
and predicts the next latent state ẑt+1 . To train h ψ (zt , ut ), coder [52] is comprised of an encoder, Fenc (X ), which
θ (ẑt+1 ), and compared to
the prediction is decoded as: h dec retrieves a feature set from the input X , and a decoder,
the system state xt+1 through an l2 norm. Additionally, the Fdec (Y ), which attempts to reconstruct the input X from the

123
788 Intelligent Service Robotics (2021) 14:773–805

vector of features. Both the encoder and decoder are trained in of obstacle’s point clouds x from Nobs different workspaces,
a simultaneous fashion, seeking to minimize the mean square and x̂ is the point reconstructed by the decoder.
error between the encoded and decoded reconstruction and The planning network (Pnet) learns to predict the robot
the input X , given by the next expression: configuration at time step ĉt+1 given the robot configuration
at time ct , goal configuration cgoal , and the obstacles’ point-
1 
P cloud embedding Z . Pnet has the form of a feed-forward deep
L(S) = ||X i − Fdec (Fenc (X i ))||22 , (70) neural network with parameters θ p . To train Pnet, a dataset
P
i=1 of trajectories σ = {c1 , c2 , ...cT } is employed, which can be
obtained from any classical motion planner, e.g., sampling-
where S is a dataset with P training samples. The feature based planners. The training objective is a mean-square-error
vectors resulting from images X are used as the input for the loss between the predicted ĉt+1 and the actual ct+1 , that is,
classifier. To be more precise, there are five binary classifiers,
one per label category, in the form of logistic regression. N̂ T j −1
Then, for a feature vector x, the output of each regression i 1 
l Pnet (θ ) =
p
||ĉ j,i+1 − c j,i+1 ||2 , (74)
is computed using the logistic sigmoid function: Np
j i=0

1
qi = f (wi x + b), where f (z) = . (71) where N p is an averaging term, T j the length of the jth
1 + e−z trajectory, and N̂ is the amount of all training paths.
The weights of each regression are adjusted every frame Finally, Enet and Pnet are coupled to perform online
based on the loss function: planning in an incremental bidirectional path generation
heuristic. The authors evaluate MPNet in various 2D and
3D environments. Results exhibit that MPNet can general-
Loss i = − pi log qi − (1 − pi ) log(1 − qi ), (72)
ize to completely unseen environments. Moreover, MPnet
presents a computation time that consistently remains lower
where pi is the probability that the sample belongs to class
than existing state-of-the-art motion planning algorithms.
i, as given by the stereo supervisor labels. To adjust the
The work in [26] proposes an approach to estimate swept
weights, stochastic gradient descent is employed, where
volume computation using deep learning. The authors intro-
the gradient update for the ith regression with sample x is
duce an efficient hierarchical approach called hierarchical
wi = −η( pi − qi )x. Finally, the classified terrain can be
neighbor search (HNS) to apply the trained estimator. Ini-
used for navigation purposes.
tially, the method pre-filters the samples using a Euclidean
Other authors that used deep models for learning repre-
estimator with weighted values trained on the swept vol-
sentations include [38,61,101,164,183].
ume. It then applies the deep neural network estimator. This
approach is used in sampling-based motion planners (PRM
4.1.4 Deep learning with sampling-based motion planners and RRT) to perform a nearest neighbor selection.
A single-layer network called Dwe sv learns the weights
Sampling-based motion planners have been very successful w∗Dwe sv that models a weighted Euclidean distance metric
in robotics. Some authors have taken advantage of the mature dw∗ (c1 , c2 ) with reference to a training dataset, that is,
techniques used in such motion planners and combined them
with deep learning. Dwe
sv
(w∗Dwe
sv ) = dw+∗ (c1 , c2 ), (75)
The authors of [128,129] present a neural network-based
planning algorithm referred to as motion planning networks
where c1 , c2 are two configurations. A stochastic gradient
(MPNets), which consists of two components: an encoder
descent optimizer is used to find w∗Dwesv minimizing a L 2
network and a planning network. The encoder network (Enet)
loss, more precisely,
learns to encode the given workspace directly from a point
cloud measurement into a latent space. Enet takes the form
of a contractive autoencoder (CAE) [137] where the recon- 
n
w∗Dwe
sv = arg min (Dwe
sv  (c1,i , c2,i ))2 ,
(w, c1,i , c2,i )− SV
struction loss is defined as: w
t=1

  (76)
1
l AE (θ e , θ d ) = ||x − x̂||2 + λ (θiej )2 , (73)
Nobs  the approximated swept volume. The authors men-
x∈Dobs ij being SV
tion that the weighted Euclidean distance metric is unable to
in which θ e and θ d are the encoder and decoder parameters, well approximate the swept volume because the model is not
respectively, λ is a penalizing coefficient, Dobs is a dataset able to capture nonlinearities. Hence, they propose to use

123
Intelligent Service Robotics (2021) 14:773–805 789

a deep neural network to learn a nonlinear swept volume ment in an order of magnitude in terms of the convergence
model. There are 2d f input neurons, the first d f correspond to the minimum cost and success rate.
to c1 , while the secondd f to c2 . The output is a neuron esti- Other authors that used sampling motion planners include
mating the swept volume between two configuration points. [22,71,91,104,130,167,191].
Stochastic gradient descent is used to find the weights and
biases with respect to a L2 loss and the dataset. Those weights 4.2 Deep reinforcement learning algorithms
and biases are given by the next equation:
Although successful reinforcement learning systems using

n neural networks to estimate, for instance, the value function,
(W ∗, b∗)Ddnn
sv = arg min (Ddnn
sv
((W , b), c1,i ) have been previously developed (e.g., TD-gammon [168]),
(W ,b) i=1 (77)
it was until the introduction of deep learning that an accel-
 1,i , c2,i )) .
− SV(c 2
erated development began to emerge in the area of robotics.
One of the main reasons for this trend was the possibility of
The dataset consists of one hundred thousand training using raw images as input for learning policies in sequential
samples per robot. Several experiments are done to validate decision problems. There are, however, several challenges
the approach and determine the performance of the differ- that still need to be solved, and where recent research has
ent steps of the method in several robots. Network Dwe sv is been focused. In particular, reinforcement learning methods
around 2000 to 3000 times faster than querying Dwe . How-
sv require a vast amount of data obtained from experience in
ever, recall that it cannot approximate well the swept volume the task environment, which is not practical and sometimes
because of nonlinearities. Inference with Dwe sv is 3500 to 5000 not feasible in many robotics applications. Another concern
times faster than SV  computation. among researchers is how to generalize the learned models
Deep learning has also been used to bias sampling within to other contexts and/or to efficiently adapt them to chang-
motion planners. The work in [69] proposes a methodology ing conditions. Finally, one more interest exists in how to
for nonuniform sampling, in which a sampling distribution avoid catastrophic events during the learning process. In the
is learned (using human demonstrations, earlier paths in the following subsections, we review some of the most represen-
state space, successful motion plans, or other sources that tative research works that apply deep reinforcement learning
supply information about the system operation) and used to to robotics.
bias sampling. The sampling distribution has the form of
a conditional variational autoencoder (CVAE) [155], which 4.2.1 Navigation and tracking
can generate conditional data from the latent space of the
model. In the context of motion planning, the conditioning Some DRL researchers have focused their work on navi-
variables represent external factors, such as obstacles infor- gation and tracking tasks, e.g., [49]. Considering obstacle
mation or initial state and goal region. avoidance is a challenging problem because planning is NP-
The main objective is to use the set of sampled points that hard [20] and in PSPACE [19]. In [49], the authors compare
are close to the optimal trajectory to build a distribution. Con- actions of DRL policies and the expected cumulative reward
sidering x as a sampled point, and y as the finite-dimensional with stochastic reachability (SR), considering a single obsta-
encoding of the planning problem and other exterior charac- cle problem where the only obstacle moves from left to right
teristics, p(x|y) denotes the conditional density of sample and focusing on the A3C [114] DRL algorithm. SR analy-
points, conditioned on y. Such a distribution is constructed sis evaluates the likelihood that the robot remains in a given
as a latent variable model in which the joint density of subset of the state space. This method obtains the avoidance
the sampled points and the latent variable are denoted as policy that is optimal, in addition to providing guarantees
p(x|z, y) p(z|y), with z being a latent variable. Taking that in convergence. However, it is computationally expensive
into account, a neural network, with qφ (z|x, y) as encoder and requires dynamics models of the obstacles. Nonetheless,
and pθ (x|z, y) as decoder, is used to build the CVAE. After DRL policies can learn to predict motion of obstacles and
training, the decoder is capable of approximately producing the corresponding robot avoidance actions, mapping directly
samples from p(x|y) by sampling from p(z|y) = N (0, I ), from sensor observations. However, these methods typically
which is a normal distribution of the latent variable. cannot guarantee policy convergence.
Finally, after the sampling distribution is learned in the The authors of [49] contemplate a holonomic robot with
offline stage, the obtained samples are utilized in conjunc- a radius R r , and an obstacle with radius R o in a two-
tion with the samples (e.g., uniform sampling) within the dimensional workspace. At x rn , the robot’s objective is to
sampling-based planners to ensure coverage of the state avoid the obstacle at x on , at each time step n, by changing
space. The proposed methodology is tested within different its heading angle θnr and moving at a constant speed vr .
planners in different planning problems, achieving improve- The obstacle traverses following a line considering a heading

123
790 Intelligent Service Robotics (2021) 14:773–805

angle θ o , a velocity wn , and speed wn = |wn |, which might These policies (SR and RL) are tested in two kinds of sce-
vary according to a probability density function, p(wn ). The narios, one with deterministic obstacle motions and the other
stochastic obstacle speed space is denoted by W, and robot with stochastic ones. As result, the authors conclude that the
action space is denoted by U. RL policy empirically performs better than other methods
The discrete-time dynamics in relative coordinates ( x̃ = reported in the literature, since the optimal collision proba-
x − x o ∈ X̃ ) is:
r bility can be approximated well by the state-value function. In
some cases, the RL policies can differ from the optimal pol-
x̃ n+1 = x̃ n + ( f r (un , θnr ) − f o (wn )), (78) icy with a negative impact on collision avoidance. However,
the evidence suggests that the actor net is the component that
with  denoting the time step, un the action of the robot, and causes this discrepancy, which fails to correctly approximate
with the dynamics of the obstacle and the robot denoted as the action that corresponds to the state-action with highest
f o (wn ) and f r (un , θnr ), accordingly. A collision is said to value.
occur when In [108], the authors study an active tracking task in which
the method receives as input observations and generates as
output signals that control the camera, for instance, move
|| x̃ n ||2 ≤ R r + R o . (79) forward, turn left, etc. This work proposes an end-to-end
solution using DRL. In order to predict the action from a
Summarizing the SR formulation (see [25] for more frame, a ConvNet-LSTM network is employed. To perform
details), the authors first define an indicator function, 1 K ( x̃), a successful training, the authors also propose a reward func-
which takes a value of one when at x̃ the system is not in tion along with an environment augmentation scheme. Even
after losing the target, the system can restore tracking. The
collision and takes a value of zero otherwise. Thereafter, the reinforcement learning (RL) algorithm A3C [114] is used.
stochastic transition kernel, τ ( x̃ n+1 | x̃ n , un ), is defined. The They learn the parameters θ and θ  in the networks over a
value function V , corresponding to the probability of colli- trajectory τ , by means of an actor-critic approach using value
sion along a finite time horizon N , is obtained as in [1] from function regression and stochastic policy gradient, utilizing
time step N the next expressions:
θ ← θ + α∇θ log π(at |st ; θ)A(st , at ) + β∇θ H (π(at |st ; θ)),
VN ( x̃) = 1 K ( x̃), (80) 1  (85)
 θ  ← θ  + α∇θ  (Rt:t+n−1 + γ n V (st+n ; θ − ) − V (st ; θ  ))2 ,
2
Vn ( x̃) = 1 K ( x̃) Vn+1 ( x̃  )τ ( x̃  | x̃ n , un )d x̃  (81)

X̃ where V (s) is the value function, A(s, a) = Q(s, a) − V (s)
∗ denotes the advantage function, and H (·) denotes a term that
= 1 K ( x̃) Vn+1 ( x̃ + ( f nr − f no )) p(w). (82) 
w∈W regularizes entropy, β denotes the regularizer factor, θ − the

parameters in the previous iteration, and the θ the parame-
where f nr = f r (un , θnr ) and f no = f o (wn ). ters against which the optimization is carried out. The tracker
The optimal value function V ∗ is obtained at each iteration network has three components: the sequence encoder, the
by choosing the action that maximizes the value function, that observation, and the actor-critic network. The first two ele-
is,
  ments extract and fuse features. The critic is in charge of

V ∗n ( x̃) = max 1 K ( x̃) ∗
Vn+1 ( x̃ + ( f nr − f no )) p(w)) .(83) approximating the value function V (st ). On the other hand,
u∈U
w∈W the actor outputs the stochastic policy π(·; st ). The networks
parameters are updated during training using the outputs of
Using DRL, the problem is framed as a partially observ- V (st ) and π(·; st ), see Eq. (85).
able Markov decision process, described by (S, A, O, The authors establish a reward function using the pose of
R, T , ρ, γ ) with state space S, action space A, observa- the moving object in a local reference frame given by the
tion space O, R(s, a) as the reward function, T (s, a, s  ) = tracker orientation and position. The reward function is the
P(s  |s, a) as the state transition model, ρ(s, o) = P(o|s) as following:
the observation model, and γ as a discount factor. The work
 
is interested in obtaining the policy π ∗ that is optimal, more x 2 + (y − d)2
precisely, r = A− + λ|w| , (86)
c

π ∗ = arg max Eτ ∼π [R(τ )] . (84)


π with (x, y) the coordinates of the object moving in the local
reference frame, w the orientation of the same object with
A3C [114] approximates the optimal policy using actor respect to the x-axis of the local reference frame, A > 0, c >
and critic neural nets, thought policy gradient [150] and Bell- 0, d > 0 and λ > 0 are parameters to be tuned. At the moment
man’s equation. when the object lies in front of the agent below a distance d

123
Intelligent Service Robotics (2021) 14:773–805 791

and with no rotation (the heading of the object is the same as failed grasp. A grasp is considered successful if, at the end
that of the tracker), the agent achieves the maximum reward of an episode, the robot is able to hold an object and exceed
A. In the equation, c functions as a normalization factor for a specified height.
the distance. The authors make exhaustive experimentation In the formulation, s ∈ S to denote the state, a ∈ A
in simulated environments, and they also implemented the denotes the action, reward is denoted r (st , at ), t is the time. In
active tracker in a TurtleBot robot in both outdoor and indoor practice, the authors aim to learn a parameterized Q-functions
real environments obtaining good results. Q θ (s, a), with θ denoting the neural network weights. The
In [163], the authors propose a learning-based motion minimum error Q-function is learned by minimizing the Bell-
planner, which does not need a previously built map of the man error, which is:
environment. The7 technique uses sparse 10-dimensional
range measurements and the target position w.r.t. the mobile E(θ ) = E(s,a,s  )∼ p(s,a,s  ) [D(Q θ (s, a), Q T (s, a, s  ))]. (87)
robot coordinate frame as input. The output corresponds to
continuous steering commands. The learning technique is the D is some divergence metric, namely, the cross-entropy
so-called deep deterministic policy gradients (DDPG) pro- function, whose total returns lie within [0, 1]. The term
posed in [100]. Q T (s, a, s  ) = r (s, a) + γ V (s  ) represents a desired value.
Other authors that have combined DRL with motion plan- The expected value is computed considering the distribu-
ning for navigation and tracking include [16,27,28,42,107]. tion p(s, a, s  ), over all previously observed transitions (in
A work that might be used to initialize the tracking process, the replay buffer). V (s  ) is a target value and γ a damped
which uses DRL, is presented in [15]. discount factor. In the implementation, two target networks
are used to improve stability, which is achieved by keeping
4.2.2 Massive amounts of data and additional supervision two shifted versions of the parameter θ , θ̄1 and θ̄2 . Param-
eter θ̄1 is the exponential moving averaged version with a
For learning adequate policies, direct use of DRL requires mean constant value of 0.999, and θ̄2 a version of θ̄1 that is
large amounts of data. Some authors have shown that this, shifted by an estimate of 6000 gradient steps. The goal value
with enough resources, is a feasible approach, while others is V (s  ) = mini=1,2 Q θ̄i (s  , arg max Q θ̄1 (s  , a  )). Once the
a
have introduced some additional feedback to reduce the sam- Q-function has been learned, the remaining step is to recu-
ple complexity. perate the policy as:
The work in [79] introduces QT-Opt, a vision-based RL
framework that is claimed to be self-supervised and scalable, π(s) = arg max Q θ̄1 (s, a). (88)
in addition to be able to profit from around 580,000 real- a
world gripping attempts as the input of a neural network
Q-function in its training phase. This neural network encodes In a practical implementation, this method collects sam-
1.2 million parameters performing closed-loop grasping in ples from the environment interaction, followed by perform-
real-world trials, reaching a 96% generalization success on ing off-policy training employing all samples gathered until
objects not seen during training. then. A parallel asynchronous version of this procedure is
The experimental implementation considers reasonable used. The image s and the action a become the network
assumptions: the observations are taken from a monocular inputs, and the arg max in Eq. (88) is rated following a
over the shoulder RGB camera, and the actions are com- stochastic optimization algorithm, which is able to handle
prised of opening and closing commands for the gripper, multimodal and nonconvex optimization scenarios.
considering end-effector Cartesion motion. Let πθ̄1 (s) denote the policy related to the Q-function
Unlike most reinforcement learning tasks in the litera- Q θ̄1 (s, a). Equation (88) can be recovered by means of sub-
ture, the main challenge in this duty is not only reward stituting the optimal policy
maximization but an effective generalization of previously
unseen objects. This needs a set of very diversified objects πθ̄1 (s) = arg max Q θ̄1 (s, a), (89)
a
during training. To make the most of this diverse dataset,
authors employ an off-policy training method that uses a instead of the arg max argument to the target Q-function.
continuous-action generalization of Q-learning, which they In the algorithm, QT-Opt, πθ̄1 (s), is instead evaluated
name QT-Opt. by performing a stochastic optimization over a, applying
The method considers robotic manipulation as a general Q θ̄1 (s, a) as the objective value.
formulation of a Markov decision process. During data col- A method based on cross-entropy (CEM) is used to per-
lection, the grasping task considers that only a single reward form this optimization, whose parallelization is attainable
is given to the learner, namely, a reward of value 1 after a and robust to local optima, provided that the problem has low-
successful grasp, whereas a reward of 0 is the result of a dimensionality [139]. CEM is a simple optimization method

123
792 Intelligent Service Robotics (2021) 14:773–805

since it does not require derivatives. Moreover, at every itera- dimensions. The three feature vectors that result from that
tion, it samples a batch of size N , suits a Gaussian distribution process are concatenated in a single vector. Then, they are
for the best M < N samples, and samples a new batch of N sent to a multimodal fusion module to obtain a multimodal
values utilizing the previously suited Gaussian. In the appli- representation of 128 dimensions. Using self-supervision,
cation, two iterations of CEM are performed with M = 6 and human annotation is avoided. Based on the current sensory
N = 64. This method is applied to determine targets while data representation and the next action of the robot, two ele-
training and to select actions in the real environment. ments are predicted: (1) the optical flow obtained by the
The work in [99] deals with a peg insertion problem. The action and (2) whether or not in the next control cycle the
authors point out the importance of the manual design of a end-effector will be in contact with the environment.
robot controller which combines different modalities. While As in [36,48], based on known robot geometry and
DRL has exhibited success in computing control policies in kinematics, the authors automatically generate optical flow
cases where the inputs are high-dimensional, it is common annotations as ground truth. Simple heuristics are used over
that these algorithms are out of hand to be placed on real F/T readings to obtain binary states of contact that also serve
robots because of the complexity intrinsic to sample retrieval. as ground truth. Correlation and redundancy are exploited by
Self-supervision is used to learn a multimodal and compact introducing a learning objective. This one foresees if there is
depiction of the sensor inputs. Haptic feedback is used at a a temporal alignment between two sensor streams [123]. A
frequency of 1k hertz and sends commands at 20 hertz. 2-layer MLP is used as a predictor of the alignment. This pre-
The authors formulate the manipulation task as a dis- dictor gets as input the representation of low dimensionality
counted Markov decision process (MDP) M, with finite and returns a binary classification to determine the alignment.
horizon, state set S and action set A. They seek to maximize The action optical flow is trained utilizing the mean end-
the next reward: point error(EPE) loss over all pixels [36]. The contact and
alignment predictions are trained using cross-entropy loss.

T −1 While training, the summation of the three losses is min-
J (π ) = E π γ rt (st , at ) , (90) imized employing stochastic gradient descent on a dataset
t=0 composed of trajectories. The network outputs a feature vec-
tor that is comprised of multimodal data. That vector serves
where r denotes the reward function, the discount factor is as input to the policy that performs the manipulation task,
referred as γ ∈ (0, 1], at are the actions and st the states. The which is learned using RL.
policy is modeled by means of a neural network with param- Model-free reinforcement learning is used to address
eters θπ . Given high-dimensional haptic and visual sensory contact-rich manipulation. A multilayer perceptron with two
data, S is defined by a representation of low dimensionality layers is used to obtain the policy. This network has as input
that is learned. A neural network having θs as parameters the multimodal representation with dimension 128 and gen-
is such representation. x corresponds to 3D displacements erates as output a displacement of the robot end-effector, x,
and A is a continuous action space. in the 3D space. The controller produces direct torque com-
This work also deals with finding sources of supervision mands τ , using as input x at 20Hz. The output is produced
that do not require human annotation. The authors propose at a frequency of 200Hz. Producing Cartesian controls has the
a self-supervised neural network architecture. This network advantage over joint space controls that it does not require to
has as input the information obtained with three sensors: learn the nonlinear mapping between the 7D joint space and
RGB images, the velocity and position of the end-effector the 3D Cartesian space. Finally, there is a trajectory genera-
and readings from force–torque (F/T) sensors over a 32 mil- tor that relates the output of the policy and the torque control
lisecond window. of the robot. The resulting trajectory is used as the input to a
This work uses, for visual feedback, a 6-layer con- proportional derivative impedance controller to calculate an
volutional neural network (CNN). This CNN is similar acceleration command.
to FlowNet [36]. The network encodes 128x128x3 RGB Deep reinforcement learning requires large quantities of
images. A fully connected layer is used for transforming the data. Other researchers have used or provided large datasets
final activation maps into a 128-d feature vector. For hav- for learning (e.g., [13,31,75])
ing haptic feedback, the last 32 readings from the six-axis
F/T sensor are used. That reading has the form of a 32x6 4.2.3 Simulation-based learning
time series. As in [122] 5-layer causal convolutions with
stride 2 are performed to convert the force signal in a vec- Since it is not practical for many applications to use directly
tor of 64 dimensions. For proprioception, the end-effector’s a physical system to explore a real-world environment using
position and velocity are encoded using a multilayer percep- reinforcement learning (e.g., crashing a car thousands of
tron (MLP) with two layers to obtain a features vector of 32 times before learning to drive), it is common to use simu-

123
Intelligent Service Robotics (2021) 14:773–805 793

lators, where the main challenge is how to guarantee that the iterations, where each successful iteration means driving the
learned policies can be applied to real conditions. car for 10 km.
In [2], the authors study the problem of simulation-based RL agents are trained in various simulated environments.
policy learning and transferring to the real world. However, The agents get rewards depending on the distance they can
it still is an open problem in research; there is work to do drive without human intervention. RL lets agents to learn
on this topic. The work proposes a method for simulation actions which maximize the total reward. In contrast, in
and training based on an end-to-end solution, which is able supervised learning, the agents learn to imitate a human
to deploy agents in the real-life world using reinforcement driver. Using the feedback from human interventions, the
learning trained in simulated environments. The work shows agents optimize the policy. This has as a result that the agents
that the direct deployment of the models in the physical world are able to drive longer distances.
is feasible after training, using environments different to the The author compared several simulation environments:
ones used for training. (A) real-world: imitation learning (IMIT-AUG), using real-
An engine called VISTA (Virtual Image Synthesis and world images and control the authors assess models trained
Transformation for Autonomy) is used to synthesize real- with end-to-end imitation learning. (B) Model-based sim-
world driving trajectories using a small dataset driving ulation: Sim-to-Real (DR-AUG), CARLA simulator, and
trajectories performed by humans. Given a desired curva- domain randomization are used. (C) Domain adaptation:
ture and velocity at a time step, VISTA computes the change (S2R-AUG), a model is trained using simulated and real
of state (orientation and position) at the next time step. This images to learn shared control. (D) Reinforcement learning
is performed in both the closest human trajectory and the cur- in VISTA and (E) human driving as a baseline.
rent simulated trajectory. The system considers the relative Each model is trained three times. Then, they are tested
transformations to the closest human trajectory to obtain the individually on every road of the test track. If the vehicle
desire transformation. This transformation is applied to the exists its path, the driver interferes. IMIT-AUG has obtained
3D image; then, this one is projected to a 2D image, and the the best results, VISTA obtained the second place.
whole process starts over again. In this way, a small number The main contributions of the work in [2] are three. (1)
of sample human trajectories can be converted to a very large VISTA, a photorealistic simulator used to synthesize new
number of new trajectories. It is said that the transformations perceptual inputs. (2) An end-to-end learning pipeline that
are local rotations (±15◦ ) and translations both lateral and generates lane-stable controllers. (3) An experimental vali-
longitudinal ones (±1.5 m). Similar to the work in [3,11], dation, in which agents are directly deployed in the real world
the authors propose a method that learns lateral control. This after training.
control is based on the prediction of the desired curvature In [116], the authors propose reinforcement learning for
of motion. This curvature can be transformed to the vehicle training low-level policies in simulation, which are later used
steering angle using a bike model as in [88]. in different physical quadrotors. In particular, the proposed
The goal is to learn an autonomous policy, this policy approach uses neural networks for learning a robust quadro-
is parameterized by θ , it estimates an action ât = f (st , θ ) tor controller. The goal is to determine a policy that permits
from a dataset, of length n, of state-action pairs (st , at ) of to map the quadrotor state to rotor thrust. They assume that
human driving. Using supervised learning, the agent obtains good estimations of the quadrotor’s state (position, velocity,
an action minimizing an empirical error L. When reinforce- orientation, and angular velocity) is available. To ease the
ment learning is used, the agent has no feedback of the human policy transferring to multiple quadrotors, they define a nor-
action, at . The agent has a reward rt for every succeed- malized control input and include lag simulation and noise
ing action not requiring human intervention. The ∞discounted processes to make it more realistic. The policy is learned
accumulated reward Rt is given by Rt = k=0 γ rt+k ,
k using as input the error between goal and current states and
where γ ∈ (0, 1] is a discounting factor. The return, at with the proximal policy optimization algorithm or PPO. At
time t, obtained by that the agent represents the travelled the execution time, they translate the coordinates to employ
distance from the location at time t to the location where the policy as a tracker controller of the trajectory.
the agent is at the instant when the vehicle requires human In [21], the authors describe an approach to use extensive
intervention. Considering the space of all possible actions, simulation data with real data to learn policies. The idea is
the agent optimizes a stochastic policy: π(a|st ; θ ). The prob- to design a distribution of simulation parameters such that a
ability distribution is parameterized at time t as a Gaussian, policy trained on those data should perform well in a real
(μt, σ 2 t). Therefore, the policy gradient, ∇θ π(a|st ; θ ), can environment. Domain randomization has been previously
be computed analytically. During training, the parameters used for simulation training; however, it is difficult to find a
θ are updated in the direction ∇θ log(π(a|st ; θ )) · Rt . The distribution of simulation parameters that brings the observa-
RL algorithm is an episodic policy gradient algorithm, i.e., tions obtained from the trained policy under this distribution
REINFORCE, which converges after 1.5 million of training closer to the observations of the real data. In [21], the authors

123
794 Intelligent Service Robotics (2021) 14:773–805

minimize the differences between the observation trajecto- images is attained by minimizing the losses in Eq. (93), given
ries produced from policies obtained from simulations and as follows:
Lsem R (G R ; S , f S ) = E pr eal [CrossEnt( f S (r ), f S (G S (r )))],
from real-world experiments, such that the Kulback–Leibler (93)
measure of the parameter distribution between the old and Lsem S (G R ; S , f S ) = E psim [CrossEnt( f S (s), f S (G R (s)))].
new simulation parameters is below a threshold measure.
The shift loss in Eq. (94) constrains the consistency for
This is similar to trust region policy optimization (TRPO)
sequential inputs. s[x→i,y→ j] denotes the outcome of a shift
approaches, but applied to the simulation parameters. The
operation: shifting s, i pixels and j pixels along the X and Y
discrepancy function minimizes the differences between the
axes, respectively. The authors based this loss on the result
real and simulated observations calculated as a weighted sum
of [140], in which it is said that a trained stylization network
of L1 and L2 norms, with additional weights for each obser-
is shift-invariant to shifts of multiples of K pixels (where K
vation. To learn the policies, they used PPO. It is shown,
corresponds to the total downsampling factor of the network).
experimentally, that the approach is capable to learn policies
Thus, the shift loss in Eq. (94) restrains the shifted output
using mainly simulated data, which can be later used in real
to match the output of the shifted input, regarding the shifts
robots.
as image-scale movements. This loss is an alternative to con-
In [193], the authors transfer DRL policies for visual
sistency constraints on small movements, removing the use
control tasks from learning in simulated environments to
of optical flow information, allowing to meet the requisites
their implementation in the real world. During deployment,
of real-time control in robotics. The shift loss is given as:
the authors translate the real-world image to the synthetic
domain. The proposed approach is tested both in indoor and Lshi f tR (G R ; S ) = E psim ,i, j∼u(1,K −1) [||G R (s)[x→i,y→ j]
outdoor environments. Additionally, the authors propose a − G R (s[x→i,y→ j] )||22 ],
new shift loss technique that permits the generation of con-
Lshi f tS (G S ; R) = E pr eal ,i, j∼u(1,K −1) [||G S (r )[x→i,y→ j]
sistent synthetic image streams without enforcing temporal
constraints. The main idea is to adapt the real camera streams − G S (r[x→i,y→ j] )||22 ],
to the synthetic modality, only during the actual deploy- (94)
ment phase of robots in real-world scenarios. The authors
learn generative models to map between synthetic and real where u denotes the uniform distribution.
The full objective for learning is given by Eq. (95), where
domains and vice versa. They combine four different losses, a
λcyc , λsem , and λshi f t are the loss weightings.
standard loss typically used in the transfer domain (Eq. (91)),
L(G R , G S , DR , DS ; S , R, f S ) = LG ANR (G R , DR ; S , R)
a cycle consistency loss proposed in [196] (Eq. (92)), a
+ LG ANS (G S , DS ; R, S )
semantic loss proposed in [66] (Eq. (93)), and a shift loss
+ λcyc (LcycR (G S , G R ; R) + LcycS (G R , G S ; S )) (95)
(Eq. (94)), which the authors of the paper propose in the
work. + λsem (Lsem R (G S ; R, f R ) + Lsem S (G R ; S , f S ))
+ λshi f t (Lshi f tR (G R ; S ) + Lshi f tS (G S ; R)).
LG AN R (G R , DR ; S , R) = E pr eal [log DR (r )]
+ E psim [log(1 − DR ((G R (s)))], The optimization to be solved is given as:
(91)
LG ANS (G S , DS ; R, S ) = E psim [log DS (s)]
+ E pr eal [log(1 − DS ((G S (r )))], G ∗R , G ∗S = arg minmax L(G R , G S , DR , DS ). (96)
G R ,G S ,DR ,DS
where G R and G S are generative models, and DR and DS
are discriminators. The authors evaluate their approach in the Carla simulator
In [193], Eq. (92) constrains mapping with the cycle con- using the benchmark setup in [29,37], and with real-world
sistency loss [196] robotics experiments on visual navigation tasks both indoor
and outdoor.
LcycR (G S , G R ; R) = E pr eal [||G R (G S (r )) − r ||1 ], In DRL, it is common to train in simulation, and several
(92) environments have been developed to be used for learning
LcycS (G R , G S ; S) = E psim [||G S (G R (s)) − s||1 ].
and research (e.g., [12,30,86,158,172]).
The authors of the work in [193] take advantage that the
ground-truth semantic labels can be obtained from many 4.2.4 Dynamic environments
robotic simulators and a semantic constraint is added, similar
to CyCADA [66]. It is assumed that the semantic labels are One of the main challenges in many learning systems is how
known for the images in the domain S. It is also assumed that to learn models that can generalize or which can be effectively
for domain R, the ground labels are unknown (which is the
expected case for most of the real-world frameworks). Since adapted to new conditions.
f R may be unknown, f S is used to produce “semi” seman- The paper in [179] proposes an incremental reinforcement
tic labels for R. The translation to semantically consistent learning algorithm to deal with dynamic environments. The

123
Intelligent Service Robotics (2021) 14:773–805 795

idea is that given a learned policy and a model of an environ- (4) The policy is updated using the same training algorithm,
ment, if the environment changes, reflected in changes in the but with a reduced learning rate. Samples training examples
reward function, the proposed algorithm is able to quickly with equal probability from the base and target datasets are
learn how to act in the changing environment. also used. (5) The fine-tuned policy (target task) is evaluated
The system first learns a Q-function using standard Q- offline.
learning and at the same time learns a model E(S, A, R, P), A set of experiments are done, the objective is to deter-
by storing all the tuples (s, a, s  , r ) generated during the mine the efficiency of the sampling process and its robustness
learning process, where P is the probabilistic state transition to the variation of conditions. The method is compared with
model. Storing the model explicitly, however, is only possible other two methods; the target dataset increases its size along
in relatively small domain spaces. The proposed approach has with the experiments. The other two methods are: (1) Scratch,
a mechanism to detect changes in the environment based on this is a policy, which is training with a data set of 800 grasp,
changes in rewards, which means that the only change con- which use a randomly initialized Q-function. The authors
sidered by the system is if there are different reward values. want to assess the importance of the pre-trained parameters.
When this happens, they explore the changed environment (2) ImageNet, this policy is also trained with a data set of 800
using a uniform distribution over actions over most of the grasp, but it uses a modified Q-function architecture, where
states (states and actions of the original model) until the the convolutional trunk of the network is replaced with an
number of state-action pairs is similar to the number of state- architecture ResNet50 [55]. It is initialized with weights,
action pairs in the original model. During this exploration which obtained with training procedure based on the clas-
process, the system marks all the state-action pairs which sification of images of the ImageNet dataset [33].
rewards are different from the original ones. Once all this There are five modified tasks, six sizes of datasets (25,
information is gathered (all the state-action pairs with differ- 50, 100, 200, 400, and 800 grasp attempts), and three meth-
ent rewards), they first update all these state-action pairs and ods (the simple fine-tuning method proposed in this paper,
all their m neighbors using value-iteration and then update Scratch and ImageNet), yielding 48 policies to evaluate. The
the whole model using Q-learning. The idea is to start with authors evaluate the 48 policies of the target task using the
the new updates and all previous Q-values to define a new robot, which executes 50 grasps to evaluate the performance.
policy. The proposed project, in a way, assumes that there are The proposed method obtains better results than Scratch
only small changes in the environment, which seems reason- or ImageNet. Thus, the experiments show than transferring
able. from pre-trained conditions gets better results than transfer
In [77], the authors propose an approach and experiments from ImageNet.
for proposing a continuous learning framework. The authors Finally, the authors proposed continuous learning based
demonstrate a way to adapt robotics manipulation based on on offline fine-tuning. In this approach, an already modified
information obtained with computer vision algorithms that policy is used as initialization for a second modified (chal-
allows one fine-tuning using reinforcement learning. The lenge) task. For each challenge task, 800 exploration-grasp
main contributions are two: (1) an experimental study about datasets are used.
continuous adaptation for robotics learning and (2) exper- The authors conclude that the performance of continual
imental evidence showing that fine-tuning can be obtained fine-tuning and single-step fine-tuning is similar. The authors
using adaptation. also mention that RL is well suited for continuous learning in
A Q-function network is trained first; this function is robotics, because it has the ability to learn the combination
trained offline with 580,000 grasp attempts, the set on objects of perception and action.
that were used to train the network have 1000 different Other works related to dynamic environments and adap-
objects, which were visually and physically different. Later tation can be consulted in [39,60,85,118,121].
a real robot is used to collect trails, the parameters of net-
work were updated with this new trail and then the process 4.2.5 Risk-averse systems
is repeated.
The fine-tuning method has five different steps. (1) First, a general sampling policy is pre-trained (as described above). Then, the policy is fine-tuned following the next steps. (2) The pre-trained policy is used to collect data and explore the target task. (3) The same off-policy reinforcement learning algorithm used for the pre-training is initialized with the parameters of the pre-trained policy, using the target-task and base-task datasets as data sources [79]. It is assumed that the base-task dataset was employed for training the base policy. In one of the compared variants, the convolutional trunk of the network is replaced with a ResNet50 architecture [55], initialized with weights obtained with a training procedure based on the classification of images of the ImageNet dataset [33].
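The following minimal sketch (our own illustration under simplifying assumptions, not the actual implementation of [77]; the network, dataset format, and hyper-parameters are placeholders) shows the core of such a fine-tuning step: warm-start the off-policy learner from the pre-trained Q-network and draw minibatches from both the base-task and the target-task data.

```python
import copy
import random
import torch

def fine_tune(pretrained_q, base_data, target_data, steps=1000,
              gamma=0.9, lr=1e-4, target_fraction=0.5):
    """base_data / target_data: lists of (state, action_index, reward, next_state)."""
    q_net = copy.deepcopy(pretrained_q)              # warm-start from the base policy
    q_ref = copy.deepcopy(q_net)                     # frozen copy for bootstrapped targets
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for step in range(steps):
        pool = target_data if random.random() < target_fraction else base_data
        s, a, r, s_next = random.choice(pool)
        with torch.no_grad():
            y = r + gamma * q_ref(s_next).max()      # off-policy TD target
        loss = (q_net(s)[a] - y) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 100 == 0:                          # periodically refresh the frozen copy
            q_ref.load_state_dict(q_net.state_dict())
    return q_net
```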
There are five modified tasks, six dataset sizes (25, 50, 100, 200, 400, and 800 grasp attempts), and three methods (the simple fine-tuning method proposed in that paper, Scratch, and ImageNet), yielding 48 policies to evaluate. The authors evaluate the 48 policies on the target task using the robot, which executes 50 grasps to assess the performance. The proposed method obtains better results than Scratch or ImageNet. Thus, the experiments show that transferring from pre-trained conditions gives better results than transferring from ImageNet.

Finally, the authors propose continuous learning based on offline fine-tuning. In this approach, an already adapted policy is used as the initialization for a second modified (challenge) task. For each challenge task, datasets of 800 exploration grasps are used.

The authors conclude that the performance of continual fine-tuning and single-step fine-tuning is similar. They also mention that RL is well suited for continuous learning in robotics because it has the ability to learn the combination of perception and action.

Other works related to dynamic environments and adaptation can be consulted in [39,60,85,118,121].

4.2.5 Risk-averse systems

One relevant aspect of autonomous robotics is safety. It is desirable to avoid catastrophic events during training and deployment that can damage the robot or harm humans.

Learning policies within deep reinforcement learning can overfit and may not avoid rare catastrophic events. In [124], the authors propose an adversarial approach to learn risk-averse policies. The idea is to learn policies with low variance. For that, they use an ensemble of Q-value networks to estimate the variance. The proposed system uses two agents: a protagonist and an adversary. The protagonist tries to maximize the expected accumulated discounted reward while being penalized by the variance of the Q-value function, evaluated with the ensemble of Q-value networks. The adversary tries to minimize the expected accumulated discounted reward while maximizing the variance. They use an asymmetric reward function, with a low positive reward for good behaviors and high negative rewards for risky behaviors. At training time, they choose one of the Q-value functions randomly to generate an episode, and at test time, they use the mean value of the ensemble for selecting an action. The protagonist and the adversary are processed sequentially, starting with the protagonist, and the target functions include information from the other agent, but they are trained separately. In particular,

Q^P(s_t^P, a_t^P) = r(s_t^P, a_t^P) + \sum_{i=1}^{m} \gamma^i r(s_{t+i}^A, a_{t+i}^A) + \gamma^{m+1} \max_a Q^*(s_{t+m+1}^P, a_{t+m+1}^P),    (97)

where P refers to the protagonist and A to the adversary. A similar expression is used for the adversary.
similar expression is used for the adversary.
The work in [84] proposes to learn sensorimotor policies of the observation π f : f (O) → U has a lower simulation
that enable a quadrotor to fly extreme acrobatic maneuvers. to reality gap than a policy πo : O → U that acts on raw
The trained policy is modeled as a neural network that fuses observations.
information from various sensors onboard the quadrotor and
regresses thrust and body rate to execute the maneuvers. To Other risk-averse approaches include [133,152,192,194].
train the network, an imitation scheme in simulations is uti-
lized. The demonstrations are generated through a privileged
expert π ∗ , which consists of a model predictive controller
that generates collective thrust and body rates resulting from
solving the following optimization problem: 5 Taxonomies
 
N −1 
π ∗ = min x[N ]T Qx[N ] + (x[k]T Qx[k] + u[k]T Ru[k]) ,
In this section, different taxonomies are proposed. The works
u
k=1
(98) are classified according to high-level features, the task that
s.t r(x, u) = 0, h(x, u) ≤ 0, they are addressing, and the type of system. No all the ref-
erences fall in a given category since some works are about
in which x[k] = τr [k] − s[k] is the difference between the learning techniques, optimization and others are related to
reference τr [k] and the state of the quadrotor s[k] at time computer vision.
k, r(x, u) are differential constraints related to the system, We begin presenting Table 1 that refers to the miscel-
h(x, u) are limits on states and inputs, and Q and R are pos- laneous topics followed in Sects. 4.1 and 4.2 organization.
itive semidefinite matrices that model costs. The expert is Those topics present high-level features that the works have
said to be privileged since, conversely to the sensorimotor in common. As previously structured, the works are split
controller, it has privileged access to ground-truth state esti- into pure deep learning approaches and a second group that
mates. To train the policy π of the sensorimotor controller, employs deep reinforcement learning. In Table 2, works
an iterative supervised learning process called DAGGER are classified according to the main task that the work
[138] (Dataset Aggregation) is employed. DAGGER collects addresses. The tasks are navigation, localization, manipu-
data iteratively, more precisely, the student controls the plat- lation, multi-robot, tracking and motion planning. Those
form, then DAGGER stores the observations with the expert’s works are prototypical representatives of those tasks. One can
actions and updates the controller of the student solving the observe that the tasks with more works are navigation and
following learning problem in a supervised manner motion planning. In Table 3, works are ordered according to
the type of system. The different types of systems are arms,
  terrestrial robots (without legs), aerial robots, autonomous
π = min Es[k]∼ρ(π ) [u ∗ (s[k]) − π̂(o[k])], (99) cars and robots with legs.
π̂
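The data-aggregation loop itself is compact; the following generic DAGGER-style sketch (written for illustration with hypothetical callables, not the code of [84]) captures the procedure described above.

```python
def dagger(student, expert_action, rollout, fit_supervised,
           iterations=10, episodes_per_iter=5):
    """Generic DAGGER-style imitation loop (illustrative sketch).

    expert_action(state)    -> privileged expert's action u*(s)
    rollout(policy)         -> (observations, states) visited while the student controls the platform
    fit_supervised(dataset) -> new student trained on (observation, action) pairs,
                               i.e. the regression problem of Eq. (99)
    """
    dataset = []
    for _ in range(iterations):
        for _ in range(episodes_per_iter):
            observations, states = rollout(student)          # student controls the platform
            dataset += [(o, expert_action(s)) for o, s in zip(observations, states)]
        student = fit_supervised(dataset)                     # aggregate-and-retrain step
    return student
```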
To deploy the learned policy to the physical world, abstraction is leveraged by operating the sensorimotor controller on an intermediate representation of the visual signal rather than on the raw images. More precisely, the network representing the sensorimotor controller receives the motion of prominent features in the image plane (obtained through a reduced version of PointNet [132]), in conjunction with IMU signals and the reference trajectory. The authors prove the effectiveness of using an abstract representation to ease the sim-to-real transition by Lemma 1. Finally, the proposed approach is tested in simulations and on a physical platform.

Lemma 1 A policy that acts on an abstract representation of the observation \pi_f : f(O) \to U has a lower simulation-to-reality gap than a policy \pi_o : O \to U that acts on raw observations.

Other risk-averse approaches include [133,152,192,194].

5 Taxonomies

In this section, different taxonomies are proposed. The works are classified according to high-level features, the task that they are addressing, and the type of system. Not all the references fall in a given category, since some works are about learning techniques or optimization, and others are related to computer vision.

We begin by presenting Table 1, which refers to the miscellaneous topics followed in the organization of Sects. 4.1 and 4.2. Those topics present high-level features that the works have in common. As previously structured, the works are split into pure deep learning approaches and a second group that employs deep reinforcement learning. In Table 2, works are classified according to the main task that the work addresses. The tasks are navigation, localization, manipulation, multi-robot, tracking, and motion planning. Those works are prototypical representatives of those tasks. One can observe that the tasks with more works are navigation and motion planning. In Table 3, works are ordered according to the type of system. The different types of systems are arms, terrestrial robots (without legs), aerial robots, autonomous cars, and robots with legs.
Table 1 High-level features

Deep learning
  Direct control: [3,11,92,96,105,126,154,188]
  Hybrid models: [14,47,56,63,64,78,81,82,112,185]
  Learning representations: [38,43,54,61,70,101,164,183]
  Sampling-based motion planners: [22,26,69,71,91,104,128–130,167,191]

Deep reinforcement learning
  Navigation and tracking: [16,27,28,42,49,107,108,163]
  Massive amounts of data and additional supervision: [13,21,31,75,79,99]
  Simulation-based learning: [2,12,21,30,86,116,158,172,193]
  Dynamic environments: [39,60,77,85,118,121,179]
  Risk-averse systems: [84,124,133,152,192,194]

Table 2 Tasks
  Navigation: [2,3,11,13,25,27,29,42,47,54,62,63,78,82,154,163]
  Localization: [3,15,131,189]
  Manipulation: [13,17,21,38,48,77,79,101,117,119,174,180]
  Multi-robot: [31,106,116,133,134,185]
  Tracking: [34,107,108]
  Motion planning: [22,26,28,42,43,49,69,71,91,104,126,129,167,177,180,185,191]

Table 3 Type of systems
  Arms: [13,21,38,48,77,79]
  Terrestrial robots: [54,62,63,96,126]
  Aerial robots: [51,84,105,113,116,120,154,159,175]
  Autonomous cars: [2,3,11,13,29,37,92,193]
  Robots with legs: [141,160,183]
We also analyzed the reviewed research work under additional perspectives. In particular, we considered:

– The followed RL approach: model-free value functions, policy gradient, actor-critic approaches, model-based RL, and hybrid approaches.
– Training and testing conditions: only simulation, sim-to-real, end-to-end, and other approaches.
– Aids used to learn faster: customized reward functions and reward shaping, distributed or parallel multi-agent approaches, model-based or pre-trained models, and domain adaptation or transfer learning.

More specifically, Table 4 compares the reviewed research papers in terms of the reinforcement learning strategy used to solve a particular task. We consider approaches that learn model-free Q-value functions, approaches that use policy gradient techniques to learn a policy, actor-critic methods that learn both a value function and a policy, model-based approaches, and hybrid methods that combine reinforcement learning with other techniques, in particular with sampling-based planners.

Table 4 An analysis of DRL approaches used in robotics
  Value function: DQN [15], QT-Opt [77,79], Q-learning [179], ensemble [124], Q-value [94], Rainbow DQN [156], DQN [89]
  Policy gradient: ADDPG [163], ADNet [16], TRPO [99], DPG [13], PG [2], PPO [21], PG [193], MVPPO risk-averse [194], REINFORCE [197]
  Actor-critic: A3C [32], A3C [49], A3C [107], Shaping [27], Actor-Critic-VaR [133], Actor-Critic-CVaR [152], Soft Actor-Critic [156]
  Model-based: PILCO [60], MAML [118], mDAGL [121], transition dynamics [192]
  With sampling-based planner: DDPG [28], DDPG [42]

As we can appreciate, there is a wide variety of approaches that have been used in DRL for robotics. Although it is not clearly reflected in Table 4, in general, approaches that learn a value function tend to use DQN or one of its variants, approaches using policy gradient tend to use PPO-based algorithms, and approaches following actor-critic tend to use A2C (Advantage Actor-Critic) or A3C. In general, model-based systems are more sample-efficient than model-free approaches. However, normally, the model is not available, and learning it is a challenging task, which is the main reason why model-free approaches are more commonly used. Policy learning systems tend to be less sample-efficient than value function approaches. However, they tend to be more stable, as they learn the policy directly.
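To make the distinction between these families concrete, the following minimal sketch (purely illustrative, not tied to any specific system reviewed here) contrasts a tabular Q-learning update with a REINFORCE policy-gradient update for a softmax policy over linear features.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Value-function approach: bootstrap the action-value of (s, a)."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q

def reinforce_update(theta, episode, features, alpha=0.01, gamma=0.99):
    """Policy-gradient approach: ascend the return-weighted log-likelihood.
    episode: list of (state, action, reward); features(s): feature vector of s;
    the policy is a softmax over the linear scores theta[a] . features(s)."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.insert(0, G)
    for (s, a, _), G in zip(episode, returns):
        phi = features(s)
        scores = theta @ phi
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        grad_log = -np.outer(probs, phi)     # d log pi(a|s) / d theta, softmax policy
        grad_log[a] += phi
        theta += alpha * G * grad_log
    return theta
```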
Table 5 Training/testing of DRL approaches used in robotics
  Simulation: A3C [49], model-based [121], DQN [124], Actor-Critic-VaR [133], Actor-Critic-CVaR [152], transition dynamics [192], MVPPO [194], Q-value [94], DQN [15]
  End-to-end: A3C [107], DDPG [27], QT-Opt [79], PG [2]
  Sim-to-Real: TRPO [99], PPO [21], PPO [116], ADDPG [163]
  Real-to-Sim: PG [193]
  Simulation and Real: PILCO [60], Q-learning [179], model-based [118]
  Offline RL: DPG [13]

Table 6 Aids for learning faster with DRL in robotics
  Customized reward function, reward shaping, intrinsic reward: Actor-Critic [27], A3C [107], DQN [103], SAC [190], PPO [18], SAC/PPO/A2C/Rainbow [147]
  Distributed/parallel/multi-agent/ensemble: DQN [124], Actor-Critic (multi-agent) [133], Actor-Critic distributed [152], QT-Opt [79]
  Pre-trained models: model-based RL [192], QT-Opt [79]
  Domain adaptation/transfer/lifelong learning/dynamic environments: PG [193], Q-learning [179], QT-Opt [77], PILCO [60], Q-value [94]
  Contrastive self-supervised learning: Actor-Critic and DQN [156], DQN [89], Rainbow [146]
  Human intervention: DQN [68], off-policy RL [98]
Table 5 describes different approaches used to train DRL systems. Some researchers test their proposals only in simulation environments. Others train in simulation and test in real-world environments, while some researchers follow an end-to-end approach in real-world applications, among other approaches.

Given the intrinsic difficulties of real-world applications, such as noise and uncertainty, expensive equipment, and the wear and tear of the equipment due to the need for large quantities of samples, it is not surprising to find a large body of research performed only in simulation. Using simulation to acquire data for training and then using the learned model in real-world conditions poses the challenge of the simulation gap. In general, we would like to have RL systems that could learn from few iterations for an effective deployment of applications.

Table 6 describes some of the approaches that have been followed to try to learn faster in DRL. In particular, many researchers use customized reward functions or reward shaping to help the agents solve their particular tasks. It is also common to use distributed approaches, where several agents are used to learn their tasks better and faster, and to use pre-trained models or initial policies to guide the learning process. Finally, transfer learning and domain adaptation are a very active research area aiming to reduce training times.

Since DRL approaches tend to be data-intensive and slow learners, several approaches have been used to be more data-efficient and accelerate the learning process. Giving some guidance to the learning agent through a more informative reward signal can help an agent to learn faster, using intrinsic rewards possibly in an unsupervised pre-training stage. Using multiple robots in parallel makes it possible to collect more data under different conditions to learn more effectively. Another approach is to use pre-trained models that can be used to learn more complex tasks. Recently, self-supervised learning has been used to learn high-level features from raw pixels that could be used to learn faster. Human interventions have also been used to accelerate the learning process. Finally, there is a common interest in using previously learned models to learn new tasks, which could be achieved through transfer learning and domain adaptation, a lifelong learning approach, or adaptation to changing conditions in the environment. All of these approaches are relevant to deploy robots under more realistic conditions.

6 Promising research directions

Although important advances have been achieved with deep learning and deep reinforcement learning in robotics in recent years, as shown in this paper, there are still several research directions that need to be tackled to develop more practical and useful systems.
One relevant aspect is how to learn from few examples, which is closer to human learning. This can involve, besides feedback in the form of rewards, the discovery of causal relations, the inclusion of reasoning capabilities, the formation of hypotheses, and directed experimentation aimed at hypothesis elimination. Recent advances in causal reinforcement learning (e.g., [7,32,50,65,109,197]) are a promising research area in this direction. Similarly, recent advances in neuro-symbolic approaches promise to produce explanation and reasoning capabilities that are currently beyond deep learning approaches (e.g., [93,110,135,136,148]). Another relevant area is how to build incremental learning methods that can reuse previous models and share information among different models. Transfer learning is an active research area in deep reinforcement learning relevant for this task (see [198]), but further research is needed for a more effective transfer and incremental learning process.

Particularly interesting and little explored is unsupervised or partially supervised learning. Here, the goal is to reduce the human intervention in the learning process and avoid, for instance, manually labeling images or other objects. For addressing this challenging case, it might be possible to learn from several automatic processes and combine them through a neural network architecture, in such a way that the combination has better performance than any single automatic process. A concrete example is a robotic arm that performs a manipulation task automatically using a geometric motion planning method or using a feedback-based control technique. The learning method will learn by demonstration, or in some other way, and will combine the results of those two techniques, producing a better performance than each of the approaches on their own.

Some exciting new research has been developing recently, in particular in the area of contrastive self-supervised learning, where the idea is to use self-defined pseudo-labels and data augmentation to learn representations that can be used for solving complex tasks (e.g., [23,24,57,74]). It has become dominant in computer vision and natural language processing and has recently been used in reinforcement learning (e.g., [89,146,156]). The idea is to extract high-level features from raw pixels using contrastive learning and learn an off-policy controller using those extracted features. This is certainly relevant for robotics, as high-level features can be automatically extracted and used to solve complex tasks without the need to collect large amounts of data.
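A minimal sketch of the contrastive idea follows: an InfoNCE-style loss over two augmented views of the same image batch, written for illustration with PyTorch. The encoder and the augmentations are placeholders, and this is not the code of the works cited above.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(encoder, view_a, view_b, temperature=0.1):
    """view_a, view_b: two augmented versions of the same image batch (B, C, H, W).
    Matching rows are positives; all other rows in the batch act as negatives."""
    z_a = F.normalize(encoder(view_a), dim=1)        # (B, D) embeddings
    z_b = F.normalize(encoder(view_b), dim=1)
    logits = z_a @ z_b.t() / temperature             # (B, B) pairwise similarities
    labels = torch.arange(z_a.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)           # pull matching pairs together
```

The resulting embeddings can then be fed to an off-policy learner, which is the pattern described above for extracting high-level features from raw pixels.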
Another relevant area in reinforcement learning is the trade-off between exploration and exploitation. Intrinsic motivation has been used in reinforcement learning to motivate exploration in interesting areas to discover new goals, and even new concepts and actions (e.g., [165]). Recently, intrinsic rewards have been used for unsupervised pre-training, where the idea is to introduce rewards based on a measure of novelty, for instance using the state entropy (e.g., [103,147]), or to first learn prototypical embeddings and then use intrinsic rewards to efficiently learn a policy (e.g., [190]). Another interesting approach to promote exploration while learning is proposed in [18], where an information-theoretic approach is used in an option discovery system. It is expected that new strategies to balance exploration and exploitation will be developed shortly.
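As an illustration of a novelty-based intrinsic reward, the sketch below computes a particle-based state-entropy bonus from k-nearest-neighbor distances in a representation space (a simplified version of the idea behind the entropy-based rewards cited above, not their exact implementations).

```python
import numpy as np

def state_entropy_bonus(embeddings, k=5):
    """embeddings: (N, D) array of encoded states visited by the agent.
    Returns one intrinsic reward per state, proportional to the log-distance
    to its k-th nearest neighbor: states in sparsely visited regions get
    larger bonuses, which encourages exploration."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                    # ignore distance to itself
    kth = np.sort(dists, axis=1)[:, k - 1]             # distance to the k-th neighbor
    return np.log(kth + 1.0)                           # particle-based entropy estimate
```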
Some researchers have explored how humans can intervene to accelerate the learning process. Previous work on imitation learning and behavioral cloning has used human traces of the target task to guide the agent in its learning process. Other researchers have included the user within the learning process, who can give feedback, for instance with voice commands, to provide guidance to the agent (e.g., [166]). Recently, researchers have provided feedback in terms of preferences between alternative performances, where the idea is to learn a reward model from the demonstrations consistent with the preferences of the user (e.g., [68,98]). A natural way for humans to learn is through feedback from other humans, so it is expected that different forms of human intervention will be developed in RL.
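The core of preference-based reward learning can be written compactly; the following sketch (a generic Bradley–Terry-style loss for a learned reward model, offered as an illustration rather than the implementation of [68,98]) scores two trajectory segments and prefers the one labeled better by the user.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, segment_a, segment_b, human_prefers_a):
    """segment_a / segment_b: tensors of shape (T, obs_dim) holding the observations
    of two behavior clips shown to the user; human_prefers_a: bool label.
    The reward model is trained so that the preferred clip has the higher
    predicted return (Bradley-Terry model of the preference)."""
    return_a = reward_model(segment_a).sum()
    return_b = reward_model(segment_b).sum()
    logits = torch.stack([return_a, return_b]).unsqueeze(0)      # (1, 2)
    target = torch.tensor([0 if human_prefers_a else 1])         # index of the preferred clip
    return F.cross_entropy(logits, target)
```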
Additionally, it is important to have models that could be aware of when they are no longer working and need adjustments, which could imply new states, new ways of representing states, and possibly new actions. There is some work in deep Bayesian learning (e.g., [178]) that can be used to recognize when the model should not be trusted, but further research work is needed to change, if necessary, the representation space. Failures in the desired task can also be detected automatically, by sensing conditions that indicate, with high certainty, that the task is not accomplished. A specific example is as follows: an autonomous car uses a camera to keep the vehicle inside the road, but an inertial sensor will detect whether the car is traveling on bumpy terrain and is therefore very likely outside the road. We also believe that in robotics it is essential that the machine learns directly from interacting with its environment, such that it empirically validates its ability to perform the task. A drawback of this scheme is that the robot's safety can be compromised. An alternative is to easily create realistic simulated environments such that the learning can be transferred from the simulated environment to a real one.

Finally, we would like to be able to learn several tasks concurrently. Work on lifelong learning (e.g., [94]) aims to solve this task, but it still needs to solve several challenges.

7 Conclusions

Applying machine learning techniques to robotics has been an appealing area of research, as it eases the programming of robots and widens their potential applications. Learning in robotics, however, has not been an easy task due to the large number of variables involved, the continuous nature of the problem, and prevalent noisy conditions with uncertainty. Reinforcement learning seems like a natural paradigm to use in robotics, as it solves sequential decision processes under uncertainty via a trial-and-error process, which fits well with a robotics setting. Based on the success of these two approaches to address robotic tasks, this work was developed having three main objectives in mind: (1) to present a brief tutorial on deep reinforcement learning, (2) to describe the main, recent, and most promising approaches of deep learning and deep reinforcement learning in robotics, and (3) to propose promising research directions in this area.

Additionally, this article presented a series of taxonomies that offer the reader comparisons between the referenced works from different perspectives, including: whether they utilize DL or DRL techniques, the specific task that the works are addressing, the type of robotic system employed in the work, the reinforcement learning strategy used to solve a particular task, whether the learning process is carried out in simulation or in real settings, and the aids used to learn faster in DRL.

It is interesting to observe that works in robotics that use learning tools have naturally evolved in the same order as the development of DL and DRL; namely, the oldest works merely used DL techniques, while there is a growing recent trend to use DRL over DL. There are also predominant tasks that have been addressed utilizing DL and DRL, for instance, navigation, manipulation, and motion planning. Concerning the most used reinforcement learning techniques, approaches that learn a value function are inclined to use DQN or its variants, approaches using policy gradient tend to use PPO-based algorithms, and approaches following actor-critic notions are inclined to use A2C or A3C. Moreover, model-free approaches are more commonly used than model-based systems, since learning models is a challenging task. It is worth mentioning that policy learning systems tend to be less sample efficient than value function approaches, but are more stable. There is also a larger amount of work that is performed only in simulation, due to the complexity of obtaining large quantities of samples in real-world setups. Lastly, several approaches have been proposed to speed up the learning process, ranging from customized reward functions to pre-trained models, or even human intervention.

In regard to potential future work, recent advances in deep learning have allowed researchers to use raw inputs directly to create powerful models. The combination of deep models with reinforcement learning is certainly advancing, at an accelerated pace, the area of robotics. This combination allows the raw data from the robot's sensors to be fed directly into a learning framework that is capable of solving sequential tasks under uncertainty. This, however, poses several challenges regarding the amounts of data and the long training times, among others, required for learning suitable models. Moreover, interesting concepts such as contrastive self-supervised learning, intrinsic motivation, or how to make humans intervene to cleverly accelerate the learning are yet to be exploited to their full potential.

Ultimately, deep reinforcement learning is currently a very active research area with constant new developments. We believe that this paper can help to understand this trend as well as the different scientific and practical challenges that need to be solved. Still, more research is needed in unsupervised or partially supervised learning. It is also necessary to have models that could be aware of when they are no longer working correctly. Other aspects open for further investigation are the ability of robots to learn directly from interacting with their environment and the creation of realistic simulated environments such that the learning can be transferred from the simulated environment to the real world.

Author Contributions E. F. Morales and R. Murrieta-Cid contributed to the article conception and design. All authors performed the literature search, drafted, and critically revised the work.

Funding This work was supported by CONACyT Alliance of Artificial Intelligence, Cátedras-CONACyT project 745 and Intel Corporation.

Declarations

Conflict of interest The authors have no relevant financial or nonfinancial interests to disclose.

References

1. Abate A, Prandini M, Lygeros J, Sastry S (2008) Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems. Automatica 44(11):2724–2734
2. Amini A, Gilitschenski I, Phillips J, Moseyko J, Banerjee R, Karaman S, Rus D (2020) Learning robust control policies for end-to-end autonomous driving from data-driven simulation. IEEE Robot Autom Lett RA-L 5(2):1143–1150
3. Amini A, Rosman G, Karaman S, Rus D (2019) Variational end-to-end navigation and localization. In: IEEE international conference on robotics and automation (ICRA), pp 8958–8964
4. Andrychowicz M, Wolski F, Ray A, Schneider J, Fong R, Welinder P, McGrew B, Tobin J, Pieter Abbeel O, Zaremba W (2017) Hindsight experience replay, pp 5048–5058
5. Asseman A, Kornuta T, Ozcan A (2018) Learning beyond simulated physics
6. Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495
7. Bareinboim E, Forney A, Pearl J (2015) Bandits with unobserved confounders: a causal approach. Adv Neural Inf Process Syst NIPS 28:1342–1350
8. Barth-Maron G, Hoffman MW, Budden D, Dabney W, Horgan D, Tb D, Muldal A, Heess N, Lillicrap T (2018) Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617
9. Bellemare MG, Dabney W, Munos R (2017) A distributional perspective on reinforcement learning. In: International conference on machine learning (ICML), pp 449–458
10. Bengio Y, Louradour J, Collobert R, Weston J (2009) Curriculum learning. In: International conference on machine learning (ICML), pp 41–48

123
Intelligent Service Robotics (2021) 14:773–805 801

11. Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp 30. Crosby M, Beyret B, Halina M (2019) The animal-ai olympics.
B, Goyal P, Jackel LD, Monfort M, Muller U, Zhang J et al Nat Mach Intell 1(5):257–257
(2016) End to end learning for self-driving cars. arXiv preprint 31. Dasari S, Ebert F, Tian S, Nair S, Bucher B, Schmeckpeper K,
arXiv:1604.07316 Singh S, Levine S, Finn C (2019) Robonet: large-scale multi-
12. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman robot learning. arXiv preprint arXiv:1910.11215
J, Tang J, Zaremba W (2016) OpenAI gym. arXiv preprint 32. Dasgupta I, Wang J, Chiappa S, Mitrovic J, Ortega P, Raposo
arXiv:1606.01540 D, Hughes E, Battaglia P, Botvinick M, Kurth-Nelson Z
13. Cabi S, Colmenarejo SG, Novikov A, Konyushkova K, Reed S, (2019) Causal reasoning from meta-reinforcement learning. arXiv
Jeong R, Zolna K, Aytar Y, Budden D, Vecerik M et al (2019) preprint arXiv:1901.08162
Scaling data-driven robotics with reward sketching and batch rein- 33. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009)
forcement learning. arXiv preprint arXiv:1909.12200 ImageNet: a large-scale hierarchical image database. In: IEEE
14. Cai P, Wang S, Sun Y, Liu M (2020) Probabilistic end-to-end conference on computer vision and pattern recognition (CVPR),
vehicle navigation in complex dynamic environments with mul- pp 248–255
timodal sensor fusion. IEEE Robot Autom Lett 5(3):4218–4224 34. Devo A, Dionigi A, Costante G (2021) Enhancing continuous con-
15. Caicedo JC, Lazebnik S (2015) Active object localization with trol of mobile robots for end-to-end visual active tracking. Robot
deep reinforcement learning. In: International conference on com- Autonom Syst. https://doi.org/10.1016/j.robot.2021.103799
puter vision, pp 2488–2496 35. Dolgov D, Thrun S, Montemerlo M, Diebel J (2010) Path planning
16. Caicedo JC, Lazebnik S (2017) Action-decision networks for for autonomous vehicles in unknown semi-structured environ-
visual tracking with deep reinforcement learning. In: IEEE inter- ments. Int J Robot Res 29(5):485–501
national conference on imaging, vision and pattern recognition, 36. Dosovitskiy A, Fischer P, Ilg E, Hausser P, Hazirbas C, Golkov V,
pp 2711–2720 Van Der Smagt P, Cremers D, Brox T (2015) FlowNet: learning
17. Calli B, Singh A, Walsman A, Srinivasa S, Abbeel P, Dollar AM optical flow with convolutional networks. In: IEEE international
(2015) The YCB object and model set: towards common bench- conference on computer vision (ICCV), pp 2758–2766
marks for manipulation research. In: International conference on 37. Dosovitskiy A, Ros G, Codevilla F, Lopez A, Koltun V (2017)
advanced robotics (ICAR), pp 510–517 Carla: an open urban driving simulator. In: Conference on robot
18. Campos V, Trott A, Xiong C, Socher R, Giró-i Nieto X, Tor- learning (CoRL), pp 1–16
res J (2020) Explore, discover and learn: Unsupervised discovery 38. Driess D, Oguz O, Ha JS, Toussaint M (2020) Deep visual
of state-covering skills. In: International conference on machine heuristics: learning feasibility of mixed-integer programs for
learning. PMLR, pp 1317–1327 manipulation planning. In: 2020 IEEE international conference
19. Canny J (1988) Some algebraic and geometric computations in on robotics and automation (ICRA). IEEE, pp 9563–9569
PSPACE. In: ACM symposium on theory of computing (STOC), 39. Elwell R, Polikar R (2011) Incremental learning of concept
pp 460–467 drift in nonstationary environments. IEEE Trans Neural Netw
20. Canny J, Reif J (1987) New lower bound techniques for robot 22(10):1517–1531. https://doi.org/10.1109/TNN.2011.2160459
motion planning problems. In: IEEE symposium on foundations 40. Fabisch A, Petzoldt C, Otto M, Kirchner F (2019) A survey of
of computer science (FOCS), pp 49–60 behavior learning applications in robotics-state of the art and per-
21. Chebotar Y, Handa A, Makoviychuk V, Macklin M, Issac J, Ratliff spectives. arXiv preprint arXiv:1906.01868
N, Fox D (2019) Closing the sim-to-real loop: adapting simulation 41. Fairbank M, Alonso E (2012) The divergence of reinforcement
randomization with real world experience. In: IEEE international learning algorithms with value-iteration and function approxima-
conference on robotics and automation (ICRA), pp 8973–8979 tion. In: IEEE international joint conference on neural networks
22. Chen B, Dai B, Lin Q, Ye G, Liu H, Song L (2019) Learning to (IJCNN), pp 1–8
plan in high dimensions via neural exploration-exploitation trees. 42. Faust A, Oslund K, Ramirez O, Francis A, Tapia L, Fiser
In: International conference on learning representations (ICLR) M, Davidson J (2018) PRM-RL: long-range robotic navigation
23. Chen T, Zhai X, Ritter M, Lucic M, Houlsby N (2019) Self- tasks by combining reinforcement learning and sampling-based
supervised GANs via auxiliary rotation loss. In: Proceedings of planning. In: IEEE international conference on robotics and
the IEEE/CVF conference on computer vision and pattern recog- automation (ICRA), pp 5113–5120
nition, pp 12154–12163 43. Fernández IMR, Sutanto G, Englert P, Ramachandran RK,
24. Chen X, Fan H, Girshick R, He K (2020) Improved base- Sukhatme GS (2020) Learning manifolds for sequential motion
lines with momentum contrastive learning. arXiv preprint planning. arXiv preprint arXiv:2006.07746
arXiv:2003.04297 44. Fortunato M, Azar MG, Piot B, Menick J, Hessel M, Osband
25. Chiang HT, Malone N, Lesser K, Oishi M, Tapia L (2015) Aggres- I, Graves A, Mnih V, Munos R, Hassabis D et al (2018) Noisy
sive moving obstacle avoidance using a stochastic reachable set networks for exploration. In: International conference on learning
based potential field. In: International workshop on the algorith- representations (ICLR)
mic foundations of robotics (WAFR), pp 73–89 45. Fox D (2001) KLD-sampling: adaptive particle filters, pp 713–720
26. Chiang HTL, Faust A, Sugaya S, Tapia L (2018) Fast swept vol- 46. Fujimoto S, Hoof H, Meger D (2018) Addressing function approx-
ume estimation with deep learning. In: International workshop on imation error in actor-critic methods. In: International conference
the algorithmic foundations of robotics (WAFR), pp 52–68 on machine learning. PMLR, pp 1587–1596
27. Chiang HTL, Faust M, Fiser M, Frances A (2019) Learning nav- 47. Gao W, Hsu D, Lee WS, Shen S, Subramanian K (2017) Intention-
igation behaviors end to end with auto-RL. IEEE Robot Autom net: integrating planning and deep learning for goal-directed
Lett 56:2007–2014 autonomous navigation. In: Conference on robot learning (CoRL),
28. Chiang HTL, Hsu J, Fiser M, Tapia L, Faust A (2019) RL-RRT: pp 185–194
kinodynamic motion planning via learning reachability estimators 48. Garcia Cifuentes C, Issac J, Wüthrich M, Schaal S, Bohg J (2016)
from RL policies. IEEE Robot Autom Lett 4(4):4298–4305 Probabilistic articulated real-time tracking for robot manipulation.
29. Codevilla F, Miiller M, López A, Koltun V, Dosovitskiy A (2018) IEEE Robot Autom Lett RA-L 2(2):577–584
End-to-end driving via conditional imitation learning. In: IEEE 49. Garg A, Chiang HTL, Sugaya S, Faust A, Tapia L (2019) Compar-
international conference on robotics and automation (ICRA), pp ison of deep reinforcement learning policies to formal methods
1–9

123
802 Intelligent Service Robotics (2021) 14:773–805

for moving obstacle avoidance. In: IEEE/RSJ international con- 69. Ichter B, Harrison J, Pavone M (2018) Learning sampling distribu-
ference on intelligent robots and systems (IROS), pp 3534–3541 tions for robot motion planning. In: IEEE international conference
50. Gershman SJ (2017) Reinforcement learning and causal models. on robotics and automation (ICRA), pp 7087–7094
The Oxford handbook of causal reasoning, p 295 70. Ichter B, Pavone M (2019) Robot motion planning in learned
51. Gonzalez-Trejo J, Mercado-Ravell DA, Becerra I, Murrieta-Cid latent spaces. IEEE Robot Autom Lett 4(3):2407–2414
R (2021) On the visual-based safe landing of UAVS in populated 71. Ichter B, Schmerling E, Lee TWE, Faust A (2020) Learned criti-
areas: a crucial aspect for urban deployment. IEEE Robot Autom cal probabilistic roadmaps for robotic motion planning. In: IEEE
Lett. https://doi.org/10.1109/LRA.2021.3101861 international conference on robotics and automation (ICRA), pp
52. Goodfellow I, Bengio Y, Courville A, Bengio Y (2016) Deep 9535–9541
learning, vol 1. MIT Press, Cambridge 72. Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T
53. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley (2017) Flownet 2.0: evolution of optical flow estimation with deep
D, Ozair S, Courville A, Bengio Y (2014) Generative adversar- networks. In: IEEE conference on computer vision and pattern
ial nets. In: Advances in neural information processing systems recognition (CVPR), pp 2462–2470
(NIPS), pp 2672–2680 73. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep
54. Hadsell R, Sermanet P, Ben J, Erkan A, Scoffier M, Kavukcuoglu network training by reducing internal covariate shift. In: Interna-
K, Muller U, LeCun Y (2009) Learning long-range vision for tional conference on machine learning. PMLR, pp 448–456
autonomous off-road driving. J Field Robot 26(2):120–144 74. Jaiswal A, Babu AR, Zadeh MZ, Banerjee D, Makedon F (2021)
55. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for A survey on contrastive self-supervised learning. Technologies
image recognition. In: IEEE conference on computer vision and 9(1):2
pattern recognition (CVPR), pp 770–778 75. James S, Ma Z, Arrojo DR, Davison AJ (2020) RLBench: the
56. Heiden E, Millard D, Coumans E, Sukhatme GS (2020) Aug- robot learning benchmark & learning environment. IEEE Robot
menting differentiable simulators with neural networks to close Autom Lett 5(2):3019–3026
the Sim2Real gap. arXiv preprint arXiv:2007.06045 76. Jing L, Tian Y (2020) Self-supervised visual feature learning with
57. Henaff O (2020) Data-efficient image recognition with contrastive deep neural networks: a survey. IEEE Trans Pattern Anal Mach
predictive coding. In: International conference on machine learn- Intell
ing. PMLR, pp 4182–4192 77. Julian R, Swanson B, Sukhatme GS, Levine S, Finn C, Hausman
58. Hernandez-Garcia JF, Sutton RS (2019) Understanding multi-step K (2020) Efficient adaptation for end-to-end vision-based robotic
deep reinforcement learning: a systematic study of the DQN tar- manipulation. arXiv preprint arXiv:2004.10190
get. arXiv preprint arXiv:1901.07510 78. Kahn G, Abbeel P, Levine S (2020) Badgr: an autonomous
59. Hessel M, Modayil J, van Hasselt H, Schaul T, Ostrovski G, self-supervised learning-based navigation system. arXiv preprint
Dabney W, Horgan D, Piot B, Azar MG, Silver D (2018) Rain- arXiv:2002.05700
bow: combining improvements in deep reinforcement learning. 79. Kalashnikov D, Irpan A, Pastor P, Ibarz J, Herzog A, Jang E,
In: AAAI conference on artificial intelligence Quillen D, Holly E, Kalakrishnan M, Vanhoucke V et al (2018)
60. Higuera JCG, Meger D, Dudek G (2017) Adapting learned Scalable deep reinforcement learning for vision-based robotic
robotics behaviours through policy adjustment. In: IEEE inter- manipulation. In: Conference on robot learning (CoRL), pp 651–
national conference on robotics and automation (ICRA), pp 673
5837–5843 80. Kapturowski S, Ostrovski G, Quan J, Munos R, Dabney W (2018)
61. Hirose N, Sadeghian A, Vázquez M, Goebel P, Savarese S (2018) Recurrent experience replay in distributed reinforcement learning.
Gonet: a semi-supervised deep learning approach for traversabil- In: International conference on learning representations (ICLR)
ity estimation. In: 2018 IEEE/RSJ international conference on 81. Karkus P, Hsu D, Lee WS (2017) Qmdp-net: deep learning for
intelligent robots and systems (IROS), pp 3044–3051 planning under partial observability. In: Advances in neural infor-
62. Hirose N, Sadeghian A, Vázquez M, Goebel P, Savarese S mation processing systems (NIPS), pp 4697–4707
(2018) Gonet: a semi-supervised deep learning approach for 82. Karkus P, Ma X, Hsu D, Kaelbling LP, Lee WS, Lozano-Pérez T
traversability estimation. In: IEEE/RSJ international conference (2019) Differentiable algorithm networks for composable robot
on intelligent robots and systems (IROS), pp 3044–3051 learning. arXiv preprint arXiv:1905.11602
63. Hirose N, Sadeghian A, Xia F, Martín-Martín R, Savarese S 83. Károly AI, Galambos P, Kuti J, Rudas IJ (2020) Deep learning in
(2019) VUNet: dynamic scene view synthesis for traversabil- robotics: survey on model structures and training strategies. IEEE
ity estimation using an RGB camera. IEEE Robot Autom Lett Trans Syst Man Cybern Syst
4(2):2062–2069 84. Kaufmann E, Loquercio A, Ranftl R, Müller M, Koltun V,
64. Hirose N, Xia F, Martín-Martín R, Sadeghian A, Savarese S (2019) Scaramuzza D (2020) Deep drone acrobatics. arXiv preprint
Deep visual MPC-policy learning for navigation. IEEE Robot arXiv:2006.05768
Autom Lett RA-L 4(4):3184–3191 85. Kaushik R, Desreumaux P, Mouret JB (2020) Adaptive prior
65. Ho SB (2017) Causal learning versus reinforcement learning for selection for repertoire-based online adaptation in robotics. Front
knowledge learning and problem solving. In: AAAI workshops Robot AI 6:151. https://doi.org/10.3389/frobt.2019.00151
66. Hoffman J, Tzeng E, Park T, Zhu JY, Isola P, Saenko K, Efros 86. Kirtas M, Tsampazis K, Passalis N, Tefas A (2020) Deep-
A, Darrell T (2018) Cycada: cycle-consistent adversarial domain bots: a webots-based deep reinforcement learning framework for
adaptation. In: International conference on machine learning robotics. In: IFIP international conference on artificial intelligence
(ICML), pp 1989–1998 applications and innovations. Springer, pp 64–75
67. Horgan D, Quan J, Budden D, Barth-Maron G, Hessel M, van Has- 87. Kober J, Bagnell JA, Peters J (2013) Reinforcement learning in
selt H, Silver D (2018) Distributed prioritized experience replay. robotics: a survey. Int J Robot Res 32(11):1238–1274
In: International conference on learning representations (ICLR) 88. Kong J, Pfeiffer M, Schildbach G, Borrelli F (2015) Kinematic and
68. Ibarz B, Leike J, Pohlen T, Irving G, Legg S, Amodei D (2018) dynamic vehicle models for autonomous driving control design.
Reward learning from human preferences and demonstrations in In: IEEE Intelligent vehicles symposium (IV), pp 1094–1099
Atari. arXiv preprint arXiv:1811.06521 89. Kostrikov I, Yarats D, Fergus R (2020) Image augmentation is all
you need: regularizing deep reinforcement learning from pixels.
arXiv preprint arXiv:2004.13649

123
Intelligent Service Robotics (2021) 14:773–805 803

90. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classifi- 110. Mao J, Gan C, Kohli P, Tenenbaum JB, Wu J (2019) The neuro-
cation with deep convolutional neural networks. Commun ACM symbolic concept learner: interpreting scenes, words, and sen-
60(6):84–90 tences from natural supervision. arXiv preprint arXiv:1904.12584
91. Kumar R, Mandalika A, Choudhury S, Srinivasa S (2019) Lego: 111. McCarty SL, Burke LM, McGuire M (2018) Parallel mono-
leveraging experience in roadmap generation for sampling-based tonic basin hopping for low thrust trajectory optimization. In:
planning. In: IEEE/RSJ international conference on intelligent AAS/AIAA space flight mechanics meeting, p 1452
robots and systems (IROS), pp 1488–1495 112. Mendoza M, Vasquez-Gomez JI, Taud H, Sucar LE, Reta C (2020)
92. Kuutti S, Bowden R, Jin Y, Barber P, Fallah S (2020) A survey of Supervised learning of the next-best-view for 3D object recon-
deep learning applications to autonomous vehicle control. IEEE struction. Pattern Recognit Lett 133:224–231
Trans Intell Transp Syst 113. Merkt WX, Ivan V, Dinev T, Havoutis I, Vijayakumar S (2021)
93. Lamb L, Garcez A, Gori M, Prates M, Avelar P, Vardi M (2020) Memory clustering using persistent homology for multimodality-
Graph neural networks meet neural-symbolic computing: A sur- and discontinuity-sensitive learning of optimal control warm-
vey and perspective. arXiv preprint arXiv:2003.00330 starts. IEEE Trans Robot. https://doi.org/10.1109/TRO.2021.
94. Lecarpentier E, Abel D, Asadi K, Jinnai Y, Rachelson E, Littman 3069132
ML (2020) Lipschitz lifelong reinforcement learning. arXiv 114. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T,
preprint arXiv:2001.05411 Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep
95. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature reinforcement learning. In: International conference on machine
521(7553):436–444 learning (ICML), pp 1928–1937
96. LeCun Y, Muller U, Ben J, Cosatto E, Flepp B (2006) Off-road 115. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare
obstacle avoidance through end-to-end learning. In: Advances in MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al
neural information processing systems (NIPS), pp 739–746 (2015) Human-level control through deep reinforcement learning.
97. Lee JH, Han MK, Ko DW, Suh IH (2019) From big to small: multi- Nature 518(7540):529–533
scale local planar guidance for monocular depth estimation. arXiv 116. Molchanov A, Chen T, Hönig W, Preiss JA, Ayanian N, Sukhatme
preprint arXiv:1907.10326 GS (2019) Sim-to-(multi)-real: transfer of low-level robust control
98. Lee K, Smith L, Abbeel P (2021) Pebble: feedback-efficient policies to multiple quadrotors. arXiv preprint arXiv:1903.04628
interactive reinforcement learning via relabeling experience and 117. Morgan AS, Bircher WG, Dollar AM (2021) Towards generalized
unsupervised pre-training. arXiv preprint arXiv:2106.05091 manipulation learning through grasp mechanics-based features
99. Lee MA, Zhu Y, Srinivasan K, Shah P, Savarese S, Fei-Fei L, and self-supervision. IEEE Trans Robot. https://doi.org/10.1109/
Garg A, Bohg J (2019) Making sense of vision and touch: self- TRO.2021.3057802
supervised learning of multimodal representations for contact- 118. Nagabandi A, Finn C, Levine S (2019) Deep online learning via
rich tasks. In: IEEE international conference on robotics and meta-learning: continual adaptation for model-based RL
automation (ICRA), pp 8943–8950 119. Nagabandi A, Konolige K, Levine S, Kumar V (2020) Deep
100. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver dynamics models for learning dexterous manipulation. In: Con-
D, Wierstra D (2016) Continuous control with deep reinforcement ference on robot learning (CoRL), pp 1101–1112
learning. In: International conference on learning representations 120. Nagami K, Schwager M (2021) Hjb-rl: Initializing reinforcement
(ICLR)—poster learning with optimal control policies applied to autonomous
101. Lippi M, Poklukar P, Welle MC, Varava A, Yin H, Marino A, drone racing. In: Robotics: science and systems, pp 1–9
Kragic D (2020) Latent space roadmap for visual action plan- 121. Nguyen TT, Silander T, Li Z, Leong TY (2017) Scalable trans-
ning of deformable and rigid object manipulation. In: IEEE/RSJ fer learning in heterogeneous, dynamic environments. Artif Intell
international conference on intelligent robots and systems (IROS) 247:70–94. https://doi.org/10.1016/j.artint.2015.09.013
102. Liu F, Shen C, Lin G, Reid I (2015) Learning depth from single 122. Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A,
monocular images using deep convolutional neural fields. IEEE Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: a
Trans Pattern Anal Mach Intell 38(10):2024–2039 generative model for raw audio. arXiv preprint arXiv:1609.03499
103. Liu H, Abbeel P (2021) Behavior from the void: Unsupervised 123. Owens A, Efros AA (2018) Audio-visual scene analysis with
active pre-training self-supervised multisensory features. In: European conference
104. Liu K, Stadler M, Roy N (2020) Learned sampling distributions on computer vision (ECCV), pp 631–648
for efficient planning in hybrid geometric and object-level rep- 124. Pan X, Seita D, Gao Y, Canny J (2019) Risk averse robust adver-
resentations. In: IEEE international conference on robotics and sarial reinforcement learning. In: IEEE international conference
automation (ICRA), pp 9555–9562 on robotics and automation (ICRA), pp 8522–8528
105. Loquercio A, Maqueda AI, Del-Blanco CR, Scaramuzza D (2018) 125. Park D, Hoshi Y, Kemp CC (2018) A multimodal anomaly detec-
DroNet: learning to fly by driving. IEEE Robot Autom Lett tor for robot-assisted feeding using an LSTM-based variational
3(2):1088–1095 autoencoder. IEEE Robot Autom Lett RA-L 3(3):1544–1551
106. Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I (2017) 126. Pfeiffer M, Schaeuble M, Nieto J, Siegwart R, Cadena C (2017)
Multi-agent actor-critic for mixed cooperative-competitive envi- From perception to decision: a data-driven approach to end-to-
ronments. arXiv preprint arXiv:1706.02275 end motion planning for autonomous ground robots. In: IEEE
107. Luo W, Sun P, Zhong F, Liu W, Zhang T, Wang Y (2018) international conference on robotics and automation (ICRA), pp
End-to-end active object tracking via reinforcement learning. In: 1527–1533
International conference on machine learning, pp 3286–3295 127. Puterman ML (2014) Markov decision processes: discrete
108. Luo W, Sun P, Zhong F, Liu W, Zhang T, Wang Y (2019) End- stochastic dynamic programming. Wiley, New York
to-end active object tracking and its real-world deployment via 128. Qureshi AH, Miao Y, Simeonov A, Yip MC (2021) Motion plan-
reinforcement learning. IEEE Trans Pattern Anal Mach Intell ning networks: bridging the gap between learning-based and
42(6):1317–1332 classical motion planners. IEEE Trans Robot
109. Madumal P, Miller T, Sonenberg L, Vetere F (2020) Explainable 129. Qureshi AH, Simeonov A, Bency MJ, Yip MC (2019) Motion
reinforcement learning through a causal lens. In: Proceedings of planning networks. In: IEEE international conference on robotics
the AAAI conference on artificial intelligence, vol 34, pp 2493– and automation (ICRA), pp 2118–2124
2500

123
804 Intelligent Service Robotics (2021) 14:773–805

130. Qureshi AH, Yip MC (2018) Deeply informed neural sampling 151. Simonyan K, Zisserman A (2015) Very deep convolutional
for robot motion planning. In: IEEE/RSJ international conference networks for large-scale image recognition. In: International con-
on intelligent robots and systems (IROS), pp 6582–6588 ference on learning representations (ICLR)
131. Radwan N, Valada A, Burgard W (2018) Vlocnet++: deep multi- 152. Singh R, Zhang Q, Chen Y (2020) Improving robustness via
task learning for semantic visual localization and odometry. IEEE risk averse distributional reinforcement learning. In: Learning for
Robot Autom Lett 3(4):4407–4414 dynamics and control. PMLR, pp 958–968
132. Ranftl R, Koltun V (2018) Deep fundamental matrix estimation. 153. Singh S, Jaakkola T, Littman ML, Szepesvári C (2000) Con-
In: European conference on computer vision (ECCV), pp 284–299 vergence results for single-step on-policy reinforcement-learning
133. Reddy DSK, Saha A, Tamilselvam SG, Agrawal P, Dayama algorithms. Mach Learn 38(3):287–308
P (2019) Risk averse reinforcement learning for mixed multi- 154. Smolyanskiy N, Kamenev A, Smith J, Birchfield S (2017) Toward
agent environments. In: Proceedings of the 18th international low-flying autonomous mav trail navigation using deep neural net-
conference on autonomous agents and multiagent systems, pp works for environmental awareness. In: IEEE/RSJ international
2171–2173 conference on intelligent robots and systems (IROS), pp 4241–
134. Ribeiro EG, de Queiroz Mendes R, Grassi V (2021) Real-time 4247
deep learning approach to visual servo control and grasp detection 155. Sohn K, Lee H, Yan X (2015) Learning structured output repre-
for autonomous robotic manipulation. Robot Auton Syst. https:// sentation using deep conditional generative models. Adv Neural
doi.org/10.1016/j.robot.2021.103757 Inf Process Syst NIPS 28:3483–3491
135. Richardson M, Domingos P (2006) Markov logic networks. Mach 156. Srinivas A, Laskin M, Abbeel P (2020) Curl: contrastive unsuper-
Learn 62(1–2):107–136 vised representations for reinforcement learning. arXiv preprint
136. Riegel R, Gray A, Luus F, Khan N, Makondo N, Akhalwaya IY, arXiv:2004.04136
Qian H, Fagin R, Barahona F, Sharma U et al (2020) Logical 157. Sukhbaatar S, Lin Z, Kostrikov I, Synnaeve G, Szlam A, Fer-
neural networks. arXiv preprint arXiv:2006.13155 gus R (2018) Intrinsic motivation and automatic curricula via
137. Rifai S, Vincent P, Muller X, Glorot X, Bengio Y (2011) Contrac- asymmetric self-play. In: International conference on learning
tive auto-encoders: explicit invariance during feature extraction. representations (ICLR)
In: International conference on machine learning (ICML) 158. Sun T, Gong L, Li X, Xie S, Chen Z, Hu Q, Filliat D (2021)
138. Ross S, Gordon G, Bagnell D (2011) A reduction of imitation Robotdrlsim: a real time robot simulation platform for reinforce-
learning and structured prediction to no-regret online learning. ment learning and human interactive demonstration learning. J
In: International conference on artificial intelligence and statistics Phys Conf Ser 1746:012035
(AISTATS), pp 627–635 159. Sun Z, Li F, Duan X, Jin L, Lian Y, Liu S, Liu K (2021) Deep
139. Rubinstein RY, Kroese DP (2013) The cross-entropy method: reinforcement learning for quadrotor path following with adap-
a unified approach to combinatorial optimization, Monte-Carlo tive velocity. Auton Robots 45:119–134. https://doi.org/10.1016/
simulation and machine learning. Springer, New York j.robot.2021.103757
140. Ruder M, Dosovitskiy A, Brox T (2018) Artistic style transfer 160. Sun Z, Li F, Duan X, Jin L, Lian Y, Liu S, Liu K (2021) A
for videos and spherical images. Int J Comput Vis 126(11):1199– novel adaptive iterative learning control approach and human-
1219 in-the-loop control pattern for lower limb rehabilitation robot in
141. Rudin N, Kolvenbach H, Tsounis V, Hutter M (2021) Cat-like disturbances environment. Auton Robots 45:595–610. https://doi.
jumping and landing of legged robots in low gravity using deep org/10.1016/j.robot.2021.103757
reinforcement learning. IEEE Trans Robot. https://doi.org/10. 161. Sünderhauf N, Brock O, Scheirer W, Hadsell R, Fox D, Leitner
1109/TRO.2021.3084374 J, Upcroft B, Abbeel P, Burgard W, Milford M et al (2018) The
142. Schaul T, Horgan D, Gregor K, Silver D (2015) Universal value limits and potentials of deep learning for robotics. Int J Robot Res
function approximators. In: International conference on machine 37(4–5):405–420
learning (ICML), pp 1312–1320 162. Sutton RS, Barto AG (2018) Reinforcement learning: an intro-
143. Schaul T, Quan J, Antonoglou I, Silver D (2015) Prioritized expe- duction. MIT Press, Cambridge
rience replay. arXiv preprint arXiv:1511.05952 163. Tai L, Paolo G, Liu M (2017) Virtual-to-real deep reinforcement
144. Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) learning: continuous control of mobile robots for mapless naviga-
Trust region policy optimization. In: International conference on tion. In: IEEE/RSJ international conference on intelligent robots
machine learning (ICML), pp 1889–1897 and systems (IROS), pp 31–36
145. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O 164. Tang G, Hauser K (2019) Discontinuity-sensitive optimal control
(2017) Proximal policy optimization algorithms. arXiv preprint learning by mixture of experts. In: 2019 international conference
arXiv:1707.06347 on robotics and automation (ICRA). IEEE, pp 7892–7898
146. Schwarzer M, Anand A, Goel R, Hjelm RD, Courville A, 165. Tenorio-González AC, Morales EF (2018) Automatic discovery
Bachman P (2020) Data-efficient reinforcement learning with of concepts and actions. Expert Syst Appl 92:192–205
self-predictive representations. arXiv preprint arXiv:2007.05929 166. Tenorio-Gonzalez AC, Morales EF, Villasenor-Pineda L (2010)
147. Seo Y, Chen L, Shin J, Lee H, Abbeel P, Lee K (2021) State Dynamic reward shaping: training a robot by voice. In: Ibero-
entropy maximization with random encoders for efficient explo- American conference on artificial intelligence. Springer, pp 483–
ration. arXiv preprint arXiv:2102.09430 492
148. Serafini L, Garcez Ad (2016) Logic tensor networks: deep learn- 167. Terasawa R, Ariki Y, Narihira T, Tsuboi T, Nagasaka K (2020)
ing and logical reasoning from data and knowledge. arXiv preprint 3D-CNN based heuristic guided task-space planner for faster
arXiv:1606.04422 motion planning. In: IEEE international conference on robotics
149. Shi W, Song S, Wu C (2019) Soft policy gradient method for and automation (ICRA), pp 9548–9554
maximum entropy deep reinforcement learning. In: International 168. Tesauro G (1992) Practical issues in temporal difference learning.
joint conference on artificial intelligence (IJCAI), pp 3425–3431 In: Advances in neural information processing systems (NIPS),
150. Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M pp 259–266
(2014) Deterministic policy gradient algorithms. In: International 169. Thrun S, Schwartz A (1993) Issues in using function approxima-
conference on machine learning (ICML) tion for reinforcement learning. In: Connectionist models summer
school

123
Intelligent Service Robotics (2021) 14:773–805 805

170. To T, Tremblay J, McKay D, Yamaguchi Y, Leung K, Balanon A, 185. Wu C, Zeng R, Pan J, Wang CC, Liu YJ (2019) Plant phenotyp-
Cheng J, Birchfield S (2018) Ndds: Nvidia deep learning dataset ing by deep-learning-based planner for multi-robots. IEEE Robot
synthesizer. https://github.com/NVIDIA/Dataset_Synthesizer Autom Lett 4(4):3113–3120
171. Tobin J, Fong R, Ray A, Schneider J, Zaremba W, Abbeel P (2017) 186. Wu X, Sahoo D, Hoi SC (2020) Recent advances in deep learning
Domain randomization for transferring deep neural networks from for object detection. Neurocomputing 396:39–64
simulation to the real world. In: IEEE/RSJ international confer- 187. Xiang Y, Schmidt T, Narayanan V, Fox D (2017) PoseCNN: a
ence on intelligent robots and systems (IROS), pp 23–30 convolutional neural network for 6D object pose estimation in
172. Todorov E, Erez T, Tassa Y (2012) Mujoco: a physics engine for cluttered scenes. arXiv preprint arXiv:1711.00199
model-based control. In: 2012 IEEE/RSJ international conference 188. Xu H, Gao Y, Yu F, Darrell T (2017) End-to-end learning of driving
on intelligent robots and systems. IEEE, pp 5026–5033 models from large-scale video datasets. In: IEEE conference on
173. Tremblay J, To T, Birchfield S (2018) Falling things: a synthetic computer vision and pattern recognition (CVPR), pp 2174–2182
dataset for 3D object detection and pose estimation. In: IEEE 189. Yang C, Liu Y, Zell A (2021) Relative camera pose estimation
conference on computer vision and pattern recognition (CVPR) using synthetic data with domain adaptation via cycle-consistent
workshops, pp 2038–2041 adversarial networks. J Intell Robot Syst. https://doi.org/10.1007/
174. Tremblay J, To T, Sundaralingam B, Xiang Y, Fox D, Birchfield S s10846-021-01439-6
(2018) Deep object pose estimation for semantic robotic grasping 190. Yarats D, Fergus R, Lazaric A, Pinto L (2021) Reinforcement
of household objects. arXiv preprint arXiv:1809.10790 learning with prototypical representations
175. Ugurlu H, Kalkan S, Saranli A (2021) Reinforcement learning 191. Zhang C, Huh J, Lee DD (2018) Learning implicit sampling
versus conventional control for controlling a planar bi-rotor plat- distributions for motion planning. In: IEEE/RSJ international con-
form with tail appendage. J Intell Robot Syst. https://doi.org/10. ference on intelligent robots and systems (IROS), pp 3654–3661
1007/s10846-021-01412-3 192. Zhang J, Cheung B, Finn C, Levine S, Jayaraman D (2020)
176. Van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learn- Cautious adaptation for reinforcement learning in safety-critical
ing with double q-learning. arXiv preprint arXiv:1509.06461 settings. In: International conference on machine learning. PMLR,
177. Vasquez-Gomez JI, Troncoso D, Becerra I, Sucar E, Murrieta- pp 11055–11065
Cid R (2021) Next-best-view regression using a 3D convolutional 193. Zhang J, Tai L, Yun P, Xiong Y, Liu M, Boedecker J, Burgard W
neural network. Mach Vis Appl 32(42):1–14. https://doi.org/10. (2019) VR-goggles for robots: real-to-sim domain adaptation for
1007/s00138-020-01166-2 visual control. IEEE Robot Autom Lett RA-L 4(2):1148–1155
178. Wang H, Yeung DY (2020) A survey on Bayesian deep learning. 194. Zhang S, Liu B, Whiteson S (2020) Per-step reward: a new per-
ACM Comput Surv CSUR 53(5):1–37 spective for risk-averse reinforcement learning. arXiv preprint
179. Wang Z, Chen C, Li HX, Dong D, Tarn TJ (2019) Incremen- arXiv:2004.10888
tal reinforcement learning with prioritized sweeping for dynamic 195. Zhou T, Tulsiani S, Sun W, Malik J, Efros AA (2016) View syn-
environments. IEEE/ASME Trans Mechatron 24(2):621–632 thesis by appearance flow. In: European conference on computer
180. Wang Z, Reed Garrett C, Pack Kaelbling L, Lozano-Pérez T vision (ECCV), pp 286–301
(2021) Learning compositional models of robot skills for task 196. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-
and motion planning. Int J Robot Res 40(6–7):866–894. https:// image translation using cycle-consistent adversarial networks. In:
doi.org/10.1177/02783649211004615 IEEE international conference on computer vision (ICCV), pp
181. Wang Z, Schaul T, Hessel M, Hasselt H, Lanctot M, Freitas 2223–2232
N (2016) Dueling network architectures for deep reinforce- 197. Zhu Z, Lin K, Zhou J (2020) Transfer learning in deep reinforce-
ment learning. In: International conference on machine learning ment learning: a survey. arXiv preprint arXiv:2009.07888
(ICML), pp 1995–2003 198. Zhu Z, Lin K, Zhou J (2020) Transfer learning in deep reinforce-
182. Watkins CJ, Dayan P (1992) Q-learning. Mach Learn 8(3–4):279– ment learning: a survey. arXiv preprint arXiv:2009.07888
292
183. Wellhausen L, Dosovitskiy A, Ranftl R, Walas K, Cadena C, Hut-
ter M (2019) Where should i walk? Predicting terrain properties
Publisher’s Note Springer Nature remains neutral with regard to juris-
from images via self-supervised learning. IEEE Robot Autom Lett
dictional claims in published maps and institutional affiliations.
4(2):1509–1516
184. Williams RJ (1992) Simple statistical gradient-following algo-
rithms for connectionist reinforcement learning. Mach Learn
8(3–4):229–256

