paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A specially designed memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstration to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalk's generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics; in particular, it is able to follow long instructions better. The code and the datasets are released on our project page: https://github.com/Sha-Lab/babywalk.

1 Introduction

Autonomous agents such as household robots need to interact with the physical world in multiple modalities. As an example, in vision-and-language navigation (VLN) (Anderson et al., 2018), the agent moves around in a photo-realistic simulated environment (Chang et al., 2017) by following a sequence of natural language instructions. To infer its whereabouts so as to decide its moves, the agent must understand and follow the instructions; indeed, following the instructions is one of the most crucial skills for VLN agents to acquire. Jain et al. (2019) show that VLN agents trained on the originally proposed dataset ROOM2ROOM (R2R hereafter) do not follow the instructions, despite having achieved high success rates of reaching the navigation goals. They proposed two remedies: a new dataset, ROOM4ROOM (or R4R), that doubles the path lengths in R2R, and a new evaluation metric, Coverage weighted by Length Score (CLS), that measures more closely whether the ground-truth paths are followed. They showed that optimizing the fidelity of following instructions leads to agents with desirable behavior. Moreover, the long paths in R4R are informative in identifying agents that score higher on such fidelity measures.

In this paper, we investigate another crucial aspect of following instructions: can a VLN agent generalize to following longer instructions by learning from shorter ones? This aspect has important implications for real-world applications, as collecting annotated long sequences of instructions and training on them can be costly. Thus, it is highly desirable to have this generalization ability. After all, it seems that humans can achieve this effortlessly.¹

To this end, we have created several datasets of longer navigation tasks, inspired by R4R (Jain et al., 2019). We train VLN agents on R4R and use the agents to navigate in ROOM6ROOM (i.e., R6R) and ROOM8ROOM (i.e., R8R). We contrast their performance with that of agents trained on those datasets directly ("in-domain"). The results

∗ Authors contributed equally.
† On leave from University of Southern California.
¹ Anecdotally, we do not have to learn from long navigation experiences. Instead, we extrapolate from our experiences of learning to navigate shorter distances or smaller spaces (perhaps a skill we learned when we were babies or kids).
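The CLS metric discussed above can be sketched in a few lines. The following is a minimal sketch of Coverage weighted by Length Score as defined by Jain et al. (2019): path coverage PC averages, over reference nodes, an exponentially decaying closeness to the predicted path, and a length score LS penalizes deviation from the expected path length. The `dist` callable (node-to-node distance in the navigation graph) and the 3-meter threshold `d_th` are assumptions about the environment, not details given in this paper.

```python
import math

def path_coverage(pred, ref, dist, d_th=3.0):
    # PC(P, R): mean over reference nodes of exp(-d(r, P) / d_th),
    # where d(r, P) is the distance from r to the nearest node on the
    # predicted path. `dist(a, b)` is assumed to be the graph distance.
    return sum(
        math.exp(-min(dist(r, p) for p in pred) / d_th) for r in ref
    ) / len(ref)

def path_length(path, dist):
    # Total length of a path given as a sequence of nodes.
    return sum(dist(a, b) for a, b in zip(path, path[1:]))

def cls_score(pred, ref, dist, d_th=3.0):
    # CLS = PC * LS (Jain et al., 2019). LS compares the predicted
    # path length against the expected length EPL = PC * length(R).
    pc = path_coverage(pred, ref, dist, d_th)
    expected = pc * path_length(ref, dist)
    actual = path_length(pred, dist)
    ls = expected / (expected + abs(expected - actual)) if expected > 0 else 0.0
    return pc * ls
```

A perfect trajectory (identical to the reference) scores 1.0; both cutting the path short and overshooting it lower the score, which is what makes CLS a fidelity measure rather than a goal-reaching one.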
the pairs of a BABY-S TEP and the corresponding