

© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any
current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating
new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in
other works.

Published article:
M. Kobayashi, J. Yamada, M. Hamaya, K. Tanaka, “LfDT: Learning Dual-Arm Manipulation from Demonstration Translated
from a Human and Robotic Arm,” The 2023 IEEE-RAS International Conference on Humanoid Robots (Humanoids 2023),
2023.
LfDT: Learning Dual-Arm Manipulation from Demonstration Translated from a Human and Robotic Arm

Masato Kobayashi∗1, Jun Yamada∗2, Masashi Hamaya3, Kazutoshi Tanaka3

∗ Equal contribution.
1 Osaka University, kobayashi.masato.cmc@osaka-u.ac.jp; work done during an internship at OMRON SINIC X Corporation.
2 University of Oxford, jyamada@robots.ox.ac.uk; work done during an internship at OMRON SINIC X Corporation.
3 OMRON SINIC X Corporation, Hongo 5-24-5, Bunkyo-ku, Tokyo, Japan, {masashi.hamaya, kazutoshi.tanaka}@sinicx.com.
† This work was supported by JST, PRESTO Grant Number JPMJPR22C6, Japan.

Abstract— Imitation learning (IL) is a promising method for programming dual-arm manipulation easily by imitating demonstrations from human experts. However, IL for dual-arm manipulation is still challenging because operating two robotic arms to collect demonstrations requires considerable effort. Therefore, we present a novel IL framework for dual-arm manipulation: learning dual-arm manipulation from demonstration translated from a human and robotic arm (LfDT). LfDT collects demonstrations of one human and one robotic arm; thus, a human expert can easily and precisely adjust their arm movements according to the movement of the robotic arm. Because IL methods typically demand demonstrations of two robotic arms, LfDT employs a domain-translation network to convert the demonstrations of one human and one robotic arm into demonstrations of two robotic arms, which are then used to learn dual-arm manipulation via IL. The experiments demonstrate that LfDT successfully converts the demonstrations and learns dual-arm manipulation in both simulation and the real world.

I. INTRODUCTION
Many real-world tasks such as folding laundry [1], kitchen support [2], and industrial assembly [3] require dual-arm manipulation. Dual-arm manipulation often requires two robotic arms to precisely adjust their movements relative to each other for cooperation. For example, as demonstrated in Fig. 1, in a 2D pushing task, one arm must slide its end-effector synchronously with the other arm to cooperatively move the object to a target position. If one of the arms slides its end-effector too quickly or too slowly, the object is likely to rotate in place, and the arms will fail to complete the task. While dual-arm manipulation is essential to broaden the scope of robotic applications in the real world, manually programming dual-arm manipulation is often time-consuming due to the precise cooperation required of the arms.

Fig. 1. Overview of LfDT. LfDT collects demonstrations of dual-arm manipulations performed by one human arm of an expert and one robotic arm operated by the other expert. Unlike other methods that collect demonstrations of two robotic arms operated together by one expert, LfDT can collect the demonstrations easily because the human expert can easily and precisely adjust their arm movements according to the movement of the robotic arm.

Imitation learning (IL) is a core research topic for learning robotic manipulation from demonstrations collected by human experts [4], [5] instead of manually programming the manipulation. Passive observation collects such demonstrations by observing a human expert [6]. However, passive observation makes it difficult to learn dual-arm manipulation, which requires precise coordination of two robotic arms. Kinesthetic teaching collects demonstrations through physical guidance of a robotic arm by an expert [7], [8], [9], [10]. However, IL for dual-arm manipulation is still challenging because operating two robotic arms to collect demonstrations requires considerable effort. Teleoperation allows an expert to intuitively operate two robotic arms to collect demonstrations of dual-arm manipulation; however, it requires a specific device [4].

Operating multiple robotic arms together by a human expert to collect demonstrations requires considerable cognitive effort. Therefore, Tung et al. presented an IL framework in which multiple human experts operated multiple robotic arms to collect demonstrations, each expert easily operating one robotic arm using a smartphone as a six degrees-of-freedom (DoF) motion controller [11]. However, it is difficult to precisely adjust the movements of a robotic arm according to the movements of the other arms using a smartphone.
For these reasons, we present a novel IL framework, learning dual-arm manipulation from demonstration translated from a human and robotic arm (LfDT), to easily collect demonstrations of dual-arm manipulation. This framework collects demonstrations of one human and one robotic arm, as illustrated in Fig. 1. A human can intuitively and easily adjust the movement of their arm to cooperate with another entity, even a robotic arm, in manipulation tasks. Owing to this adjustment by the human arm, the other expert does not have to operate the robotic arm precisely. Thus, leveraging the human arm enables these experts to collect demonstrations of dual-arm manipulation easily.

Although IL methods, such as behavior cloning (BC), typically demand demonstrations of all robotic arms, LfDT collects demonstrations of one human and one robotic arm. Generally, demonstrations by a human arm cannot be replicated directly by a robotic arm due to the physical differences between the human and robotic arm, such as range of motion, joint velocity, and joint torque. Therefore, LfDT employs a domain-translation framework to convert demonstrations of one human and one robotic arm into demonstrations of two robotic arms and then learns the dual-arm manipulation by imitating the converted demonstrations with BC.

The contributions of our work are three-fold: (1) We propose LfDT, an IL framework for learning dual-arm manipulation from demonstrations translated from a human and robotic arm; (2) We introduce a domain-translation framework that converts the demonstrations of one human and one robotic arm into demonstrations of two robotic arms; (3) We empirically demonstrate that LfDT successfully learns both simulated and real-world dual-arm manipulation.

II. RELATED WORKS

A. Learning dual-arm manipulation

Dual-arm manipulation requires two robotic arms to cooperate by precisely adjusting their movements relative to each other. Therefore, compared to single-arm manipulation, dual-arm manipulation requires more advanced planning and control approaches and necessitates substantial engineering effort [3]. To mitigate this issue, reinforcement learning (RL) has been applied to robots for learning dual-arm manipulation [12], [13], [14]. However, RL for dual-arm manipulation requires a large number of samples collected via interaction with an environment, limiting its applications in the real world. Consequently, we employ IL for dual-arm manipulation, which is applicable to robotic tasks in the real world.

B. Imitation learning

IL enables a robotic arm to learn manipulation by imitating the behavior of human experts in demonstrations [15], [16]. One mainstream of IL is BC [4], [17], which uses supervised learning to learn a function that maps an observation to an action from demonstrations of experts. Another mainstream of IL is inverse RL (IRL) [18], [19], [20], which estimates a reward function from the demonstrations. In contrast to IRL, BC does not require any additional trials for training the manipulation. Therefore, in this study, LfDT uses BC, although it is potentially applicable to all IL methods.

Demonstrations in IL are often collected by teleoperation using devices such as virtual reality headsets and hand-tracking systems [4], smartphones [11], 3D motion controllers [21], and keyboards [22]. Prior studies have mostly tackled single-arm manipulation, and only a few studies have applied IL to dual-arm manipulation because operating two robotic arms to collect the demonstrations requires considerable effort.

Laghi et al. introduced a system that can operate two robotic arms using one arm of a human expert [23]. Some studies have introduced bimanual robot teaching systems using inertial measurement unit (IMU) and electromyography (EMG) signals [23] as well as exoskeletons [24]. These frameworks require a specific device to operate two robotic arms for collecting demonstrations.

Tung et al. built a system for collecting demonstrations of multi-arm manipulation by multiple human experts [11]. In their system, each human expert collected demonstrations by operating one robotic arm using a smartphone, which makes it difficult for an expert to precisely adjust the movement of a robotic arm. In contrast, the corresponding expert in LfDT can easily and precisely adjust their arm movement according to the movements of the other robotic arm.

C. Domain transfer

Domain transfer in RL and IL has gained attention for its generalization across tasks [25]. Prior studies have learned state-translation mappings by defining a common state space between two tasks and have employed unsupervised manifold alignment using hand-crafted features [26]. Several prior studies have suggested leveraging a set of proxy tasks to learn the state correspondence or a domain-invariant space for mismatches in morphology [27] and viewpoint [28], [29]. Moreover, a recent study has proposed learning state correspondence from demonstrations of proxy tasks without access to expert actions [30]. Although domain transfer has been used for generalization across tasks, we use domain transfer to convert demonstrations of one human and one robotic arm into demonstrations of two robotic arms to learn dual-arm manipulation.

Cycle-consistency is an essential concept in domain-translation networks [31], [32] for learning bidirectional translation between different domains. CycleGAN [31] introduced cycle-consistency, which enables bidirectional unpaired image-to-image translation with generative adversarial networks [33]. Subsequently, cycle-consistency was extended to video translation [34] and domain adaptation [35]. Dynamics cycle-consistency [32] uses a forward dynamics model to convert states and actions from one domain into those of the other domain. LfDT also uses cycle-consistency for dynamics to convert demonstrations of one human and one robotic arm into demonstrations of two robotic arms.
III. APPROACH

In this section, we describe our approach, LfDT, in detail. Fig. 2 shows an overview. While IL methods generally require demonstrations of all robotic arms for dual-arm manipulation, LfDT collects demonstrations of one human and one robotic arm. Given such demonstrations, LfDT employs a domain-translation framework to translate them into demonstrations of two robotic arms, and it learns dual-arm manipulation from the converted demonstrations using BC. This section covers the problem formulation, data collection, and domain-translation framework.

Fig. 2. Framework overview. Illustration of the overall LfDT procedure. (1) Demonstration: LfDT collects demonstrations of one human and one robotic arm. A human expert first operates Robot1 to manipulate an object and then operates Robot2 to manipulate this object. (2) Translation: Subsequently, LfDT trains the models of the domain-translation framework to convert the collected demonstrations of the human and robotic arm into demonstrations of two robotic arms. (3) Imitation: Finally, LfDT learns a control policy with BC using the converted demonstrations.

A. Problem Formulation

We define the problem as a Markov decision process (MDP) formulated as a tuple (S, A, P, ρ_0) consisting of states s ∈ S, actions a ∈ A, a transition function P(s′ ∈ S | s, a), and an initial state distribution ρ_0.

As mentioned above, LfDT translates demonstrations of one human and one robotic arm into demonstrations of two robotic arms. We define two domains to contain the raw and translated states: domain X for demonstrations of the human and robotic arm, and the target domain Y for deployment of dual-arm manipulation by two robotic arms. X contains the state of the human arm x^H, the state of the robotic arm x^Ri (i = 1, 2), the state of an object x^O, and the action of the robotic arm a^Ri. The generalized coordinates of a robot, joint angles, the pose of the end effector, and visual images can be used as the states, and motor commands and the target pose of the end effector can be used as the action. Y contains the states of the two robotic arms and that of the object, y^R1, y^R2, y^O, and the actions of the arms, a^R1 and a^R2. We define the states in X and Y as x = (x^H, x^Ri, x^O) and y = (y^R1, y^R2, y^O), and the action in Y as a = (a^R1, a^R2). We define the MDPs in X and Y as M_x = (S_x, A_x, P_x, ρ_0x) and M_y = (S_y, A_y, P_y, ρ_0y), respectively. Our goal is to learn a policy π : S_y → A_y for dual-arm manipulation. Specifically, LfDT learns π using BC by minimizing E[||a − π(y)||²].
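To make this BC step concrete, the following is a minimal PyTorch sketch of the objective E[||a − π(y)||²]; the network architecture, dimensions, and variable names are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative dimensions for y = (y^R1, y^R2, y^O) and a = (a^R1, a^R2).
STATE_DIM, ACTION_DIM = 20, 6

# A simple MLP policy pi : S_y -> A_y (architecture assumed for illustration).
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_loss(y, a):
    """E[||a - pi(y)||^2] over a batch of translated state-action pairs."""
    return ((a - policy(y)) ** 2).sum(dim=-1).mean()

# One gradient step on a batch of translated demonstrations (random stand-ins).
y_batch = torch.randn(32, STATE_DIM)   # translated states y
a_batch = torch.randn(32, ACTION_DIM)  # translated actions a
loss = bc_loss(y_batch, a_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```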

B. Data collection

LfDT collects demonstrations of cooperation between the human arm and Robot1 as well as between the human arm and Robot2, as illustrated in Fig. 1. Collecting demonstrations of the human acting as both Robot1 and Robot2 helps to account for the different roles of these robots in the manipulation. As mentioned before, a human expert can intuitively manipulate an object by hand and precisely adjust its movement in response to the movement of the robotic arm. Thus, we can collect demonstrations more easily than by operating two robotic arms simultaneously, which would demand considerable cognitive effort from the operating expert.

LfDT also collects random motion data in domains X and Y to train a generator G, a discriminator D, a forward dynamics model F, and an inverse dynamics model F⁻¹, as described in Sec. III-C. Random action commands are sent to move the robotic arms, and the resulting movement data is recorded.
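As a sketch, collecting this random motion data could look like the loop below; `env` is a hypothetical interface (its reset/step methods are assumptions, not the actual control stack used in the paper), and the dataset simply stores (state, action, next state) transitions.

```python
import numpy as np

def collect_random_motion(env, n_samples, action_dim, action_scale=1.0):
    """Send random action commands and record the resulting transitions.

    `env` is a hypothetical interface: reset() -> state, step(a) -> next state.
    """
    transitions = []
    state = env.reset()
    for _ in range(n_samples):
        action = np.random.uniform(-action_scale, action_scale, size=action_dim)
        next_state = env.step(action)
        transitions.append((state, action, next_state))
        state = next_state
    return transitions
```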
C. Domain-translation framework

LfDT does not directly collect any demonstrations of two robotic arms in domain Y. That is, there is no paired data of the state of two robotic arms y and the state of a human and robotic arm x. Thus, we cannot apply supervised learning to learn a function that maps x to y. Therefore, LfDT converts demonstrations of the human and robotic arm in X into demonstrations of two robotic arms in Y using a domain-translation framework learned from unpaired data in domains X and Y, as illustrated in Fig. 2. Then, we employ IL, specifically BC, to learn a policy π for dual-arm manipulation from the converted demonstrations of the two robotic arms.
We define the problem as a Markov decision process employ IL, specifically BC, to learn a policy π for dual-
(MDP) formulated as a tuple (S, A, P, ρ0 ) consisting of arm manipulation from the converted demonstrations of the
states s ∈ S, actions a ∈ A, transition function P (s′ ∈ two robotic arms.
S|s, a), and initial state distribution ρ0 . 1) Domain-translation: To convert demonstrations of one
As mentioned above, LfDT translates demonstrations of human and one robotic arm into demonstrations of two
one human and one robotic arm into demonstrations of robotic arms, we train a state translation function that maps
two robotic arms. We define two domains to contain raw states xt ∈ X to states yt ∈ Y. Through adversarial train-
and translated states: domain X for demonstrations of the ing [33], a generator G learns to map x onto the distribution
human and robotic arm; and the target domain, domain Y of y as ŷ to deceive a discriminator D. Meanwhile, D is
for deployment of dual-arm manipulation by two robotic trained to distinguish between generated ŷt and real yt input
arms. X contains the state of the human arm xH , the state samples:
  min_G max_D L_adv(G, D) = E_{y∼p(y)}[log D(y)] + E_{x∼p(x)}[log(1 − D(G(x)))],   (1)

where p(∗) is a distribution over data ∗.

In addition to demonstrations in domain X and random motion data in Y, random motion data in X involving states of one human and one robotic arm are used to avoid overfitting G. Since we collect demonstrations and random motion data of Robot1-Human and Human-Robot2 separately, we concatenate the states x_t ∈ X with a one-hot vector representing either Robot1-Human or Human-Robot2 to construct the input to G.
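In practice, the min-max objective of Eq. (1) is typically optimized by alternating discriminator and generator updates. The following is a minimal PyTorch sketch using the binary cross-entropy form of the adversarial loss and the one-hot conditioning described above; all dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

X_DIM, Y_DIM, COND_DIM = 16, 20, 2   # illustrative; COND_DIM = one-hot size

# Generator G : (x, one-hot) -> y_hat, and discriminator D : y -> logit.
G = nn.Sequential(nn.Linear(X_DIM + COND_DIM, 256), nn.LeakyReLU(),
                  nn.Linear(256, 256), nn.LeakyReLU(),
                  nn.Linear(256, Y_DIM))
D = nn.Sequential(nn.Linear(Y_DIM, 256), nn.LeakyReLU(),
                  nn.Linear(256, 256), nn.LeakyReLU(),
                  nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=3e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=3e-4)

x = torch.randn(32, X_DIM)       # states from domain X (demos + random motion)
cond = torch.eye(COND_DIM)[torch.randint(0, COND_DIM, (32,))]  # Robot1-Human vs. Human-Robot2
y_real = torch.randn(32, Y_DIM)  # unpaired states from domain Y

# Discriminator step: distinguish real y from generated y_hat.
y_hat = G(torch.cat([x, cond], dim=-1)).detach()
d_loss = bce(D(y_real), torch.ones(32, 1)) + bce(D(y_hat), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: map x onto the distribution of y to fool D.
y_hat = G(torch.cat([x, cond], dim=-1))
g_loss = bce(D(y_hat), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```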
2) Dynamics-consistency: If LfDT employed the domain translation without any constraints, the converted state ŷ might not be consistent with the state transition P_Y. Therefore, to make ŷ_t and ŷ_{t+1} consistent with P_Y, we apply a dynamics-consistency loss inspired by a previous study [32]. Specifically, if the translated states ŷ_t (= G(x_t)) and ŷ_{t+1} (= G(x_{t+1})) are consistent, then, ideally, the state ŷ_{t+1} would be the same as the ỹ_{t+1} predicted by the forward dynamics model from the input ŷ_t and ã_t such that ỹ_{t+1} = F(ŷ_t, ã_t), where ã_t is the action predicted by the inverse dynamics model given ŷ_t and ŷ_{t+1} such that ã_t = F⁻¹(ŷ_t, ŷ_{t+1}). Thus, to train G with consistent dynamics, LfDT minimizes the objective:

  L_dyn = E[||ỹ_{t+1} − ŷ_{t+1}||²]
        = E[||F(ŷ_t, ã_t) − ŷ_{t+1}||²]
        = E[||F(ŷ_t, F⁻¹(ŷ_t, ŷ_{t+1})) − ŷ_{t+1}||²].   (2)

The models F and F⁻¹ are trained using the following objectives:

  min_F L_fwd(F) = E[||y_{t+1} − F(y_t, a_t)||²],   (3)

  min_{F⁻¹} L_inv(F⁻¹) = E[||a_t − F⁻¹(y_t, y_{t+1})||²].   (4)
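Continuing the sketch above (reusing G, X_DIM, Y_DIM, and COND_DIM), the dynamics-consistency loss of Eq. (2) chains the inverse and forward models through consecutive translated states. Here F and F⁻¹ stand in for models pretrained on random motion data in Y via Eqs. (3) and (4) and frozen while G trains; the architectures and dimensions are again illustrative.

```python
A_DIM = 6  # illustrative action dimension in domain Y

# Stand-ins for the pretrained dynamics models on domain Y:
#   F     : (y_t, a_t)     -> y_{t+1}   (forward dynamics, Eq. (3))
#   F_inv : (y_t, y_{t+1}) -> a_t       (inverse dynamics, Eq. (4))
F = nn.Sequential(nn.Linear(Y_DIM + A_DIM, 256), nn.LeakyReLU(),
                  nn.Linear(256, Y_DIM))
F_inv = nn.Sequential(nn.Linear(2 * Y_DIM, 256), nn.LeakyReLU(),
                      nn.Linear(256, A_DIM))
for p in list(F.parameters()) + list(F_inv.parameters()):
    p.requires_grad_(False)  # frozen while training G

def dynamics_cycle_loss(x_t, x_t1, cond):
    """L_dyn = E[||F(y_hat_t, F_inv(y_hat_t, y_hat_t1)) - y_hat_t1||^2]."""
    y_hat_t = G(torch.cat([x_t, cond], dim=-1))
    y_hat_t1 = G(torch.cat([x_t1, cond], dim=-1))
    a_tilde = F_inv(torch.cat([y_hat_t, y_hat_t1], dim=-1))  # predicted action
    y_tilde_t1 = F(torch.cat([y_hat_t, a_tilde], dim=-1))    # predicted next state
    return ((y_tilde_t1 - y_hat_t1) ** 2).sum(dim=-1).mean()
```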
3) Partial identity mapping: In addition to the dynamics-consistency constraint, we also introduce a partial identity mapping that partially constrains the generator. In particular, since the state x_t in domain X includes the states of both the human and robotic arm, the converted state of Robot{i}, ŷ_t^Ri (⊂ ŷ_t), must have values close to the raw state of the same arm, x_t^Ri (⊂ x_t), in the demonstrations. Thus, we add the following objective function to constrain G:

  min_G L_id(G) = E[||ŷ^Ri − x^Ri||²].   (5)

Note that ŷ^Ri and x^Ri have the same state types of Robot{i}.
4) Full objective: In summary, our full objective is as follows:

  L_full = λ_adv L_adv(G, D) + λ_dyn L_dyn(G) + λ_id L_id(G),   (6)

where λ_adv, λ_dyn, and λ_id are coefficients balancing the loss terms. LfDT learns F and F⁻¹ from the random motion data in Y before training G and D.
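Putting the pieces together, one generator update under the full objective of Eq. (6) could look as follows, continuing the sketches above. The coefficient values and the index slice selecting the Robot{i} coordinates are placeholders; in particular, assuming the robot-arm state occupies the same indices in x and ŷ is a simplification for illustration only.

```python
LAMBDA_ADV, LAMBDA_DYN, LAMBDA_ID = 1.0, 1.0, 1.0  # placeholder weights

def partial_identity_loss(x_t, cond, robot_slice):
    """L_id = E[||y_hat^Ri - x^Ri||^2] (Eq. (5)): the translated robot state
    must stay close to the raw state of the same robot arm."""
    y_hat = G(torch.cat([x_t, cond], dim=-1))
    return ((y_hat[:, robot_slice] - x_t[:, robot_slice]) ** 2).sum(dim=-1).mean()

# One generator step on L_full (F and F_inv stay frozen; D is fixed here).
robot_slice = slice(0, 6)  # placeholder index of the Robot{i} coordinates
x_t, x_t1 = torch.randn(32, X_DIM), torch.randn(32, X_DIM)
cond = torch.eye(COND_DIM)[torch.randint(0, COND_DIM, (32,))]

adv = bce(D(G(torch.cat([x_t, cond], dim=-1))), torch.ones(32, 1))
full_loss = (LAMBDA_ADV * adv
             + LAMBDA_DYN * dynamics_cycle_loss(x_t, x_t1, cond)
             + LAMBDA_ID * partial_identity_loss(x_t, cond, robot_slice))
opt_g.zero_grad(); full_loss.backward(); opt_g.step()
```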
IV. EXPERIMENTS

In this section, we evaluate LfDT to verify whether it successfully converts demonstrations of one human and one robotic arm into demonstrations of two robotic arms and learns dual-arm manipulation by employing IL on the converted demonstrations, in both simulated and real-world experiments.

Fig. 3. Demonstrations of the human and robotic arm and dual-arm manipulation. The left and middle columns show demonstrations of the human and robotic arm. (a) Two-dimensional (2D) dual-arm box push: Robot1 and Robot2 have three links, and the green box needs to be pushed by the two arms towards a black rectangular goal position. We collected demonstrations of either Robot1 or Robot2 together with a robotic arm with four links to represent physical differences between the human and robotic arm. (b) Dual-arm block push: Two robotic arms need to push the elongated block onto the green line. (c) Dual-arm peg insertion: The peg attached to Robot2 must be inserted into the hole attached to Robot1.

A. Simulated experiment setup

We first evaluated LfDT on "2D dual-arm box push," simulated using the MuJoCo physics engine [36], as shown in Fig. 3(a). In this task, Robot1 and Robot2 need to cooperatively push a box to a goal while avoiding rotation of the box. Robot1 and Robot2 consist of three links in domain Y, whereas one of the robots that demonstrates the task consists of four links to represent physical differences between the human and robotic arm. The states x^Ri and y^Ri include the joint angles, joint angular velocities, and end position of the robots, and x^O and y^O contain the position and orientation of the box. The robots are controlled by applying joint torques, defined as a^Ri.

In this experiment, we added random noise to the initial joint angles of both arms and to the initial position and orientation of the box. The task is considered successful when the distance between the center of the box and the goal position and the rotation of the box are both smaller than their thresholds. The demonstrations were collected from expert policies trained by soft actor-critic (SAC) [37]. In the simulated experiment, we collected 0.4M samples of random motion data for each of domains X and Y. In addition, we also collected 1.6K samples, which contain 200 expert demonstrations of a human and Robot1 and 200 of the human and Robot2.

B. Training details

F and F⁻¹ consist of four fully connected layers of 256 hidden units with leaky ReLU activation functions. G and D comprise three fully connected layers of 256 hidden units with leaky ReLU nonlinearity. We used a learning rate of 0.0003, the Adam optimizer, and a batch size of 32 for all experiments.
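As a sketch, the architecture described above could be instantiated as follows; the input and output dimensions are illustrative, while the layer counts, hidden width, leaky ReLU activations, Adam optimizer, learning rate, and batch size follow the text.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256, n_layers=4):
    """Fully connected network with leaky ReLU activations between layers."""
    layers, d = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(d, hidden), nn.LeakyReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

Y_DIM, A_DIM, X_DIM, COND_DIM = 20, 6, 16, 2      # illustrative dimensions
F = mlp(Y_DIM + A_DIM, Y_DIM, n_layers=4)         # forward dynamics, 4 FC layers
F_inv = mlp(2 * Y_DIM, A_DIM, n_layers=4)         # inverse dynamics, 4 FC layers
G = mlp(X_DIM + COND_DIM, Y_DIM, n_layers=3)      # generator, 3 FC layers
D = mlp(Y_DIM, 1, n_layers=3)                     # discriminator, 3 FC layers

optimizers = [torch.optim.Adam(m.parameters(), lr=0.0003) for m in (F, F_inv, G, D)]
BATCH_SIZE = 32
```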
C. Baselines

To leverage IL using demonstrations of one human and one robotic arm, the physical differences between the human and robotic arm require a method to convert the demonstrations into demonstrations of two robotic arms. However, to the best of our knowledge, no other method can be used to convert them except LfDT. Thus, we compare LfDT against three ablated baselines: BC without translation, LfDT without dynamics-consistency, and LfDT without partial identity mapping.

BC without translation learns a policy π^R1 and another policy π^R2 such that a^R1 = π^R1(s^R1, s^O) and a^R2 = π^R2(s^R2, s^O) from {(x^R1, x^O, a^R1)} and {(x^R2, x^O, a^R2)}, respectively. These two policies are trained separately, without our domain-translation framework, to perform the task. Since the demonstrations of Robot1-Human and Human-Robot2 are independent of each other, we hypothesize that the policies π^R1 and π^R2 struggle to learn coordinated behavior. If BC without translation does not perform well but LfDT successfully learns the task, it implies that the translation of the demonstrations in LfDT helps IL learn coordinated behavior in dual-arm manipulation.

LfDT without dynamics-consistency and LfDT without partial identity mapping first convert demonstrations of one human and one robotic arm into demonstrations of robotic arms, similar to LfDT. However, we remove the dynamics-consistency or partial identity constraint from the domain-translation network, respectively, to verify the importance of these constraints.

In the simulated experiment, we evaluate the methods based on the task success rate. Across all methods, we train the BC policies using 90% of the demonstrations for training and 10% for validation, and we evaluate the policies at the minimum validation loss.

D. Simulation results

TABLE I. Success rates of 2D dual-arm box push.

#demonstrations                  50       100      200
BC w/o translation               28.3%    6.7%     26.7%
LfDT w/o dynamics-consistency    51.7%    61.7%    51.7%
LfDT w/o partial identity        0%       0%       0%
LfDT                             80.0%    81.7%    71.7%

As shown in Tab. I, LfDT outperforms the ablated baselines by a significant margin. The table reports the success rates of our method and the ablated baselines over 20 trials, averaged over three different seeds. This result implies that our domain-translation framework successfully translates demonstrations of one human and one robotic arm into demonstrations of the two robotic arms in a manner consistent with the dynamics of the domain of the robotic arms.

On the other hand, BC without translation often fails the task because the policy cannot learn coordinated behavior by imitating independent demonstrations of Robot1 and Robot2. We also ablate LfDT on dynamics-consistency and partial identity mapping to verify the importance of these constraints. The success rate of LfDT without dynamics-consistency drops significantly, indicating that the dynamics-consistency constraint contributes to translating dynamically feasible demonstrations of Robot1 and Robot2. Furthermore, LfDT without partial identity mapping completely fails the task, suggesting that the partial identity mapping constraint, which enables the generator to output robot states close to those in the demonstrations, is an essential component.

E. Real robot experiment

In this section, we verify whether LfDT works in a real-robot setup. We use two small-scale 6-DoF robotic arms (myCobot, Elephant Robotics) for the "Dual-arm block push" and "Dual-arm peg insertion" experiments, as shown in Fig. 3. In Dual-arm block push, the two robotic arms push a block onto the green line. In Dual-arm peg insertion, Robot2 inserts a peg attached to it into the hole attached to Robot1. The hand of the human expert and the end effectors of the robotic arms move in the horizontal plane in Dual-arm block push and in the vertical plane in Dual-arm peg insertion, respectively.

We set the position of the end of the robotic arms as x^Ri and y^Ri, the position of the hand of the human expert as x^H, the position and orientation of the object as x^O and y^O, and the target position of the end effector as a^Ri. We expected LfDT to translate demonstrations while accounting for the difference between the pose of the object in the human hand and that in the robotic gripper. The robotic arms received target joint angles, which were calculated from the target position using inverse kinematics, and moved their joints toward the target angles. To observe the poses of the human hand and the object, two markers were attached to the hand and the object and tracked using a camera (RealSense D435, Intel) and AprilTag [38]. One human expert collects the random motion data and demonstrations. When collecting demonstrations, the robotic arm replays a motion programmed in advance instead of being teleoperated in real time. We collect 3300 random motion data samples in X and Y as well as 390 demonstration data samples, which contain ten demonstrations of Robot1-Human and ten demonstrations of Human-Robot2 in X, for each of Dual-arm block push and Dual-arm peg insertion. We evaluate the effectiveness of LfDT by the task success rate over ten trials. We compare LfDT and BC without translation to evaluate the effects of the state translation.
The models are trained using the same parameters as in the simulations, described in Sec. IV-B.

Fig. 4. Snapshots of demonstrations and the deployment of (a) dual-arm block push and (b) dual-arm peg insertion. LfDT successfully learns and executes the manipulation of real robots from the human-arm and robotic-arm demonstrations.

Fig. 4 shows snapshots of the demonstrations of the human and robotic arm as well as the deployment of the tasks by the robotic arms. LfDT shows 100% (= 10/10) and 80% (= 8/10) success rates for Dual-arm block push and Dual-arm peg insertion, respectively, while BC without translation showed 100% (= 10/10) and 20% (= 2/10) success rates for the same tasks. These results indicate that LfDT works in a real-robot setup.

V. DISCUSSION

In the simulations, Robot1 and Robot2, each with three links, learned the 2D dual-arm box push task using LfDT from demonstrations in which one of the arms had four links. That is, by leveraging LfDT, robotic arms can learn dual-arm manipulation from demonstrations of arms with physical differences, such as human arms. Based on the successful results in the simulated experiments, we further applied our method to real-world dual-arm manipulation. As described in Section IV-E, LfDT successfully learns the Dual-arm block push and Dual-arm peg insertion tasks by converting demonstrations of one human and one robotic arm into demonstrations of two robotic arms. The dual arms using LfDT succeeded in the tasks through the coordinated movements that the tasks require.

LfDT can translate demonstrations of a human source arm into those of a target robotic arm with a different morphology from the source arm, without requiring the actions of the source arm. Therefore, LfDT can also translate demonstrations of a source robot arm, such as a different serial-link arm, a parallel-link arm, or a continuum robot, into those of the target serial-link arm. Furthermore, LfDT can potentially be extended to multi-arm manipulation with more than two robotic arms, in which case LfDT would significantly reduce the effort of demonstration collection.

While LfDT successfully converts demonstrations of the human and robotic arm into demonstrations of two robotic arms, it leverages a large amount of random motion data to generate accurate forward and inverse dynamics models. These dynamics models enable LfDT to translate the demonstrations while satisfying the dynamics consistency and to improve its performance, as shown in Tab. I. Also, LfDT learns quasi-static manipulation in this study. Using a parametric and non-parametric hybrid model [39], online adaptation [40], transfer learning [41], meta-learning [42], and continual learning [43] would enable training these models with a smaller amount of random motion data. Therefore, in future work, we will combine LfDT with these methods to learn more complex manipulations, such as dual-arm object grasping and dynamic manipulation.
VI. CONCLUSIONS

In this study, we present LfDT, a novel IL framework for dual-arm manipulation, which collects demonstrations of one human and one robotic arm and converts these demonstrations into demonstrations of two robotic arms using a domain-translation framework. LfDT outperforms the ablated baselines in the simulated experiments. Furthermore, we demonstrate that LfDT works in real-robot experiments. Future work should focus on methods to efficiently train a dynamics model and an inverse dynamics model for complex dual-arm manipulation.

ACKNOWLEDGMENT

We would like to thank the Motoi Laboratory at Kobe University for allowing us to use their facilities for conducting the experiments.

REFERENCES

[1] H. Ha and S. Song, "Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding," in Conference on Robot Learning, 2022, pp. 24–33.
[2] M. Bollini, S. Tellex, T. Thompson, N. Roy, and D. Rus, "Interpreting and executing recipes with a cooking robot," in Experimental Robotics. Springer, 2013, pp. 481–495.
[3] C. Smith, Y. Karayiannidis, L. Nalpantidis, X. Gratal, P. Qi, D. V. Dimarogonas, and D. Kragic, "Dual arm manipulation—A survey," Robotics and Autonomous Systems, vol. 60, no. 10, pp. 1340–1353, 2012.
[4] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel, "Deep imitation learning for complex manipulation tasks from virtual reality teleoperation," in IEEE International Conference on Robotics and Automation, 2018, pp. 5628–5635.
[5] Z. Zhu and H. Hu, "Robot learning from demonstration in robotic assembly: A survey," Robotics, vol. 7, no. 2, 2018.
[6] R. Caccavale, M. Saveriano, G. A. Fontanelli, F. Ficuciello, D. Lee, and A. Finzi, "Imitation learning and attentional supervision of dual-arm structured tasks," in Joint IEEE International Conference on Development and Learning and Epigenetic Robotics. IEEE, 2017, pp. 66–71.
[7] B. Akgün, M. Cakmak, K. Jiang, and A. Thomaz, "Keyframe-based learning from demonstration," International Journal of Social Robotics, vol. 4, pp. 343–355, 2012.
[8] G. Ye and R. Alterovitz, "Guided motion planning," Robotics Research, pp. 291–307, 2017.
[9] E. Gribovskaya and A. Billard, "Combining dynamical systems control and programming by demonstration for teaching discrete bimanual coordination tasks to a humanoid robot," in ACM/IEEE International Conference on Human Robot Interaction, 2008, pp. 33–40.
[10] G. Franzese, L. de Souza Rosa, T. Verburg, L. Peternel, and J. Kober, "Interactive imitation learning of bimanual movement primitives," IEEE/ASME Transactions on Mechatronics, 2023.
[11] A. Tung, J. Wong, A. Mandlekar, R. Martín-Martín, Y. Zhu, L. Fei-Fei, and S. Savarese, "Learning multi-arm manipulation through collaborative teleoperation," in IEEE International Conference on Robotics and Automation, 2021.
[12] R. Chitnis, S. Tulsiani, S. Gupta, and A. Gupta, "Intrinsic motivation for encouraging synergistic behavior," in International Conference on Learning Representations, 2020.
[13] R. Chitnis, S. Tulsiani, S. Gupta, and A. K. Gupta, "Efficient bimanual manipulation using learned task schemas," IEEE International Conference on Robotics and Automation, pp. 1149–1155, 2019.
[14] F. Amadio, A. Colomé, and C. Torras, "Exploiting symmetries in reinforcement learning of bimanual robotic tasks," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1838–1845, 2019.
[15] S. Schaal, "Is imitation learning the route to humanoid robots?" Trends in Cognitive Sciences, vol. 3, no. 6, pp. 233–242, 1999.
[16] A. Billard, S. Calinon, R. Dillmann, and S. Schaal, "Survey: Robot programming by demonstration," Springer Handbook of Robotics, pp. 1371–1394, 2008.
[17] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[18] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in National Conference on Artificial Intelligence, vol. 3, 2008, pp. 1433–1438.
[19] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in International Conference on Machine Learning, 2004.
[20] J. Ho and S. Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems, vol. 29, 2016.
[21] Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, and N. Heess, "Reinforcement and imitation learning for diverse visuomotor skills," arXiv preprint arXiv:1802.09564, 2018.
[22] L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei, "SURREAL: Open-source reinforcement learning framework and robot manipulation benchmark," in Conference on Robot Learning, 2018.
[23] M. Laghi, M. Maimeri, M. Marchand, C. Leparoux, M. Catalano, A. Ajoudani, and A. Bicchi, "Shared-autonomy control for intuitive bimanual tele-manipulation," in IEEE-RAS International Conference on Humanoid Robots, 2018, pp. 1–9.
[24] H. Lee, J. Kim, and T. Kim, "A robot teaching framework for a redundant dual arm manipulator with teleoperation from exoskeleton motion data," in IEEE-RAS International Conference on Humanoid Robots, 2014, pp. 1057–1062.
[25] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," Journal of Machine Learning Research, vol. 10, no. 56, pp. 1633–1685, 2009.
[26] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor, "Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment," in AAAI Conference on Artificial Intelligence, 2015, pp. 2504–2510.
[27] A. Gupta, C. Devin, P. Abbeel, and S. Levine, "Learning invariant feature spaces to transfer skills with reinforcement learning," in International Conference on Learning Representations, 2017.
[28] Y. Liu, A. Gupta, P. Abbeel, and S. Levine, "Imitation from observation: Learning to imitate behaviors from raw video via context translation," in IEEE International Conference on Robotics and Automation, 2018, pp. 1118–1125.
[29] P. Sharma, D. Pathak, and A. Gupta, "Third-person visual imitation learning via decoupled hierarchical controller," Advances in Neural Information Processing Systems, vol. 32, 2019.
[30] D. S. Raychaudhuri, S. Paul, J. Vanbaar, and A. K. Roy-Chowdhury, "Cross-domain imitation from observations," in International Conference on Machine Learning, 2021, pp. 8902–8912.
[31] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in IEEE International Conference on Computer Vision, 2017.
[32] Q. Zhang, T. Xiao, A. A. Efros, L. Pinto, and X. Wang, "Learning cross-domain correspondence for control with dynamics cycle-consistency," in International Conference on Learning Representations, 2021.
[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, vol. 27, 2014.
[34] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-GAN: Unsupervised video retargeting," in European Conference on Computer Vision, 2018.
[35] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, "CyCADA: Cycle-consistent adversarial domain adaptation," in International Conference on Machine Learning, 2018, pp. 1989–1998.
[36] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033.
[37] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic
actor,” in International Conference on Machine Learning, 2018, pp.
1861–1870.
[38] E. Olson, “Apriltag: A robust and flexible visual fiducial system,” in
IEEE International Conference on Robotics and Automation, 2011,
pp. 3400–3407.
[39] T.-C. Çallar and S. Böttger, “Hybrid learning of time-series inverse
dynamics models for locally isotropic robot motion,” IEEE Robotics
and Automation Letters, vol. 8, no. 2, pp. 1061–1068, 2022.
[40] J. Fu, S. Levine, and P. Abbeel, “One-shot learning of manipulation
skills with online dynamics adaptation and neural network priors,” in
IEEE/RSJ International Conference on Intelligent Robots and Systems.
IEEE, 2016, pp. 4019–4026.
[41] K. Tanaka, R. Yonetani, M. Hamaya, R. Lee, F. von Drigalski, and
Y. Ijiri, “TRANS-AM: Transfer learning by aggregating dynamics
models for soft robotic assembly,” in IEEE International Conference
on Robotics and Automation. IEEE, 2021, pp. 4627–4633.
[42] K. Morse, N. Das, Y. Lin, A. S. Wang, A. Rai, and F. Meier, “Learning
state-dependent losses for inverse dynamics learning,” in IEEE/RSJ
International Conference on Intelligent Robots and Systems. IEEE,
2020, pp. 5261–5268.
[43] K. Hitzler, F. Meier, S. Schaal, and T. Asfour, “Learning and adaptation
of inverse dynamics models: A comparison,” in IEEE-RAS Interna-
tional Conference on Humanoid Robots. IEEE, 2019, pp. 491–498.
