
Article

Efficient Force Control Learning System for Industrial Robots Based on Variable Impedance Control

Chao Li, Zhi Zhang *, Guihua Xia, Xinru Xie and Qidan Zhu

College of Automation, Harbin Engineering University, Harbin 150001, China; li_chao@hrbeu.edu.cn (C.L.);

xiaguihua@hrbeu.edu.cn (G.X.); xiexinru@hrbeu.edu.cn (X.X.); zhuqidan@hrbeu.edu.cn (Q.Z.)

* Correspondence: zhangzhi1981@hrbeu.edu.cn; Tel.: +86-139-4614-2053

Received: 4 June 2018; Accepted: 26 July 2018; Published: 3 August 2018

Abstract: Learning variable impedance control is a powerful method for improving the performance of force control. However, current methods typically require too many interactions to achieve good performance, and this data inefficiency has kept them from learning force-sensitive tasks on real systems. In order to improve the sampling efficiency and decrease the required interactions during the learning process, this paper develops a data-efficient learning variable impedance control method that enables industrial robots to automatically learn to control the contact force in an unstructured environment. To this end, a Gaussian process model is learned as a faithful proxy of the system, which is then used to predict the long-term state evolution for internal simulation, allowing for efficient strategy updates. The effects of model bias are reduced effectively by incorporating model uncertainty into long-term planning. The impedance profiles are then regulated online according to the learned humanlike impedance strategy, which enhances the flexibility and adaptivity of the system. Both simulated and experimental tests have been performed on an industrial manipulator to verify the performance of the proposed method.

Keywords: force control; variable impedance control; efficient learning; Gaussian processes; industrial robot

1. Introduction

With the development of modern robotics, compliance control is becoming an important component for industrial robots. Control of the contact force is crucial for successfully executing operational tasks that involve physical contact, such as grinding, deburring, or assembly. In a structured environment, good performance can be achieved using classical force control methods [1].

However, it is difficult to control the contact force effectively in the unstructured environment.

Impedance control [2] provides a suitable control architecture for robots in both unconstrained and

constrained motions by establishing a suitable mass-spring-damper system.

Neuroscience studies have demonstrated how humans perform specific tasks by adapting muscle stiffness [3]. Kieboom [4] studied the impedance regulation rule for bipedal locomotion and found that variable impedance control can improve gait quality and reduce energy expenditure. The ability to change impedance in a task-dependent way is one important aspect of biomechanical systems that leads to their good performance. Recently, many researchers have explored the benefits of varying the impedance during the task for robotics [5–9]. The basic idea is to adjust the impedance parameters according to the force feedback. Humanlike adaptivity was achieved in [9] by adapting force and impedance, providing an intuitive solution for human–robot interactions. To ensure safe interaction, Calinon [6] proposed a learning-based control strategy with variable stiffness to reproduce the skill characteristics. Kronander [10] demonstrated the stability of variable impedance control for the control system. Variable impedance not only enables control of the dynamic relationship between


contact forces and robot movements, but also enhances the flexibility of the control system. Generally,

the methods of specifying the varying impedance can be classified into three categories.

(1) Optimal control. The basic idea is to dynamically adjust the gains from the feedback information using optimization techniques. These methods are usually robust to uncertain systems. Medina [11] proposed a variable compliance control approach based on risk-sensitive optimal feedback control. This approach has the benefit of high adaptability to uncertain and variable environments. The joint torque and the joint stiffness are independently and optimally modulated using the optimal variable stiffness control in [12]. An adaptive variable impedance control is proposed in [13]; it stabilizes the system by adjusting the gains online according to the force feedback. To guarantee stable execution of variable impedance tasks, a tank-based approach to passive varying stiffness is proposed in [14].

(2) Imitation of human impedance. Humans have a remarkable ability to complete a variety of interaction tasks in various environments by adapting their biomechanical impedance characteristics. These abilities are developed over years of experience and stored in the central nervous system [15]. For the purpose of imitating the human impedance modulation manner, several methods have been proposed [6,8]. Toshi [16] discussed the impedance regulation law of the human hand during dynamic-contact tasks and proposed a bio-mimetic impedance control for robots. Lee [17] designed a variable stiffness control scheme imitating human control characteristics; it achieves force tracking by adjusting the target stiffness without estimating the environment stiffness. Yang [18] introduced a coupling interface to naturally transfer human impedance adaptive skills to the robot by demonstration. Kronander [19] and Li [20] addressed the problem of compliance adjustment in robot learning from demonstrations (RLfD), in which a robot learns to adapt the stiffness based on human–robot interaction. Yang [21] proposed a framework for learning and generalizing humanlike variable impedance skills, combining the merits of electromyography (EMG) and the dynamic movement primitives (DMP) model. To enhance the control performance, the problem of transferring human impedance behaviors to the robot has been studied in [22–24]. These methods usually use an EMG device to collect muscle activity information, from which the variation of human limb impedance can be estimated.

(3) Reinforcement learning (RL). Reinforcement learning constitutes a significant aspect of the artificial intelligence field, with numerous applications ranging from medicine to robotics [25,26]. Researchers have recently focused on learning an appropriate modulation strategy by means of RL to adjust the impedance characteristics of robots [7,27–29]. Du [30] proposed a variable admittance control method based on fuzzy RL for human–robot interaction. It improves the positioning accuracy and reduces the required energy by dynamically regulating the virtual damping. Li [31] proposed a BLF-based adaptive impedance control framework for a human–robot cooperation task. The impedance parameters were learned using integral RL to obtain adaptive robot behaviors.

In summary, to derive an effective variable impedance controller, the first and second categories of methods usually require advanced engineering knowledge about the robot and the task, as well as manual design of the parameters. Learning variable impedance control based on RL is a promising and powerful method, which can obtain a proper task-specific control strategy automatically through trial and error. RL algorithms can be broadly classified into two types: model-free and model-based. In model-free RL, a policy is found without building a model of the dynamics, and the policy parameters can be searched directly. However, each sampled trajectory requires an interaction with the robot, which is time-consuming and expensive. In model-based RL, the algorithm explicitly builds a transition model of the system, which is then used for internal simulations and predictions. The (locally) optimal policy is improved based on the evaluations of these internal simulations. Model-based RL is more data-efficient than model-free RL, but more computationally intensive [32,33]. In order to extend to high-dimensional tasks conveniently and avoid the model bias of model-based RL algorithms, existing learning variable impedance control methods are most commonly based on model-free RL. Buchli [5] proposed a novel variable impedance control based on PI2, a model-free RL algorithm. It realized simultaneous regulation of


motion trajectory and impedance gains using DMPs. This algorithm has been successfully extended to high-dimensional robotic tasks such as opening a door and picking up pens [34], a box-flipping task [35], and a sliding-switch task for a tendon-driven hand [36]. Stulp [29] further studied the applicability of PI2 in stochastic force fields, where it was able to find motor policies that qualitatively replicate human movement. Considering the coupling between degrees of freedom (DoFs), Winter [37] developed a C-PI2 algorithm based on PI2, whose learning speed was much higher than that of previous algorithms.

However, these methods usually require hundreds or thousands of rollouts to achieve satisfactory performance, which is unacceptable for a force control system, and none of these works address the issue of sampling efficiency. Improving data-efficiency is critical to learning to perform force-sensitive tasks, such as operational tasks on fragile components, because too many physical interactions with the environment during the learning process are usually infeasible. Alternatively, model-based RL is a promising way to improve the sampling efficiency, since fast convergence towards an optimal strategy can be achieved. Shadmehr [38] demonstrated that humans learn an internal model of the force field and compensate for external perturbations. Franklin [39] presented a model which combines three principles to learn stable, accurate, and efficient movements; it is able to accurately model empirical data gathered in force field experiments. Koropouli [27] investigated the generalization problem of force control policies. The force-motion mapping policy was learned from a set of demonstrated data to endow robots with a certain human-like adroitness. Mitrovic [28] proposed to learn both the dynamics and the noise properties through supervised learning, using locally weighted projection regression. This model was then used in a model-based stochastic optimal controller to control a one-dimensional antagonistic actuator. It improved the accuracy significantly, but the analytical dynamic model still had accuracy limitations. These methods did not solve the key problem of model bias well.

Therefore, in order to improve the sampling efficiency, we propose a data-efficient learning variable impedance control method based on model-based RL that enables industrial robots to automatically learn to control the contact force in an unstructured environment. A probabilistic Gaussian process (GP) model is learned as a faithful proxy of the transition dynamics of the system. The probabilistic model is then used for internal system simulation, improving data-efficiency by predicting the long-term state evolution. This method reduces the effects of model bias effectively by incorporating model uncertainty into long-term planning. The impedance profiles are regulated automatically according to the (sub)optimal impedance control strategy learned by the model-based RL algorithm to track the desired contact force. Additionally, we present a way of taking the penalty of control actions into account during planning to achieve the desired impedance characteristics. The performance of the system is verified through simulations and experiments on a six-DoF industrial manipulator. This system outperforms other learning variable impedance control methods by at least one order of magnitude in terms of learning speed.

The main contributions of this paper can be summarized as follows: (1) A data-efficient learning variable impedance control method is proposed to improve the sampling efficiency, which significantly reduces the physical interactions with the environment required during the force control learning process. (2) The proposed method learns an impedance regulation strategy, based on which the impedance profiles are regulated online in real time to track the desired contact force. In this way, the flexibility and adaptivity of compliance control are enhanced. (3) The impedance strategy with humanlike impedance characteristics is learned automatically through continuous exploration. There is no need to use additional sampling devices, such as EMG electrodes, to transfer human skills to the robot through demonstrations.

When the robot interacts with a rigid environment, the robot can be represented by a second-order mass-spring-damper system, and the environment can be modeled as a spring-damper model with stiffness ke and damping be. The interaction model of the system is illustrated in Figure 1.

Figure 1d shows a contact force diagram when the robot comes in contact with the environment. m, b, and k denote the mass, damping, and stiffness of the robot end-effector, respectively. Let f be the contact force applied by the robot to the environment once a contact between both is established.

Figure 1. The interaction model of the system. (a) Without any contact between the robot and the environment; (b) critical point when contact occurs; (c) stable contact with the environment; (d) contact force diagram when the robot comes in contact with the environment.

The contact process between the robot and the environment can be divided into three phases. In the first phase, the robot is approaching the environment (Figure 1a). There is no contact during this phase, and the contact force is zero (Figure 1d, t0 − t1). In the second phase, the robot comes into contact with the environment (Figure 1b). During this phase, the contact force increases rapidly (Figure 1d, t1 − t2). Collision is inevitable, and the collision is transient and strongly nonlinear. In the third phase, the robot contacts the environment continuously (Figure 1c) and the contact force is stabilized to the desired value (Figure 1d, t2 − t3).

High values of contact force are generally undesirable since they may stress both the robot and the manipulated object. Therefore, the overshoot and the oscillation caused by the inevitable collision should be suppressed effectively.

2.2. Contact Force Observer Based on Kalman Filter

The contact force is the quantity describing the state of interaction in the most complete fashion. To this end, the availability of force measurements is expected to provide enhanced performance for controlling interaction [1]. Recently, several methods have been proposed to make force control possible without dedicated sensors, such as virtual force sensors [40] and contact force estimation methods based on motor signals [41,42]. However, to realize precise force control, industrial robots are typically equipped with F/T sensors at the wrist to measure the contact force. The measurements of force sensors do not correspond to the actual environmental interaction forces, as they usually contain inertial force and gravity. Moreover, the raw signals sampled from the force sensor may be corrupted by noise, especially in industrial environments with large equipment, where electromagnetic noise, vibration noise, electrostatic effects, and thermal noise are very strong and complex. These disturbances seriously affect the measurement of the F/T sensor, which degrades the quality of force control. Hence, a contact force observer based on the Kalman filter is designed to estimate the actual contact force and moment applied at the contact point.

Figure 2 illustrates the contact force and moment applied at the end-effector. The center of mass of the end-effector is located at C. E is the contact point between the end-effector and the environment. The center of mass of the F/T sensor is S. As shown in Figure 2, the corresponding coordinate frames with the principal axes are denoted by ΣC, ΣE, and ΣS, respectively. The world frame is denoted by ΣW.


Figure 2. Contact force and moment applied at the end-effector.

Assuming that the robot moves slowly during the tasks, the inertial effect is negligible. The contact force model can be built using the Newton–Euler equations [43]

[ R_E^C, 0; S(r_CE^C) R_E^C, R_E^C ] [ F_E^E; M_E^E ] = [ R_S^C, 0; S(r_CS^C) R_S^C, R_S^C ] [ F_S^S; M_S^S ] − [ m R_W^C g^W; 0 ], (1)

where the matrix R_j^i is the rotation matrix of frame Σj with respect to frame Σi. F_E^E and M_E^E are the actual contact forces and moments applied at the end-effector. F_S^S and M_S^S are the forces and moments measured by the F/T sensor. m is the mass of the end-effector. g^W is the gravitational acceleration. r_CE^C denotes the vector from C to E with respect to frame ΣC. r_CS^C is the vector from C to S with respect to frame ΣC. The skew-symmetric operator applied to a vector b is denoted as S(b). Define the state vector x and the measurement vector y as

x = [ F_S^S M_S^S Ḟ_S^S Ṁ_S^S ]^T, (2)

y = [ F_E^E M_E^E ]^T, (3)

According to (1), the system model and the measurement model can be established:

ẋ(t) = A0 x(t) + ωx,
y(t) = H0 x(t) + D0 g^W + vy, (4)

where A0 = [ 0_6×6, I_6×6; 0_6×6, 0_6×6 ], H0 = [ R_S^E, 0_3×3, 0_3×3, 0_3×3; R_C^E [S(r_CS^C) − S(r_CE^C)] R_S^C, R_S^E, 0_3×3, 0_3×3 ], and D0 = m [ −R_W^E; R_C^E S(r_CE^C) R_W^E ]. ωx and vy are the process noise and measurement noise. They are assumed to be independent of each other, with normal probability distributions p(ωx) ∼ N(0, Q) and p(vy) ∼ N(0, R). Q and R are the corresponding covariance matrices, which are assumed to be diagonal with constant diagonal elements. The model (4) can be discretized, resulting in the discrete-time linear system

x_k = A x_{k−1} + w_{k−1},
y_k = H x_k + D g^W + v_k, (5)

where A = [ I_6×6, I_6×6; 0_6×6, I_6×6 ], H = H0, and D = D0. The Kalman filter update consists of two steps:

1. Predict the state estimate x̂⁻_{k|k−1} and the error covariance P⁻_{k|k−1} at time step k based on the results from the previous time step k − 1:

x̂⁻_{k|k−1} = A x̂_{k−1},
P⁻_{k|k−1} = A P_{k−1} Aᵀ + Q. (6)


2. Correct the predictions x̂_k based on the measurements y_k:

x̂_k = x̂⁻_{k|k−1} + K_k (y_k − H x̂⁻_{k|k−1}),
K_k = P⁻_{k|k−1} Hᵀ (H P⁻_{k|k−1} Hᵀ + R)⁻¹, (7)
P_k = (I − K_k H) P⁻_{k|k−1},

where K_k is the Kalman gain and P_k is the updated error covariance. In practice, the covariance matrices Q and R can be calibrated automatically based on offline experimental data [41]. It should be mentioned that the larger the weights of Q are chosen, the more the observer will rely on the measurements. Larger diagonal elements in Q result in a faster response time of the corresponding estimates, but also in increased noise amplification. An implementation example of the contact force observer is shown in Figure 3.
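To make the update cycle concrete, the following is a minimal sketch of Equations (6) and (7) in Python with NumPy. It is an illustration rather than the implementation used in the paper; the matrices A, H, Q, and R are assumed to be constructed as described above, and the known gravity term D g^W is passed in as a precomputed offset.

import numpy as np

def kalman_predict(x_est, P, A, Q):
    # Equation (6): propagate the state estimate and the error covariance.
    x_pred = A @ x_est
    P_pred = A @ P @ A.T + Q
    return x_pred, P_pred

def kalman_correct(x_pred, P_pred, y, H, R, Dg):
    # Equation (7): correct the prediction with the measured wrench y.
    # Dg stands for the known gravity offset D @ g_W from the measurement model (5).
    innovation = y - (H @ x_pred + Dg)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain K_k
    x_est = x_pred + K @ innovation
    P = (np.eye(len(x_est)) - K @ H) @ P_pred      # updated error covariance P_k
    return x_est, P

Running the predict and correct steps once per sampling period yields the filtered estimate of the measured wrench, from which the actual contact force at E is recovered through the measurement model.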

Figure 3. An implementation example of the contact force observer.

3. Position-Based Impedance Control for Force Control

One possible approach to achieving compliant behavior for robots is classical impedance control [2], which is sometimes also called force-based impedance control. In typical implementations, the controller takes positions as inputs and gives motor torques as outputs. Force-based impedance control has been widely studied [44]. However, most commercial industrial robots emphasize the accuracy of trajectory following and do not provide joint torque or motor current interfaces for users. Therefore, force-based impedance control is impossible on these robots. Alternatively, another approach, which is typically suited for industrial robots, is the concept of admittance control [45], sometimes also called position-based impedance control. It maps from generalized forces to generalized positions. This control structure consists of an inner position control loop and an outer indirect force control loop. In constrained motion, the contact force measured by the F/T sensor modifies the desired trajectory in the outer impedance controller loop, resulting in the compliant desired trajectory, which is then tracked by the inner position controller. The position-based impedance control schematic for force control is shown in Figure 4. Xd, Xc, and Xm denote the reference position trajectory, the commanded position trajectory sent to the robot, and the measured position trajectory, respectively. Assuming good tracking performance of the inner position controller for slow motions, the commanded trajectory is equal to the measured trajectory, i.e., Xm = Xc.

Figure 4. The position-based impedance control schematic.

The target impedance model is defined as

Md(Ẍc − Ẍd) + Bd(Ẋc − Ẋd) + Kd(Xc − Xd) = Fe − Fd, (8)

where Md , Bd , and Kd are, respectively, the target inertial, damping, and stiffness matrices. Fd is the

desired contact force. Fe is the actual contact force. The transfer-function of the impedance model is

H(s) = E(s)/∆F(s) = 1/(Md s² + Bd s + Kd), (9)

where ∆F = Fe − Fd denotes the force tracking error. E = Xc − Xd is the desired position increment,

and it is used to modify the reference position trajectory Xd to produce the commanded trajectory

Xc = Xd + E, which is then tracked by the servo driver system.

To compute the desired position increment, discretize (9) using bilinear transformation

H(z) = H(s)|_{s = (2/T)(z − 1)/(z + 1)} = T²(z + 1)² / (ω1 z² + ω2 z + ω3), (10)

ω1 = 4Md + 2Bd T + Kd T²,
ω2 = −8Md + 2Kd T², (11)
ω3 = 4Md − 2Bd T + Kd T².

Here, T is the control cycle. The desired position increment at time n is derived as

E(n) = ω1⁻¹ { T²[∆F(n) + 2∆F(n − 1) + ∆F(n − 2)] − ω2 δX(n − 1) − ω3 δX(n − 2) }. (12)
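As an illustration, Equations (11) and (12) translate directly into a small recursive filter. The sketch below assumes scalar (per-axis) impedance parameters, and reads δX(n − 1), δX(n − 2) as the previously computed increments E(n − 1), E(n − 2), which is our interpretation of the recursion:

class AdmittanceFilter:
    # Maps the force tracking error dF(n) to the position increment E(n), Eqs. (10)-(12).
    def __init__(self, Md, Bd, Kd, T):
        self.T = T
        self.w1 = 4*Md + 2*Bd*T + Kd*T**2    # Equation (11)
        self.w2 = -8*Md + 2*Kd*T**2
        self.w3 = 4*Md - 2*Bd*T + Kd*T**2
        self.dF_hist = [0.0, 0.0]            # dF(n-1), dF(n-2)
        self.E_hist = [0.0, 0.0]             # E(n-1),  E(n-2)

    def step(self, dF_n):
        # Equation (12): desired position increment at time n.
        E_n = (self.T**2 * (dF_n + 2*self.dF_hist[0] + self.dF_hist[1])
               - self.w2 * self.E_hist[0] - self.w3 * self.E_hist[1]) / self.w1
        self.dF_hist = [dF_n, self.dF_hist[0]]
        self.E_hist = [E_n, self.E_hist[0]]
        return E_n

For variable impedance control, ω1, ω2, and ω3 must be recomputed whenever Kd and Bd are updated by the learned strategy.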

Xe is the location of the environment. The contact force between the robot and the environment is then

simplified as

Fe = Be(Ẋc − Ẋe) + Ke(Xc − Xe). (13)

Replacing Xd in (8) with the initial environment location Xe , and substituting (13) into (8), the new

impedance equation is then converted to

Md(Ẍc − Ẍe) + (Bd + Be)(Ẋc − Ẋe) + (Kd + Ke)(Xc − Xe) = Fd. (14)

It is difficult to obtain accurate environment information; therefore, if the environment stiffness Ke and damping Be change, the dynamic characteristics of the system will change consequently. To guarantee the dynamic performance, the impedance parameters [Md, Bd, Kd] should be adjusted correspondingly.


Generally, too many physical interactions with the environment are infeasible for learning to

execute force-sensitive tasks. In order to reduce the required interactions with the environment during

force control learning process, a learning variable impedance control approach with model-based RL

algorithm is proposed in the following to learn the impedance regulation strategy.

The scheme of the method is illustrated in Figure 5. Fd is the desired value of contact force,

F is the actual contact force estimated by the force observer, and ∆F is the force tracking error.

The desired position increment of the end-effector E is calculated using the variable impedance

controller. Xd is the desired reference trajectory. The desired joint positions qd are calculated by adopting inverse kinematics according to the commanded trajectory Xc. The actual Cartesian position


of the end-effector X can be obtained from the measured joint positions q by means of forward kinematics. The joints are controlled by the joint motion controller of the industrial robot. KE and BE denote the unknown stiffness and damping of the environment, respectively.

The transition dynamics of the system is approximated by the GP model, which is trained using the collected data. Then the learning algorithm is used to learn the (sub)optimal impedance control strategy π while predicting the system evolution using the GP model. Instead of the motion trajectory, the proposed method learns an impedance regulation strategy, based on which the impedance profiles are regulated online in a real-time manner to control the contact force. In this way, the dynamic relationship between contact force and robot movement can be controlled in a continuous manner. Moreover, the flexibility and adaptivity of compliance control can be enhanced. Positive definite impedance parameters can ensure the asymptotic stability of the original desired dynamics, but it is not recommended to modify the target inertia matrix because doing so can easily destabilize the system [45]. To simplify the calculation, the target inertial matrix is chosen as Md = I. Consequently, the target stiffness Kd and the damping Bd are the parameters that should be tuned in variable impedance control. Based on the learned impedance strategy, the impedance parameters u = [Kd Bd] are calculated according to the states of the contact force and the position of the end-effector, and are then transferred to the variable impedance controller.

Figure 5. Scheme of the data-efficient learning variable impedance control.

The learning process of the variable impedance strategy consists of seven main steps:

1. Initializing the strategy parameters stochastically, applying the random impedance parameters to the system, and recording the sampled data.
2. Training the system transition dynamics, i.e., the GP model, using all historical data.
3. Inferring and predicting the long-term evolution of the states according to the GP model.
4. Evaluating the total expected cost Jπ(θ) in T steps, and calculating the gradients of the cost dJπ(θ)/dθ with respect to the strategy parameters θ.
5. Learning the optimal strategy π∗ ← π(θ) using the gradient-based policy search algorithm.
6. Applying the impedance strategy to the system, then executing a force tracking task using the learned variable impedance strategy and recording the sampled data simultaneously.
7. Repeating steps (2)–(6) until the performance of force control is satisfactory (a minimal sketch of this loop is shown below).
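For concreteness, the seven steps can be arranged as the following high-level loop. This is a sketch only: rollout, train_gp_model, and optimize_policy are placeholders for executing an episode on the robot, fitting the GP transition model, and the gradient-based policy search of Section 4.5, respectively.

def learn_variable_impedance(rollout, train_gp_model, optimize_policy,
                             theta0, n_iterations=20):
    # Step 1: random initial strategy, one trial on the system, record the data.
    theta = theta0
    data = [rollout(theta)]
    for _ in range(n_iterations):
        model = train_gp_model(data)             # step 2: (re)train the GP dynamics
        theta = optimize_policy(model, theta)    # steps 3-5: predict, evaluate, improve
        data.append(rollout(theta))              # step 6: apply strategy, record data
    return theta                                 # step 7: stop when performance suffices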

The impedance control strategy is defined as π : x ↦ u = π(x, θ), where the inputs of the strategy are the observed states of the robot x = [X F] ∈ R^D, the outputs of the strategy are the target stiffness Kd and damping Bd, which can be written as the matrix u = [Kd Bd] ∈ R^F, and θ are the strategy parameters to be learned. Here, the GP controller is chosen as the control strategy

πt = π(xt, θ) = Σ_{i=1}^{n} βπ,i k(xπi, xt) = βπᵀ K(Xπ, xt), (15)

βπ = (Kπ(Xπ, Xπ) + σ²ε,π I)⁻¹ yπ, (16)

k(xπi, xt) = σ²f,π exp(−(1/2)(xπi − xt)ᵀ Λ⁻¹ (xπi − xt)), (17)

where xt is the test input. Xπ = [ xπ1 , . . . , xπn ] are the training inputs, and they are the centers

of the Gaussian basis functions. n is the number of the basis functions. yπ is the training targets,

which are initialized to values close to zero. K is the covariance matrix with entries Kij = k(xi, xj).

Λ = diag(l1², . . . , lD²) is the length-scale matrix, where li is the characteristic length-scale of each input dimension, σ²f,π is the signal variance, which is fixed to one here, and σ²ε,π is the measurement noise variance. θ = [Xπ, yπ, l1, . . . , lD, σ²f,π, σ²ε,π] are the hyper-parameters of the controller. Using the GP controller, more advanced nonlinear tasks can be performed thanks to its flexibility and smoothing effect. Obviously, the GP controller is functionally equivalent to a regularized RBF network if σ²f,π = 1 and σ²ε,π ≠ 0. The impedance parameters are calculated in real time according to the impedance strategy π and the states xt. The relationship between the impedance parameters and the control strategy can be written as

[Kd Bd] = u = π(xt, θ) = βπᵀ K(Xπ, xt). (18)
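Evaluating the controller of Equations (15)–(18) amounts to a weighted sum of Gaussian basis functions. The following sketch assumes the hyper-parameters in θ are stored as plain arrays and that βπ has been precomputed from Equation (16):

import numpy as np

def gp_controller(x_t, X_pi, beta_pi, lengthscales, sigma_f2=1.0):
    # X_pi: (n, D) centers of the basis functions; beta_pi: (n, F) weights from Eq. (16);
    # lengthscales: (D,) characteristic length-scales l_1, ..., l_D.
    diff = (X_pi - x_t) / lengthscales                       # scaled differences
    k = sigma_f2 * np.exp(-0.5 * np.sum(diff**2, axis=1))    # Eq. (17)
    return k @ beta_pi                                       # u = [Kd Bd], Eq. (18)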

In practical systems, the physical limits of the impedance parameters should be considered.

The preliminary strategy π should be squashed coherently through a bounded and differentiable

saturation function. The saturation function has to be on a finite interval, such that a maximum and

minimum are obtained for finite function inputs. Furthermore, the function should be monotonically

increasing. The derivative and second derivative of the saturation function have to be zero at the

boundary points to require stationary points at these boundaries. Specifically, consider the third-order

Fourier series expansion of a trapezoidal wave κ ( x ) = [9 sin( x ) + sin(3x )]/8, which is normalized to

the interval [−1, 1]. Given the boundary conditions, the saturation function is defined as

S(πt) = umin + umax + umax [9 sin(πt) + sin(3πt)]/8. (19)

If the function is considered on the domain [3π/2, 2π], it is monotonically increasing, and the control signal u is squashed into the interval [umin, umin + umax].
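In code, the squashing of Equation (19) is a one-liner; for preliminary strategy outputs restricted to the domain [3π/2, 2π], the result lies in [umin, umin + umax] as stated above (a sketch, not the authors' implementation):

import numpy as np

def squash(pi_t, u_min, u_max):
    # Equation (19): third-order Fourier expansion of a trapezoidal wave.
    # kappa lies in [-1, 0] on [3*pi/2, 2*pi], so the output lies in [u_min, u_min + u_max].
    kappa = (9.0 * np.sin(pi_t) + np.sin(3.0 * pi_t)) / 8.0
    return u_min + u_max + u_max * kappa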


Transition models have a large impact on the performance of model-based RL, since the learned strategy inherently relies on the quality of the learned forward model, which essentially serves as a simulator of the system. The transition models that have been employed for model-based RL can be classified into two main categories [25]: deterministic models and stochastic models. Despite the intensive computation, the state-of-the-art approach for learning transition models is the GP model [46], because it is capable of modeling a wide range of nonlinear systems while explicitly incorporating model uncertainty into long-term planning, which is a key problem in model-based learning. In addition, the GP model shows good convergence properties, which are necessary for implementation of the algorithm. A GP model can be thought of as a probability distribution over possible functions, and it is completely specified by a mean function m(·) and a positive semi-definite covariance function k(·, ·), also called a kernel.

Here, we consider the unknown function that describes the system dynamics

xt = f(xt−1, ut−1),
yt = xt + εt, (20)

with transition dynamics f and i.i.d. system noise ε ∼ N(0, σ²ε). In order to take the model uncertainties

into account during prediction and planning, the proposed approach does not make a certainty

equivalence assumption on the learned model. Instead, it learns a probabilistic GP model and infers the

posterior distribution over plausible function f from the observations. For computation convenience,

we consider a prior mean m ≡ 0 and the squared exponential kernel

f(x) ∼ GP(m(x), k(x, x′)), (21)

k(x, x′) = α² exp(−(1/2)(x − x′)ᵀ Λ⁻¹ (x − x′)) + σ²ε I, (22)

where α² is the variance of the latent function f, and the weighting matrix Λ = diag(l1², . . . , lD²) depends on the characteristic length-scale li of each input dimension. Given N training inputs X = [x1, . . . , xN] and corresponding training targets y = [y1, . . . , yN]ᵀ, the GP hyper-parameters [Λ, α², σ²ε] can be learned using the evidence maximization algorithm [46].

Given a deterministic test input x∗ , the posterior prediction p( f ∗ | x∗ ) of the function value

f ∗ = f ( x∗ ) is Gaussian distributed

p(f∗ | x∗) ∼ N(µ∗, Σ∗), (23)

µ∗ = m(x∗) + k(x∗, X)(K + σ²ε I)⁻¹ (y − m(X)) = m(x∗) + k(x∗, X)β, (24)

Σ∗ = k(x∗, x∗) − k(x∗, X)(K + σ²ε I)⁻¹ k(X, x∗), (25)

where β = (K + σ²ε I)⁻¹ (y − m(X)), and K = k(X, X) is the kernel matrix.
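A minimal sketch of the posterior prediction of Equations (23)–(25) with a zero prior mean might look as follows; se_kernel is our stand-in for the squared exponential kernel of Equation (22) without the noise term:

import numpy as np

def se_kernel(A, B, alpha2=1.0, lengthscales=1.0):
    # Squared exponential kernel of Eq. (22); the noise term is handled separately.
    d = (A[:, None, :] - B[None, :, :]) / lengthscales
    return alpha2 * np.exp(-0.5 * np.sum(d**2, axis=-1))

def gp_predict(x_star, X, y, sigma_eps2):
    # Posterior mean (24) and variance (25) at a deterministic test input x_star.
    K = se_kernel(X, X) + sigma_eps2 * np.eye(len(X))
    k_star = se_kernel(x_star[None, :], X)[0]
    beta = np.linalg.solve(K, y)                       # (K + sigma^2 I)^{-1} y
    mean = k_star @ beta                               # Eq. (24) with m = 0
    var = se_kernel(x_star[None, :], x_star[None, :])[0, 0] \
          - k_star @ np.linalg.solve(K, k_star)        # Eq. (25)
    return mean, var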

In our force control learning system, the function of the GP model is defined as f : R^{D+F} → R^E, (xt−1, ut−1) ↦ ∆t, where x̂t−1 = (xt−1, ut−1) are the training input tuples. The state increments ∆t = xt − xt−1 + δt are taken as the training targets, where δt ∼ N(0, Σε) is i.i.d. measurement noise. Since the state differences vary less than the absolute values, the underlying function that describes these differences varies less, which implies that the learning process is easier and that less data is needed to find an accurate model. Moreover, when the predictions leave the training set, the prediction does not fall back to zero but remains constant.
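Building the GP training set from logged trajectories is then a one-step differencing operation; a sketch with states and controls logged row-wise per time step:

import numpy as np

def make_gp_training_set(states, controls):
    # Inputs are the tuples (x_{t-1}, u_{t-1}); targets are the increments
    # Delta_t = x_t - x_{t-1} (the noise delta_t is implicit in the logged data).
    X_train = np.hstack([states[:-1], controls[:-1]])   # shape (N-1, D+F)
    targets = states[1:] - states[:-1]                  # shape (N-1, E)
    return X_train, targets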


For the sake of reducing the required physical interactions with the robots while obtaining an effective control strategy, the effective utilization of sampled data must be increased. To this end, the learned probabilistic GP model is used as a faithful proxy of the system dynamics, which is then used for internal simulations and predictions about how the real system would behave. The (sub)optimal strategy is improved based on the evaluations of these internal virtual trials. Thus, the data-efficiency is improved.

To evaluate the strategy, the long-term predictions of the state p(x1), . . . , p(xT) should be computed iteratively from the initial state distribution p(x0) by cascading one-step predictions [47]. Since the GP model can map the Gaussian-distributed state space to the target space, the uncertainties of the inputs can pass through the model, and the uncertainties of the model are taken into account in the long-term planning. A conceptual illustration of long-term predictions of state evolution [48] is shown in Figure 6.

Figure 6. A conceptual illustration of long-term predictions of state evolution.

The one-step prediction of the states can be summarized as

p(xt−1) → p(ut−1) → p(xt−1, ut−1) → p(∆t) → p(xt). (26)

As ut−1 = π(xt−1) is a function of the state xt−1 and p(xt−1) is known, the calculation of p(xt) requires a joint distribution p(x̂t−1) = p(xt−1, ut−1). First, we calculate the predictive control signal p(ut−1) and subsequently the cross-covariance cov[xt−1, ut−1]. Then, p(xt−1, ut−1) is approximated by a Gaussian distribution [47]

p(x̂t−1) = p(xt−1, ut−1) = N(µ̂t−1, Σ̂t−1) = N( [ µ_{xt−1}; µ_{ut−1} ], [ Σ_{xt−1}, Σ_{xt−1,ut−1}; Σᵀ_{xt−1,ut−1}, Σ_{ut−1} ] ). (27)

The distribution of the training targets ∆t is predicted as

p(∆t) = ∫ p(f(x̂t−1) | x̂t−1) p(x̂t−1) dx̂t−1, (28)

where the posterior predictive distribution of the transition dynamics p(f(x̂t−1) | x̂t−1) can be calculated using Formulas (23)–(25). Using moment matching [49], p(∆t) can be approximated as a Gaussian distribution N(µ∆, Σ∆). Then, a Gaussian approximation to the desired state distribution p(xt) is given as

p(xt | µ̂t−1, Σ̂t−1) ∼ N(µt, Σt), (29)

µt = µt−1 + µ∆, (30)

Σt = Σt−1 + Σ∆ + cov[xt−1, ∆t] + cov[∆t, xt−1], (31)

cov[xt−1, ∆t] = cov[xt−1, ut−1] Σu⁻¹ cov[ut−1, ∆t]. (32)

expected costs ∗ = arg min ( ) . The search direction can be selected using the gradient

information. The total expected cost ( ) in steps is calculated according to the state evolution

Jπ (θ ) = t =0 [c( xt )], x0 ~ ( μ0 , Σ0 ),

T

(33)

Sensors 2018, 18, 2539 12 of 26

The goal of the learning algorithm is to find the strategy parameters that minimize the total

expected costs θ ∗ = arg minJ π (θ ). The search direction can be selected using the gradient information.

The total expected cost J π (θ ) in T steps is calculated according to the state evolution

T

J π (θ ) = ∑ E[c(xt )], x0 ∼ N ( µ0 , ∑ ), (33)

t =0 0

Z

E[ct ] = ct N ( xt |µt , ∑) dxt . (34)

t

where c( xt ) is the instantaneous cost at time t, and E[c( xt )] is the expected values of the instantaneous

cost with respect to the predictive state distributions.

The cost function in RL usually penalizes the Euclidean distance from the current state to the target

state, without considering other prior knowledge. However, in order to make robots with the ability of

compliance, the control gains should not be high for practical interaction tasks. Generally, high gains

will result in instability in stiff contact situations due to the inherent manipulator compliance, especially

for an admittance-type force controller. In addition, low gains lead to several desirable properties

of the system, such as compliant behavior (safety and/or robustness), lowered energy consumption,

and less wear and tear. This is similar to the impedance regulation rules of humans. Humans learn

a task-specific impedance regulation strategy that combines the advantages of high stiffness and

compliance. The general rule of thumb thus seems to be “be compliant when possible; stiffen up only

when the task requires it”. In other words, impedance increasing ensures tracking accuracy while

impedance decreasing ensures safety.

To endow the robots with these impedance characteristics, we present a way of taking the penalty of control actions into account during planning. The instantaneous cost function is defined as

ct = cb(xt) + ce(ut), (35)

cb(xt) = 1 − exp(−(1/(2σc²)) d(xt, xtarget)²) ∈ [0, 1], (36)

Here, cb(xt) is the cost caused by the state error, denoted by a quadratic binary saturating function, which saturates at unity for large deviations from the desired target state. d(·) is the Euclidean distance between the current state xt and the target state xtarget, and σc is the width of the cost function. ce(ut) is the cost caused by the control actions, i.e., a mean squared penalty of the impedance gains of the form

ce(ut) = (ζ/F) Σ_{i=1}^{F} (ut,i / umax,i)², (37)

where ζ is the action penalty coefficient, ut is the current control signal, and umax is the maximum control signal amplitude. Suitable (lower) impedance gains can be obtained by punishing the control actions in this way.

The gradients of J π (θ ) with respect to the strategy parameters θ are given by

dJπ(θ)/dθ = d(Σ_{t=0}^{T} E[c(xt)])/dθ = Σ_{t=0}^{T} dE[c(xt)]/dθ. (38)

The expected immediate cost E[c( xt )] requires averaging with respect to the state distribution

p( xt ) ∼ N (µt , Σt ), where µt and Σt are the mean and the covariance of p( xt ), respectively.

The derivative in Equation (38) can be written as

dE[c(xt)]/dθ = (dE[c(xt)]/dp(xt)) (dp(xt)/dθ) = (∂E[c(xt)]/∂µt)(dµt/dθ) + (∂E[c(xt)]/∂Σt)(dΣt/dθ). (39)

Given c(xt), the terms ∂E[c(xt)]/∂µt and ∂E[c(xt)]/∂Σt can be calculated analytically. We then focus on the calculation of dµt/dθ and dΣt/dθ. Due to the computation sequence


of (26), we know that the predicted mean µt and the covariance Σt are functionally dependent on p(xt−1) ∼ N(µt−1, Σt−1) and on the strategy parameters θ through ut−1. We thus obtain

dµt/dθ = (∂µt/∂p(xt−1))(dp(xt−1)/dθ) + ∂µt/∂θ = (∂µt/∂µt−1)(dµt−1/dθ) + (∂µt/∂Σt−1)(dΣt−1/dθ) + ∂µt/∂θ, (40)

dΣt/dθ = (∂Σt/∂p(xt−1))(dp(xt−1)/dθ) + ∂Σt/∂θ = (∂Σt/∂µt−1)(dµt−1/dθ) + (∂Σt/∂Σt−1)(dΣt−1/dθ) + ∂Σt/∂θ, (41)

∂µt/∂θ = (∂µ∆/∂p(ut−1))(∂p(ut−1)/∂θ) = (∂µ∆/∂µu)(∂µu/∂θ) + (∂µ∆/∂Σu)(∂Σu/∂θ), (42)

∂Σt/∂θ = (∂Σ∆/∂p(ut−1))(∂p(ut−1)/∂θ) = (∂Σ∆/∂µu)(∂µu/∂θ) + (∂Σ∆/∂Σu)(∂Σu/∂θ). (43)

By repeated application of the chain rule, Equations (39)–(43) can be computed analytically. We omit further lengthy details here and refer to [47] for more information. Then a non-convex gradient-based optimization algorithm, e.g., conjugate gradient, can be applied to find the strategy parameters θ∗ that minimize Jπ(θ).

5. Simulations and Experiments

To verify the proposed force control learning system based on variable impedance control, a series of simulation and experiment studies on the Reinovo REBot-V-6R-650 industrial robot (Shenzhen Reinovo Technology CO., LTD, Shenzhen, China) are conducted and presented in this section. The Reinovo REBot-V-6R-650 is a six-DoF industrial manipulator with a six-axis Bioforcen F/T sensor (Anhui Biofrcen Intelligent Technology CO., LTD, Hefei, China) mounted at the wrist. The F/T sensor is used to perceive the contact force of the end-effector. The sensing range of the F/T sensor is ±625 N (Fx, Fy), ±1250 N (Fz), ±25 Nm (Tx, Ty), and ±12.5 Nm (Tz), with a total accuracy of less than ±1% F.S.

We first evaluated our system through a simulation of force control using MATLAB (R2015b, Version 8.6) Simulink. The block diagram of the simulation is shown in Figure 7. In the simulation setup, a stiff contact plane is placed under the end-effector of the robot. The stiffness and damping of the plane are set to 5000 N/m and 1 Ns/m, respectively. The robot's base is located at [0, 0, 0] m, while the original position of the plane OP is located at [0.2, −0.5, 0.15] m. The plane's length, width, and height are 0.9, 1, and 0.2 m, respectively. The robot should automatically learn to control the contact force to the desired value.

Figure 7. The block diagram of simulation in MATLAB Simulink.


In the simulation, the episode length is set as T = 1 s. The control period of the impedance

controller is 0.01 s, and the calculation period of the learning algorithm is 0.01 s. The number of

total learning iterations, excluding the random initialization, is N = 20. The training inputs of the learning algorithm are the position and contact force of the end-effector x = [X, Y, Z, Fx, Fy, Fz] ∈ R6. The training targets of the learning algorithm are the desired position and the desired contact force y = [Xd, Yd, Zd, Fdx, Fdy, Fdz] = [0.41, 0, 0.2265, 0, 0, 15] ∈ R6. The desired position is the initial

position of the end-effector in Cartesian space. The desired contact force in the z-axis direction is 15 N. If the steady-state error of the contact force is |Fz − Fzd| ≤ 1 N and the overshoot is less than 3 N, the task is successful; otherwise, it fails. The target state in the cost function is xtarget = [0.41, 0, 0.2265, 0, 0, 15]. The strategy outputs are the impedance parameters u = [Kd Bd]. The number of basis functions of the GP controller is n = 10. The ranges of the impedance parameters are set as Kd ∈ [0.1, 25] and Bd ∈ [50, 1000].

The action penalty coefficient of the cost function is ζ = 0.03. The measurement noise of the joint position, joint velocities and the contact force is set as δ ∼ N(0, 0.01²), respectively. In the random initialization, the impedance parameters are initialized to stochastic variables that are subject to N(u0 | 0.1umax, 0.1umax).
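For concreteness, the initialization and the success test described above can be written out in a few lines. The sketch below is illustrative only: Python is used here although the original implementation is in MATLAB, and it assumes that the 0.1umax term of the initial distribution is a standard deviation, which the text does not specify.

    import numpy as np

    U_MIN = np.array([0.1, 50.0])     # lower bounds of u = [Kd, Bd] from the text
    U_MAX = np.array([25.0, 1000.0])  # upper bounds of u = [Kd, Bd] from the text

    def init_impedance(rng):
        # Random initialization: u0 ~ N(0.1*umax, 0.1*umax), clipped to the valid range.
        u0 = rng.normal(loc=0.1 * U_MAX, scale=0.1 * U_MAX)
        return np.clip(u0, U_MIN, U_MAX)

    def task_successful(fz, fzd=15.0):
        # Success test from the text: steady-state error <= 1 N and overshoot < 3 N.
        return abs(fz[-1] - fzd) <= 1.0 and (fz.max() - fzd) < 3.0

    rng = np.random.default_rng(0)
    print(init_impedance(rng))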

5.2. Simulation Results

Figure 8 shows the simulation results of the force control learning system. The block marks in Figure 8a indicate whether the task is successful or not. The light blue dash-dotted line represents the cumulative cost during the explorations. Figure 9 details the learning process of the total 20 learning iterations. Figure 10 shows the joint trajectories after 4, 5, 10, 15, and 20 updates. Note that N = 0 denotes the initial trial, and the whole learning process is implemented automatically.

In the initial trial, the robot moves down slowly to search the contact plane and the end-effector begins to contact the plane at T = 0.7 s. Due to the large overshoot of the contact force, the task fails until the fifth learning iteration. After the fifth learning iteration, the end-effector contacts the plane at T = 0.34 s, which is faster than in the initial trial. The contact force reaches the desired value rapidly with a little overshoot. The learning algorithm optimizes the control strategy continuously to reduce the cumulative cost. After seven iterations, the strategy's performance is stable, and the contact force reaches the desired value quickly without overshoot by adjusting the impedance parameters dynamically. The optimal strategy is learned after 20 iterations, and its cumulative cost is the smallest. The end-effector can contact the plane at T = 0.3 s.

Figure 8. Simulation results of force control learning system. (a) The cost curve of learning process; the blue dotted line and the blue shade are the predicted cost mean and the 95% confidence interval. (b) The performances of force control during learning process.


Figure 9. Learning process of the total 20 learning iterations. The first row: contact force in z-axis direction; the second row: Cartesian stiffness schedules; the third row: Cartesian damping schedules; the fourth row: position of the end-effector in z-axis direction; the fifth row: velocity of the end-effector in z-axis direction.

Figure 10. Joint trajectories after 4, 5, 10, 15, and 20 updates for the second, third, and fifth joint of the Reinovo robot.

Figure 11 shows the evolutions of force control and impedance profiles. The state evolutions predicted by the GP model can be represented by the blue dotted line and the blue shade, which denote the mean and the 95% confidence interval of the state prediction, respectively. The historical sampled data are constantly enriched with the increase of interaction time, and the learned GP model is optimized and stabilized gradually. When a good GP model is learned, it can be used as a faithful proxy of the real system (Figure 11b–d).
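To make the role of the GP model concrete, the sketch below fits a one-dimensional GP with a squared-exponential kernel to sampled transition data and returns the predictive mean and 95% confidence interval, which is exactly the quantity plotted as the blue dotted line and blue shade. It is a minimal illustration following [46]; the hyperparameters are fixed rather than optimized, and all variable names are illustrative.

    import numpy as np

    def se_kernel(a, b, ell=0.3, sf2=1.0):
        # Squared-exponential kernel k(a, b) = sf2 * exp(-(a - b)^2 / (2 * ell^2)).
        d = a[:, None] - b[None, :]
        return sf2 * np.exp(-0.5 * (d / ell) ** 2)

    def gp_predict(x_train, y_train, x_test, noise=0.01):
        # GP posterior mean and variance at x_test (Rasmussen and Williams [46], Ch. 2).
        K = se_kernel(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
        L = np.linalg.cholesky(K)  # O(n^3) step that dominates as the data set grows
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
        Ks = se_kernel(x_train, x_test)
        mean = Ks.T @ alpha
        v = np.linalg.solve(L, Ks)
        var = se_kernel(x_test, x_test).diagonal() - np.sum(v ** 2, axis=0)
        return mean, var

    # Toy transition data standing in for sampled (state, next-state) pairs.
    x = np.linspace(0.0, 1.0, 20)
    y = np.sin(3.0 * x) + 0.01 * np.random.default_rng(0).standard_normal(20)
    mu, var = gp_predict(x, y, np.linspace(0.0, 1.0, 50))
    band = 1.96 * np.sqrt(var)  # half-width of the 95% confidence interval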


Figure 11. States evolution of force control. Columns (a–d) are the state evolutions of the 1st, 6th, 12th, and 20th learning iteration, respectively. The top row is the change of contact force Fz, the second row is the target stiffness Kdz, and the third row is the target damping Bdz.

As shown in Figure 11d, the process of impedance regulation can be divided into three phases: (1) the phase before contacting the plane; (2) the phase of coming into contact with the plane; and (3) the stable phase of contact with the plane. When the end-effector contacts the environment, the manipulator is suddenly converted from free space motion to constrained space motion, and the collision is inevitable. The stiffness of the environment increases suddenly. Consequently, the stiffness of the controller declines rapidly to make the system 'soft' to ensure safety. Meanwhile, the damping continues to increase to make the system 'stiff' to suppress the impact of environmental disturbance.
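As a rough illustration of how such time-varying gains enter the control loop, consider a force-tracking impedance law of the common textbook form Md ẍ + Bd(t) ẋ + Kd(t) x = Fe − Fd [2,45]; the paper's exact formulation is given earlier in the article and is not repeated here. A minimal discrete-time sketch with purely illustrative numbers:

    import numpy as np

    def impedance_step(x, dx, fe, fd, kd, bd, md=1.0, dt=0.01):
        # One Euler step of Md*ddx + Bd*dx + Kd*x = Fe - Fd for the z-axis deviation x.
        ddx = (fe - fd - bd * dx - kd * x) / md
        dx = dx + ddx * dt
        return x + dx * dt, dx

    # Caricature of the learned behavior: lower Kd and raise Bd around contact.
    x, dx = 0.0, 0.0
    for t in np.arange(0.0, 1.0, 0.01):
        kd, bd = (6.0, 600.0) if t < 0.3 else (1.0, 680.0)  # illustrative schedule
        fe = 15.0 if t >= 0.3 else 0.0                      # crude contact switch
        x, dx = impedance_step(x, dx, fe, 15.0, kd, bd)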

Throughout the simulation, only five interactions (5 s of interaction time) are required to learn to complete the force tracking task successfully. Besides, the impedance characteristics of the learned strategy are similar to those of humans for force tracking. The simulation results verify the effectiveness and the efficiency of the proposed system.

5.3. Experimental Study

The hardware architecture of the system is shown in Figure 12. The test platform consists of six parts: an industrial manipulator, a servo driver, a Galil-DMC-2163 motion controller, a Bioforcen F/T sensor, an industrial PC, and a workstation. All parts communicate with the others through the switch. The motion controller communicates with the industrial PC through the TCP/IP protocol running at 100 Hz. The sensor communicates with the industrial PC via an Ethernet interface through TCP/IP and samples the data at 1 kHz. The specific implementation diagram of the algorithm is shown in Figure 13. The motion control of the robot is executed using Visual Studio 2010 on the industrial PC. The learning algorithm is implemented by MATLAB on the workstation. The workstation communicates with the industrial PC through the UDP protocol.

Figure 12. Hardware architecture of the system.


Figure 13. Implementation diagram of the algorithm.
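The workstation-to-PC link can be pictured with a short UDP sketch. The port number, address, and packet layout below are hypothetical, since the paper does not specify them.

    import socket
    import struct

    # Workstation side: send the impedance parameters u = [Kd, Bd] computed by the
    # learning algorithm to the industrial PC as two little-endian doubles.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    kd, bd = 6.2, 590.0  # illustrative values
    sock.sendto(struct.pack("<2d", kd, bd), ("192.168.0.10", 5005))

    # Industrial PC side (sketch): receive and unpack the parameters.
    # recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # recv.bind(("0.0.0.0", 5005))
    # kd, bd = struct.unpack("<2d", recv.recvfrom(16)[0])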

To imitate the nonlinear variable characteristics of the circumstance during the force tracking task, a combination of spring and rope is taken as the unstructured contact environment. As shown in Figure 14, the experimental setup mainly consists of a spring dynamometer attached to the tool at the end-effector and a rope of unknown length tied to the spring, with the other end fixed on the table. The contact force is controlled by stretching the rope and the spring. Here, the specifications of the rope are unknown, and the rope is in a natural state of relaxation. The measurement range, length, and diameter of the spring dynamometer are 30 Kg, 0.185 m, and 29 mm, respectively. The exact values of the stiffness and damping of the springs are unknown to the system. In the experiments, the episode length (i.e., the prediction horizon) is T = 3 s. Other settings are consistent with those of the simulations in Section 5.1.
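One simple reading of the model in Figure 14b is piecewise: the contact force is zero while the rope is slack, and once the rope is taut the force grows with the spring extension. A toy version of this assumption follows; the real rope length and spring stiffness are unknown to the system, so both parameter values below are made up.

    def contact_force(z, rope_len=0.04, k_spring=400.0):
        # Zero force until the rope of length rope_len is taut, then a linear
        # spring force; z is the end-effector displacement from the start pose.
        stretch = z - rope_len
        return k_spring * stretch if stretch > 0.0 else 0.0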

Figure 14. (a) Experimental setup; (b) simplified model of the contact environment.

5.4. Experimental Results

The experimental results are shown in Figure 15. Figure 16 details main iterations of the learning process. From the experimental results, we can see that in the initial trial, the manipulator moves slowly and the rope begins to be stretched at T = 2.5 s to increase the contact force. The contact force reaches 10.5 N at the end of the test, which implies that the task failed. In the second trial, the rope and the spring can be stretched at T = 1.5 s, which is faster than that of the first trial, and the contact force reaches the desired value rapidly, but the task failed because the overshoot is greater than 3 N. Only two learning iterations are needed to complete the task successfully. After 17 iterations, the cumulative cost is the smallest and the force control performance is also the best. The rope can be stretched at T = 1 s, and the overshoot is suppressed effectively.

Figure 15. Experimental results of force control learning system. (a) The cost curve of learning process. (b) The performances of force control during learning process, including a total 20 learning iterations throughout the experiment.

throughout the experiment.

Figure 16. Main iterations of the learning process. (a) Contact force; (b) Cartesian stiffness schedules; (c) Cartesian damping schedules.

Figure 17 shows the evolutions of force control and impedance profiles. The bottom row shows the stretching process of the combination of the rope and the spring. The joint trajectories and Cartesian trajectories during the 20th experiment iteration are shown in Figure 18. The trajectories of other iterations are similar to those of the 20th iteration. The stretching process of the combination can be divided into four phases, just as the shaded areas show. Table 1 summarizes the key states of the four phases. The corresponding subscripts of Fz, Kdz, and Bdz are shown in Figure 17d, while the subscripts of position (P) and velocity (V) are shown in Figure 18c,d.


Figure 17. States evolution of force control. Columns (a–d) are the state evolutions of the 1st, 6th, 12th, and 20th learning iteration, respectively. The top row shows the contact force Fz, the second row shows the profile of stiffness Kdz, and the third row shows the profile of damping Bdz.

Figure 18. Trajectories during the 20th experiment iteration. (a) Joint position; (b) Joint velocity; (c) Cartesian position of the end-effector; (d) Cartesian velocity of the end-effector.

Table 1. Key states of the four phases during the 20th iteration.

Subscript   T/s    P/m      V/(m·s−1)   Fz/N     Kdz/(N·m−1)   Bdz/(N·s·m−1)
0           0.00   0.2265   0.0000      0.393    6.246         588.97
1           1.28   0.2642   0.0327      0.326    8.468         653.99
2           1.48   0.2701   0.0292      6.095    5.888         673.83
3           2.00   0.2744   0.0010      14.625   2.469         601.93
4           3.00   0.2745   0.0000      14.874   2.379         598.53

The four phases of impedance regulation correspond to the force control process mentioned above (a sketch of the resulting gain schedule follows the list):

1. T0 − T1 : Phase before stretching the rope. The manipulator moves freely in the free space.
To tighten the rope quickly, the movement of the end-effector increases as the impedance increases.
The contact force is zero in this phase.

2. T1 − T2 : Phase of stretching the rope. When the rope is stretched, the manipulator is suddenly

converted from free space motion to constrained space motion. The stiffness of the environment

increases suddenly, and this can be seen as a disturbance of the environment. Consequently,

the stiffness of the controller declines rapidly to make the system ‘soft’ to ensure safety.

Meanwhile, the damping continues to increase to make the system ‘stiff’ to suppress the

impact of environmental disturbance and avoid oscillation. On the whole, the system achieves

an appropriate strategy by weighting ‘soft’ and ‘stiff’. In this phase, the contact force increases

rapidly until the rope is tightened.

3. T2 − T3 : Phase of stretching the spring. The spring begins to be stretched after the rope is tightened.

Although the environment changes suddenly, the controller does not select the same strategy as in Phase
2; it makes the system 'soft' by gradually reducing the stiffness and damping to suppress the

disturbances. In this way, the contact force increases slowly to avoid overshoot when approaching

the desired value.

4. T3 − T4 : Stable phase of stretching the spring. The manipulator contacts with the environment

continuously and the contact force is stabilized to the desired value. In this phase, the stiffness

and damping of the controller are kept at minimum so that the system maintains the ability

of compliance.
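Reading the four phases as a gain schedule, the learned behavior can be caricatured by a lookup over the phase boundaries of Table 1. The numbers below are the Kdz and Bdz values at T0 to T3 from the 20th iteration; the real strategy varies the gains smoothly rather than switching between constants.

    PHASES = [  # (start time in s, Kdz, Bdz) taken from Table 1
        (0.00, 6.246, 588.97),  # T0: free motion, impedance rising
        (1.28, 8.468, 653.99),  # T1: rope stretching begins
        (1.48, 5.888, 673.83),  # T2: spring stretching begins
        (2.00, 2.469, 601.93),  # T3: stable contact, gains kept low
    ]

    def gains_at(t):
        # Piecewise-constant caricature of the learned impedance schedule.
        kd, bd = PHASES[0][1], PHASES[0][2]
        for start, k, b in PHASES:
            if t >= start:
                kd, bd = k, b
        return kd, bd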

There are a total of 20 learning iterations throughout the experiment. In the early stage of learning, the uncertainties of the GP model are large due to the lack of collected data. With the increase of interaction time, the learned GP model is improved gradually. After two learning iterations, which means that only 6 s of interaction time is required, a sufficient dynamics model and strategy can be learned to complete the force tracking task successfully. The experimental results above verify that the proposed force control learning system is data-efficient. This is mainly because the system explicitly establishes the transition dynamics that are used for internal virtual simulations and predictions, and the optimal strategy is improved by evaluations. In this way, more information can be extracted from the sampled data.
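The data flow described here can be condensed into a short loop. Every function below is a stub standing in for a component of the system (robot rollout, GP training, strategy optimization) and is not the paper's actual code; the point of the sketch is that only one line per iteration touches the physical robot.

    import numpy as np

    def rollout(strategy, rng):
        return rng.standard_normal(5)                # stub for one sampled episode

    def train_gp(data):
        return float(np.mean(np.concatenate(data)))  # stub for model learning

    def optimize_strategy(strategy, model):
        return strategy - 0.1 * model                # stub for model-based improvement

    def learn_force_control(n_iterations=20, seed=0):
        rng = np.random.default_rng(seed)
        strategy = 0.0
        data = [rollout(strategy, rng)]              # random initial trial
        for _ in range(n_iterations):
            model = train_gp(data)                   # learn transition dynamics
            strategy = optimize_strategy(strategy, model)  # internal simulation only
            data.append(rollout(strategy, rng))      # the single real interaction
        return strategy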

5.5.1. Environmental Adaptability

The results above have verified that the learned strategy could adjust the impedance profiles to adapt to the environmental change in the episodic case. However, what happens if the whole contact environment changes? In order to verify the environmental adaptability of the proposed learning variable impedance control method, we use a different spring dynamometer to build the unstructured contact environment. The measurement range, length, and diameter of the second spring dynamometer are 15 Kg, 0.155 m, and 20 mm, respectively. This implies that the stiffness and location of the environment are both changed. Other experimental setups are consistent with Section 5.3. The initial strategy for the second environment is the learned (sub)optimal strategy for the first spring (N = 17 in Section 5.4). The results are illustrated in Figure 19.

Figure 19. Experimental results of environmental adaptability. (a) Cost curve; (b) contact force; (c) Cartesian stiffness schedules; (d) Cartesian damping schedules.

From the experimental results, we can see that in the initial application of the control strategy, even though the impedance profile is regulated online according to the learned strategy, the task fails. This is mainly because the whole environment has changed a lot, and the strategy learned for the previous environment is not suitable for the current environment. However, as the learning process continues to optimize, the learned (sub)optimal strategy could adapt to the new environment and successfully complete the task after two learning iterations. Therefore, the proposed method could adapt to new environments, taking advantage of its learning ability.

5.5.2. Comparison of Force Control Performance

Variable impedance control can regulate the task-specific impedance parameters at different phases to complete the task more effectively, which is a characteristic that the constant impedance control is not equipped with. Next, the proposed learning variable impedance control is compared with the constant impedance control and the adaptive impedance control [13] to verify the effectiveness of force control. The stiffness of the constant impedance controller is set as Kd = 10 while the stiffness of the adaptive impedance controller is set as Kd = 0 [13]. The damping of the controller could be adjusted manually or automatically. Experimental comparison results are illustrated in Figure 20.

Figure 20. Experimental comparison results. (a) Force control performance; (b) target stiffness; (c) target damping.

In order to quantitatively compare the performances, we use four indicators. The first indicator is T_free, the time when the force begins to increase, i.e., the movement time of the robot in free space. The second indicator is the accumulated cost J(θ), which is defined in Equation (33). Note that the accumulated cost includes two parts: the cost caused by the error to the target state (Equation (36)) and the cost caused by the control gains (Equation (37)). The third one is the root-mean-square error (RMSE) of the contact force

\mathrm{RMSE} = \sqrt{ \frac{ \sum_{t=1}^{H} \left( F_z(t) - F_{zd} \right)^2 }{ H } },    (44)

where Fz(t) is the actual contact force in Z-axis direction, and Fzd is the desired contact force. H is the total number of the samples during the episode.
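Given a logged force trace, these indicators reduce to one-liners; the snippet below implements Equation (44) directly. The rise-detection threshold used for T_free is illustrative, and the energy indicator of Equation (45), introduced next, can be coded analogously.

    import numpy as np

    def t_free(t, fz, eps=0.5):
        # Time at which the contact force first starts to rise (threshold in N).
        return t[np.argmax(fz > eps)]

    def rmse(fz, fzd=15.0):
        # Equation (44): RMSE of the contact force over the H samples of an episode.
        return np.sqrt(np.mean((fz - fzd) ** 2))

    # Example with a fake 3 s force log sampled 100 times per second.
    t = np.linspace(0.0, 3.0, 300)
    fz = np.clip(10.0 * (t - 1.0), 0.0, 15.0)
    print(t_free(t, fz), rmse(fz))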

An energy indicator is defined to indicate the accumulated consumed energy during the control process

E_y = \sum_{t=1}^{H} \sum_{i=1}^{6} \frac{1}{2} m_i \dot{\theta}_i^2(t),    (45)

where m_i is the mass of the ith joint and \dot{\theta}_i is the angular velocity of the ith joint. Actually, the masses of the joints are not known accurately. Without loss of generality, the masses can be approximately set as [m1, m2, m3, m4, m5, m6] = [1, 1, 1, 0.5, 0.3, 0.1].

Table 2 reveals the performance indicators of the three impedance controllers. Compared with the constant/adaptive impedance control, the learning variable impedance control has the minimum indicator values, which indicates that the proposed system is effective. Obviously, the performance of the adaptive impedance control is better than that of the constant one, but still worse than the optimal impedance strategy learned by the proposed learning variable impedance control.

Table 2. Performance indicators.

Mode                                  Name       Tfree/s   Cost    RMSE    Ey (×10³)   Overshoot
Constant impedance control            Bd = 450   1.70      43.29   11.29   5.51        No
Constant impedance control            Bd = 400   1.50      41.09   10.83   5.85        Yes
Adaptive impedance control            Bd-Adj     1.50      37.23   11.09   5.38        No
Learning variable impedance control   N = 20     1.25      39.71   10.19   2.33        No
Learning variable impedance control   N = 17     1.00      33.64   9.23    2.08        No

5.5.3. Learning Speed

In order to further illustrate the efficiency of the proposed method, we compared the learning speed with a state-of-the-art learning variable impedance control method, C-PI2, through the via-gain task [37]. The cost function of C-PI2 is chosen as r_t = w_1 δ(t − 0.4) ‖K_1 − K_t^P‖ with w_1 = 1 × 10^8. The cost function of our method is chosen as Equation (35) with x_target = 15, ζ = 0.03. The cost curve of the learning process is shown in Figure 21a. Similar to other studies [37], to indicate the data-efficiency of the method, we take the number of rollouts required to obtain a satisfactory strategy as the indicator of learning speed. The results show that C-PI2 converges after about 92 rollouts, whereas our method needs only 4 rollouts.

Figure 21. (a) The cost curve of the learning process. (b) Comparison of learning speed with other learning variable impedance control methods.


Current learning variable impedance control methods usually require hundreds or thousands

of rollouts to get a stable strategy. For tasks that are sensitive to the contact force, too many physical

interactions with the environment during the learning process are often infeasible. Improving the

efficiency of learning method is critical. Figure 21b shows the comparison of learning speed with other

learning variable impedance control methods. From the results of [4,37,50], we can see that, to get

a stable strategy, PI2 needs more than 1000 rollouts, whereas PSO requires 360 rollouts. The efficiency

of PoWER is almost the same as that of C-PI2 , which requires 200 and 120 rollouts, respectively.

The learning speed of C-PI2 is much higher than that of previous methods but is still slower than our

method. Our method outperforms other learning variable impedance control methods by at least one

order of magnitude.

6. Discussion

According to the definition of the cost function (35)–(37), we can learn that decreasing the distance between xt and xtarget and keeping the impedance parameters ut at a low level will be beneficial for minimizing the accumulated cost. Consequently, small damping parameters will make the robot move quickly to contact with the environment to reduce the distance between xt and xtarget. Unfortunately, small damping could reduce the positioning accuracy of the robot and thus leave the system with a poor ability to suppress disturbances. On the contrary, large damping could improve the system's ability to suppress disturbances but reduce the speed of motion. It will lead to task failure if the impedance parameters cannot be regulated to suppress the overshoot. Hence, the learning algorithm must make a tradeoff between rapidity and stability to achieve the proper control strategy. By regulating the target stiffness and damping independently at different phases, the robot achieves rapid contact with the environment while the overshoot is effectively suppressed.

The impedance characteristics of the learned strategy are similar to the strategy employed by humans for force tracking. Humans reduce the impedance by muscle relaxation to make the system 'soft' when safety needs to be guaranteed, and increase the impedance by muscle contraction to make the system 'stiff' when fast tracking or disturbance suppression is needed. When the contact environment is stable, the arm is kept in a compliant state by muscle relaxation to minimize the energy consumption. Our system learns the optimal impedance strategy automatically through continuous explorations. This is different from the methods of imitating human impedance behavior, such as [18] and [24], which usually need an additional device, e.g., EMG electrodes, to transfer human skills to the robots by demonstration.

The proposed learning variable impedance control method emphasizes data-efficiency, i.e., sample efficiency, by learning a GP model for the system. This is critical for learning to perform force-sensitive tasks. Note that only the application of the strategy requires physical interaction with the robot; internal simulations and strategy learning only use the learned GP model. Although a fast converging controller is found, the proposed method still has the limitation of being computationally intensive. The computational time for each rollout on a workstation computer with an Intel(R) Core(TM) i7-6700K CPU@4.00 GHz is detailed in Figure 22.

Figure 22. Computational time for each rollout.

rollout.

Obviously, between the trials the method requires approximately 9 min to find the (sub)optimal

strategy. With the increase of sample data set, the computational time is increasing gradually, because

the kernel matrices need to be stored and inverted repeatedly. The most demanding computations

are the predictive distribution and the derivatives for prediction using the GP model.

Sensors 2018, 18, 2539 24 of 26

Obviously, between the trials the method requires approximately 9 min to find the (sub)optimal

strategy. With the increase of sample data set, the computational time is increasing gradually, because

the kernel matrices need to be stored and inverted repeatedly. The most demanding computations are

the predictive distribution and the derivatives for prediction using the GP model.
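The cubic growth is easy to see directly: the dominant step of every GP prediction is factorizing the n × n kernel matrix, which must be redone whenever new rollout data arrive. A tiny timing sketch with illustrative sizes:

    import time
    import numpy as np

    rng = np.random.default_rng(0)
    for n in (100, 200, 400, 800):
        X = rng.standard_normal((n, 6))
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * sq) + 1e-6 * np.eye(n)  # SE kernel matrix plus jitter
        t0 = time.perf_counter()
        np.linalg.cholesky(K)                     # the O(n^3) factorization
        print(n, time.perf_counter() - t0)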

7. Conclusions

In this paper, we presented a data-efficient learning variable impedance control method that enables industrial robots to automatically learn to control the contact force in the unstructured environment. The goal was to improve the sampling efficiency and reduce the required physical interactions during the learning process. To do so, a GP model was learned as a faithful proxy of the system dynamics, which was then used for internal simulations to improve the data-efficiency by predicting

the long-term state evolution. This method learned an impedance regulation strategy, based on

which the impedance profiles were regulated online to track the desired contact force. In this way,

the flexibility and adaptivity of the system were enhanced. It is worth noting that the optimal

impedance control strategy, which is equipped with the similar impedance characteristics of humans,

is automatically learned through several iterations. There is no need to transfer human skills to the

robot with additional sampling devices. The effectiveness and data-efficiency of the system were

verified through simulations and experiments on the six-DoF Reinovo industrial robot. The learning

speed of this system outperforms other learning variable impedance control methods by at least one

order of magnitude.

Currently, the described work only focuses on the efficient learning of force control. In the

future work, we will extend this system to learn to complete more complex tasks that are sensitive

to contact force, such as assembly tasks of fragile components. Furthermore, parallel and online

implementation of this method to improve computational efficiency would be a meaningful and

interesting research direction.

Author Contributions: G.X. conceived the main idea; C.L. designed the algorithm and system; Z.Z. designed the

simulation; C.L. and X.X. designed and performed the experiments; Z.Z. and Q.Z. analyzed the data; C.L. and Z.Z.

wrote the paper; Q.Z. discussed and revised the paper; G.X. supervised the research work.

Funding: This research was supported by the National Natural Science Foundation of China (grant no. U1530119)

and China Academy of Engineering Physics.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Siciliano, B.; Sciavicco, L.; Villani, L.; Oriolo, G. Robotics: Modelling, planning and control. Adv. Textb.

Control Signal Process. 2009, 4, 76–82.

2. Hogan, N. Impedance control: An approach to manipulation: Part I—Theory. J. Dyn. Syst. Meas. Control

1985, 107, 1–7. [CrossRef]

3. Burdet, E.; Osu, R.; Franklin, D.W.; Milner, T.E.; Kawato, M. The central nervous system stabilizes unstable

dynamics by learning optimal impedance. Nature 2001, 414, 446–449. [CrossRef] [PubMed]

4. Kieboom, J.V.D.; Ijspeert, A.J. Exploiting natural dynamics in biped locomotion using variable impedance

control. In Proceedings of the 13th IEEE-RAS International Conference on Humanoid Robots, Atlanta, GA,

USA, 15–17 October 2013; pp. 348–353. [CrossRef]

5. Buchli, J.; Stulp, F.; Theodorou, E.; Schaal, S. Learning variable impedance control. Int. J. Robot. Res. 2011, 30,

820–833. [CrossRef]

6. Calinon, S.; Sardellitti, I.; Caldwell, D.G. Learning-based control strategy for safe human-robot interaction

exploiting task and robot redundancies. In Proceedings of the IEEE/RSJ International Conference on

Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; pp. 249–254. [CrossRef]

7. Kormushev, P.; Calinon, S.; Caldwell, D.G. Robot motor skill coordination with em-based reinforcement

learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems,

Taipei, Taiwan, 18–22 October 2010; pp. 3232–3237. [CrossRef]

Sensors 2018, 18, 2539 25 of 26

8. Kronander, K.; Billard, A. Online learning of varying stiffness through physical human-robot interaction.

In Proceedings of the IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA,

14–18 May 2012; pp. 1842–1849. [CrossRef]

9. Yang, C.; Ganesh, G.; Haddadin, S.; Parusel, S.; Albu-Schaeffer, A.; Burdet, E. Human-like adaptation of

force and impedance in stable and unstable interactions. IEEE Trans. Robot. 2011, 27, 918–930. [CrossRef]

10. Kronander, K.; Billard, A. Stability considerations for variable impedance control. IEEE Trans. Robot. 2016,

32, 1298–1305. [CrossRef]

11. Medina, J.R.; Sieber, D.; Hirche, S. Risk-sensitive interaction control in uncertain manipulation tasks.

In Proceedings of the IEEE International Conference on Robotics and Automation, Karlsruhe, Germany,

6–10 May 2013; pp. 502–507. [CrossRef]

12. Braun, D.; Howard, M.; Vijayakumar, S. Optimal variable stiffness control: Formulation and application to

explosive movement tasks. Auton. Robots 2012, 33, 237–253. [CrossRef]

13. Duan, J.; Gan, Y.; Chen, M.; Dai, X. Adaptive variable impedance control for dynamic contact force tracking

in uncertain environment. Robot. Auton. Syst. 2018, 102, 54–65. [CrossRef]

14. Ferraguti, F.; Secchi, C.; Fantuzzi, C. A tank-based approach to impedance control with variable stiffness.

In Proceedings of the IEEE International Conference on Robotics and Automation, Karlsruhe, Germany,

6–10 May 2013; pp. 4948–4953. [CrossRef]

15. Takahashi, C.; Scheidt, R.; Reinkensmeyer, D. Impedance control and internal model formation when

reaching in a randomly varying dynamical environment. J. Neurophysiol. 2001, 86, 1047–1051. [CrossRef]

[PubMed]

16. Tsuji, T.; Tanaka, Y. Bio-mimetic impedance control of robotic manipulator for dynamic contact tasks. Robot.

Auton. Syst. 2008, 56, 306–316. [CrossRef]

17. Lee, K.; Buss, M. Force tracking impedance control with variable target stiffness. IFAC Proc. Volumes 2008, 41,

6751–6756. [CrossRef]

18. Yang, C.; Zeng, C.; Liang, P.; Li, Z.; Li, R.; Su, C.Y. Interface design of a physical human-robot interaction

system for human impedance adaptive skill transfer. IEEE Trans. Autom. Sci. Eng. 2017, 15, 329–340.

[CrossRef]

19. Kronander, K.; Billard, A. Learning compliant manipulation through kinesthetic and tactile human-robot

interaction. IEEE Trans. Haptics 2014, 7, 367–380. [CrossRef] [PubMed]

20. Li, M.; Yin, H.; Tahara, K.; Billard, A. Learning object-level impedance control for robust grasping and

dexterous manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation

(ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 6784–6791. [CrossRef]

21. Yang, C.; Zeng, C.; Fang, C.; He, W.; Li, Z. A dmps-based framework for robot learning and generalization of

humanlike variable impedance skills. IEEE/ASME Trans. Mechatron. 2018, 23, 1193–1203. [CrossRef]

22. Yang, C.; Luo, J.; Pan, Y.; Liu, Z.; Su, C.Y. Personalized variable gain control with tremor attenuation for

robot teleoperation. IEEE Trans. Syst. Man Cybern. Syst. 2017, PP, 1–12. [CrossRef]

23. Ajoudani, A.; Tsagarakis, N.; Bicchi, A. Tele-impedance: Teleoperation with impedance regulation using

a body–machine interface. Int. J. Robot. Res. 2012, 31, 1642–1656. [CrossRef]

24. Howard, M.; Braun, D.J.; Vijayakumar, S. Transferring human impedance behavior to heterogeneous variable

impedance actuators. IEEE Trans. Robot. 2013, 29, 847–862. [CrossRef]

25. Polydoros, A.S.; Nalpantidis, L. Survey of model-based reinforcement learning: Applications on robotics.

J. Intell. Robot. Syst. 2017, 86, 153–173. [CrossRef]

26. Kober, J.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274.

[CrossRef]

27. Koropouli, V.; Hirche, S.; Lee, D. Generalization of force control policies from demonstrations for constrained

robotic motion tasks. J. Intell. Robot. Syst. 2015, 80, 1–16. [CrossRef]

28. Mitrovic, D.; Klanke, S.; Howard, M.; Vijayakumar, S. Exploiting sensorimotor stochasticity for learning

control of variable impedance actuators. In Proceedings of the 10th IEEE-RAS International Conference on

Humanoid Robots, Nashville, TN, USA, 6–8 December 2010; pp. 536–541. [CrossRef]

29. Stulp, F.; Buchli, J.; Ellmer, A.; Mistry, M.; Theodorou, E.A.; Schaal, S. Model-free reinforcement learning of

impedance control in stochastic environments. IEEE Trans. Auton. Ment. Dev. 2012, 4, 330–341. [CrossRef]

30. Du, Z.; Wang, W.; Yan, Z.; Dong, W.; Wang, W. Variable admittance control based on fuzzy reinforcement

learning for minimally invasive surgery manipulator. Sensors 2017, 17, 844. [CrossRef] [PubMed]

Sensors 2018, 18, 2539 26 of 26

31. Li, Z.; Liu, J.; Huang, Z.; Peng, Y.; Pu, H.; Ding, L. Adaptive impedance control of human-robot cooperation using reinforcement learning. IEEE Trans. Ind. Electron. 2017, 64, 8013–8022. [CrossRef]
32. Deisenroth, M.; Neumann, G.; Peters, J. A survey on policy search for robotics. Found. Trends Robot. 2013, 2, 1–142. [CrossRef]
33. Nagabandi, A.; Kahn, G.; Fearing, R.S.; Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv, 2017.
34. Kalakrishnan, M.; Righetti, L.; Pastor, P.; Schaal, S. Learning force control policies for compliant manipulation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011; pp. 4639–4644. [CrossRef]
35. Pastor, P.; Kalakrishnan, M.; Chitta, S.; Theodorou, E. Skill learning and task outcome prediction for manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3828–3834. [CrossRef]
36. Rombokas, E.; Malhotra, M.; Theodorou, E.; Matsuoka, Y.; Todorov, E. Tendon-driven variable impedance control using reinforcement learning. In Robotics: Science and Systems VIII; MIT Press: Cambridge, MA, USA, 2012.
37. Winter, F.; Saveriano, M.; Lee, D. The role of coupling terms in variable impedance policies learning. In Proceedings of the 9th International Workshop on Human-Friendly Robotics, Genova, Italy, 29–30 September 2016.
38. Shadmehr, R.; Mussa-Ivaldi, F.A. Adaptive representation of dynamics during learning of a motor task. J. Neurosci. 1994, 14, 3208–3224. [CrossRef] [PubMed]
39. Franklin, D.W.; Burdet, E.; Tee, K.P.; Osu, R.; Chew, C.-M.; Milner, T.E.; Kawato, M. CNS learns stable, accurate, and efficient movements using a simple algorithm. J. Neurosci. 2008, 28, 11165–11173. [CrossRef] [PubMed]
40. Villagrossi, E.; Simoni, L.; Beschi, M.; Pedrocchi, N.; Marini, A.; Molinari Tosatti, L.; Visioli, A. A virtual force sensor for interaction tasks with conventional industrial robots. Mechatronics 2018, 50, 78–86. [CrossRef]
41. Wahrburg, A.; Bös, J.; Listmann, K.D.; Dai, F.; Matthias, B.; Ding, H. Motor-current-based estimation of Cartesian contact forces and torques for robotic manipulators and its application to force control. IEEE Trans. Autom. Sci. Eng. 2018, 15, 879–886. [CrossRef]
42. Wahrburg, A.; Morara, E.; Cesari, G.; Matthias, B.; Ding, H. Cartesian contact force estimation for robotic manipulators using Kalman filters and the generalized momentum. In Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), Gothenburg, Sweden, 24–28 August 2015; pp. 1230–1235. [CrossRef]
43. Wittenburg, J. Dynamics of Multibody Systems; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2007; ISBN 9783540739142.
44. Floreano, D.; Husbands, P.; Nolfi, S. Springer Handbook of Robotics; Springer: Berlin/Heidelberg, Germany, 2008.
45. Ott, C. Cartesian Impedance Control of Redundant and Flexible-Joint Robots; Springer: Berlin/Heidelberg, Germany, 2008; Volume 49; ISBN 978-3-540-69253-9.
46. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning); MIT Press: Cambridge, MA, USA, 2006; pp. 69–106; ISBN 026218253X.
47. Deisenroth, M.P.; Fox, D.; Rasmussen, C.E. Gaussian processes for data-efficient learning in robotics and control. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 408–423. [CrossRef] [PubMed]
48. Deisenroth, M.P. Highlight Talk: Gaussian Processes for Data Efficient Learning in Robotics and Control. AISTATS 2014. Available online: https://www.youtube.com/watch?v=dWsjjszwfi0 (accessed on 12 November 2014).
49. Deisenroth, M.P.; Rasmussen, C.E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 465–472.
50. Kober, J.; Peters, J. Learning motor primitives for robotics. In Proceedings of the IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 2112–2118. [CrossRef]

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
