# Multi-Agent Quadrotor Testbed Control Design: Integral Sliding Mode vs. Reinforcement Learning∗

Steven L. Waslander†, Gabriel M. Hoffmann†,‡
Ph.D. Candidates, Aeronautics and Astronautics
Stanford University
{stevenw, gabeh}@stanford.edu

Jung Soon Jang
Research Associate, Aeronautics and Astronautics
Stanford University
jsjang@stanford.edu

Claire J. Tomlin
Associate Professor, Aeronautics and Astronautics
Stanford University
tomlin@stanford.edu

Abstract—The Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC) is a multi-vehicle testbed currently comprised of two quadrotors, also called X4-flyers, with capacity for eight. This paper presents a comparison of control design techniques, specifically for outdoor altitude control, in and above ground effect, that accommodate the unique dynamics of the aircraft. Due to the complex airflow induced by the four interacting rotors, classical linear techniques failed to provide sufficient stability. Integral Sliding Mode and Reinforcement Learning control are presented as two design techniques for accommodating the nonlinear disturbances. Both methods result in greatly improved performance over classical control techniques.

I. INTRODUCTION

As first introduced by the authors in [1], the Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC) is an aerial platform intended to validate novel multi-vehicle control techniques and present real-world problems for further investigation. The base vehicle for STARMAC is a four-rotor aircraft with fixed-pitch blades, referred to as a quadrotor, or an X4-flyer. The vehicles are capable of 15 minute outdoor flights in a 100m square area [1].

Fig. 1. One of the STARMAC quadrotors in action.

There have been numerous projects involving quadrotors to date, with the first known hover occurring in October 1922 [2]. Recent interest in the quadrotor concept has been sparked by commercial remote control versions, such as the DraganFlyer IV [3]. Many groups [4]–[7] have seen significant success in developing autonomous quadrotor vehicles. To date, however, STARMAC is the only operational multi-vehicle quadrotor platform capable of autonomous outdoor flight, without tethers or motion guides.

∗Research supported by ONR under the MURI contract N00014-02-1-0720, "CoMotion: Computational Methods for Collaborative Motion"; by the NASA Joint University Program under grant NAG 2-1564; and by NASA grant NCC 2-5536.
†These authors contributed equally to this work.
‡Funding provided by a National Defense Science and Engineering Grant.

The first major milestone for STARMAC was autonomous hover control, with closed loop control of attitude, altitude and position. Using inertial sensing, the attitude of the aircraft is simple to control, by applying small variations in the relative speeds of the blades. In fact, standard integral LQR techniques were applied to provide reliable attitude stability and tracking for the vehicle. Position control was also achieved with an integral LQR, with careful design in order to ensure spectral separation of the successive loops.

Unfortunately, altitude control proves less straightforward. Many factors affect the altitude loop specifically that do not lend themselves to classical control techniques. Foremost is the highly nonlinear and destabilizing effect of the four interacting rotor downwashes. In our experience, this effect becomes critical when motion is not damped by motion guides or tethers. Empirical observation during manual flight revealed a noticeable loss in thrust upon descent through the highly turbulent flow field. Similar aerodynamic phenomena have been studied extensively for helicopters [8], but not for the quadrotor, due to its relative obscurity and complexity. Other factors that introduce disturbances into the altitude control loop include blade flex, ground effect and battery discharge dynamics. Although these effects are also present in generating attitude controlling moments, the differential nature of that control input eliminates much of the absolute thrust disturbance that complicates altitude control.

Additional complications arise from the limited choice of low cost, high resolution altitude sensors. An ultrasonic ranging device [9] was used, which suffers from non-Gaussian noise: false echoes and dropouts. The resulting raw data stream includes spikes and echoes that are difficult to mitigate, and is most successfully handled by rejecting infeasible measurements prior to Kalman filtering.
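The rejection-then-filter pipeline described here can be sketched in a few lines. This is a minimal illustration, not the flight code: the feasibility threshold, sample period, and noise covariances below are assumed values chosen only to demonstrate the structure.

```python
import numpy as np

def feasible(z, z_prev, dt, max_rate=300.0):
    """Reject a range reading that implies a physically infeasible
    climb/descent rate (cm/s); threshold is illustrative."""
    return abs(z - z_prev) / dt <= max_rate

def kalman_altitude(measurements, dt=1.0 / 12, q=50.0, r=25.0):
    """1-D constant-velocity Kalman filter over pre-screened sonic ranges.

    State x = [altitude, rate]; q and r are illustrative process and
    measurement noise levels, not the tuned STARMAC values.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity model
    H = np.array([[1.0, 0.0]])              # only altitude is measured
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r]])
    x = np.array([measurements[0], 0.0])
    P = np.eye(2) * 100.0
    z_prev, est = measurements[0], []
    for z in measurements:
        x, P = F @ x, F @ P @ F.T + Q       # predict
        if feasible(z, z_prev, dt):         # gate out spikes and echoes
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ (np.array([z]) - H @ x)
            P = (np.eye(2) - K @ H) @ P
            z_prev = z
        est.append(x[0])
    return est
```

When a spike fails the gate, the filter coasts on its prediction, so the isolated false echoes seen in flight never reach the control loop.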

In order to accommodate this combination of noise and disturbances, two distinct approaches are adopted. Integral Sliding Mode (ISM) control [10]–[12] takes the approach that the disturbances cannot be modeled, and instead designs a control law that is guaranteed to be robust to disturbances as long as they do not exceed a certain magnitude. Model-based reinforcement learning [13] creates a dynamic model based on recorded inputs and responses, without any knowledge of the underlying dynamics, and then seeks an optimal control law using an optimization technique based on the learned model. This paper presents an exposition of both methods and contrasts the techniques from both a design and implementation point of view.

2005 IEEE/RSJ International Conference on Intelligent Robots and Systems. 0-7803-8912-3/05/$20.00 ©2005 IEEE.

II. SYSTEM DESCRIPTION

STARMAC consists of a fleet of quadrotors and a ground station. The system communicates over a Bluetooth Class 1 network. The core of each aircraft is a set of microcontroller circuit boards designed and assembled at Stanford for this project. The microcontrollers run real-time control code, interface with sensors and the ground station, and supervise the system. The aircraft are capable of sensing position, attitude, and proximity to the ground. The differential GPS receiver is the Trimble Lassen LP, operating on the L1 band and providing 1 Hz updates. The IMU is the MicroStrain 3DM-G, a low cost, light weight unit that delivers 76 Hz attitude, attitude rate, and acceleration readings. The distance from the ground is found using ultrasonic ranging at 12 Hz.

The ground station consists of a laptop computer, to interface with the aircraft, and a GPS receiver, to provide differential corrections. It also has a battery charger, and joysticks for control-augmented manual flight, when desired.

III. QUADROTOR DYNAMICS

The derivation of the nonlinear dynamics is performed in North-East-Down (NED) inertial and body-fixed coordinates. Let $\{e_N, e_E, e_D\}$ denote the inertial axes, and $\{x_B, y_B, z_B\}$ denote the body axes, as defined in Figure 2. Euler angles of the body axes are $\{\phi, \theta, \psi\}$ with respect to the $e_N$, $e_E$ and $e_D$ axes, respectively, and are referred to as roll, pitch and yaw. Let $r$ be defined as the position vector from the inertial origin to the vehicle center of gravity (CG), and let $\omega_B$ be defined as the angular velocity in the body frame. The current velocity direction is referred to as $e_v$ in inertial coordinates.

Fig. 2. Free body diagram of a quadrotor aircraft.

The rotors, numbered 1–4, are mounted outboard on the $x_B$, $y_B$, $-x_B$ and $-y_B$ axes, respectively, with position vectors $r_i$ with respect to the CG. Each rotor produces an aerodynamic torque, $Q_i$, and thrust, $T_i$, both parallel to the rotor's axis of rotation, and both used for vehicle control. Here, $T_i \approx \frac{k_t}{1 + 0.1s} u_i$, where $u_i$ is the voltage applied to the motors, as determined from a load cell test. In flight, $T_i$ can vary greatly from this approximation. The torques, $Q_i$, are proportional to the rotor thrust, and are given by $Q_i = k_r T_i$. Rotors 1 and 3 rotate in the opposite direction from rotors 2 and 4, so that counteracting aerodynamic torques can be used independently for yaw control. Horizontal velocity results in a moment on the rotors, $R_i$, about $-e_v$, and a drag force, $D_i$, in the direction $-e_v$.

The body drag force is defined as $D_B$, vehicle mass is $m$, acceleration due to gravity is $g$, and the inertia matrix is $I \in \mathbb{R}^{3 \times 3}$. A free body diagram is depicted in Figure 2. The total force, $F$, and moment, $M$, can be summed as,

$$F = -D_B e_v + m g e_D + \sum_{i=1}^{4} \left( -T_i z_B - D_i e_v \right) \tag{1}$$

$$M = \sum_{i=1}^{4} \left( Q_i z_B - R_i e_v - D_i (r_i \times e_v) + T_i (r_i \times z_B) \right) \tag{2}$$

The full nonlinear dynamics can be described as,

$$m \ddot{r} = F, \qquad I \dot{\omega}_B + \omega_B \times I \omega_B = M \tag{3}$$

where the total angular momentum of the rotors is assumed to be near zero, because they are counter-rotating. Near hover conditions, the contributions of rolling moment and drag can be neglected in Equations (1) and (2). Define the total thrust as $T = \sum_{i=1}^{4} T_i$. The translational motion is then defined by,

$$m \ddot{r} = F = -R_\psi \cdot R_\theta \cdot R_\phi \, T z_B + m g e_D \tag{4}$$

where $R_\phi$, $R_\theta$, and $R_\psi$ are the rotation matrices for roll, pitch, and yaw, respectively. Applying the small angle approximation to the rotation matrices,

$$m \begin{bmatrix} \ddot{r}_x \\ \ddot{r}_y \\ \ddot{r}_z \end{bmatrix} = \begin{bmatrix} 1 & \psi & \theta \\ \psi & 1 & \phi \\ \theta & -\phi & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ -T \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ mg \end{bmatrix} \tag{5}$$

Finally, assuming total thrust approximately counteracts gravity, $T \approx \bar{T} = mg$, except in the $e_D$ axis,

$$m \begin{bmatrix} \ddot{r}_x \\ \ddot{r}_y \\ \ddot{r}_z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ mg \end{bmatrix} + \begin{bmatrix} 0 & -\bar{T} & 0 \\ \bar{T} & 0 & 0 \\ 0 & 0 & -1 \end{bmatrix} \begin{bmatrix} \phi \\ \theta \\ T \end{bmatrix} \tag{6}$$

For small angular velocities, the Euler angle accelerations are determined from Equation (3) by dropping the second order term, $\omega_B \times I \omega_B$, and expanding the thrust into its four constituents. The angular equations become,

$$\begin{bmatrix} I_x \ddot{\phi} \\ I_y \ddot{\theta} \\ I_z \ddot{\psi} \end{bmatrix} = \begin{bmatrix} 0 & l & 0 & -l \\ l & 0 & -l & 0 \\ K_r & -K_r & K_r & -K_r \end{bmatrix} \begin{bmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{bmatrix} \tag{7}$$

where the moment arm length $l = \| r_i \times z_B \|$ is identical for all rotors due to symmetry. The resulting linear models can now be used for control design.
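The thrust-to-moment mapping of Equation (7) can be sketched numerically. The code below shows why a differential command (raising one rotor while lowering its opposite) produces a pure roll or pitch torque, and why equal thrust offsets on all four rotors cancel; the moment arm, torque coefficient, and inertia values are illustrative placeholders, not identified STARMAC parameters.

```python
import numpy as np

# Moment mixer from Equation (7): rotor thrusts [T1..T4] -> body torques.
# l, K_r, and the inertias below are illustrative placeholders.
l, K_r = 0.3, 0.02
I_x = I_y = 0.01
I_z = 0.02

MIX = np.array([
    [0.0,  l,    0.0, -l  ],   # roll torque:  l * (T2 - T4)
    [l,    0.0, -l,    0.0],   # pitch torque: l * (T1 - T3)
    [K_r, -K_r,  K_r, -K_r],   # yaw torque: counter-rotating pairs
])

def euler_accels(T):
    """Angular accelerations [phi'', theta'', psi''] from the four thrusts."""
    tau = MIX @ np.asarray(T, dtype=float)
    return tau / np.array([I_x, I_y, I_z])
```

With equal thrusts the mixer output is zero; raising rotor 2 while lowering rotor 4 by the same amount yields a pure roll torque, which is the differential property that shields attitude control from the absolute thrust disturbances afflicting the altitude loop.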

IV. ESTIMATION AND CONTROL DESIGN

Applying the concept of spectral separation, inner loop control of attitude and altitude is performed by commanding motor voltages, and outer loop position control is performed by commanding attitude requests for the inner loop. Accurate attitude control of the plant in Equation (7) is achieved with an integral LQR controller designed to account for thrust biases. Position estimation is performed using a navigation filter that combines horizontal position and velocity information from GPS, vertical position and estimated velocity information from the ultrasonic ranger, and acceleration and angular rates from the IMU in a Kalman filter that includes bias estimates. Integral LQR techniques are applied to the horizontal components of the linear position plant described in Equation (6). The resulting hover performance is shown in Figure 6.

As described above, altitude control suffers exceedingly from unmodeled dynamics. In fact, manual command of the throttle for altitude control remains a challenge for the authors to this day. Additional complications arise from the ultrasonic ranging sensor, which has frequent erroneous readings, as seen in Figure 3. To alleviate the effect of this noise, rejection of infeasible measurements is used to remove much of the non-Gaussian noise component. This is followed by altitude and altitude rate estimation by Kalman filtering, which adds lag to the estimate. This section proceeds with a derivation of two control techniques that can be used to overcome the unmodeled dynamics and the remaining noise.


Fig. 3. Characteristic unprocessed ultrasonic ranging data, displaying spikes, false echoes and dropouts. Powered flight commences at 185 seconds.

A. Integral Sliding Mode Control

A linear approximation to the altitude error dynamics of a quadrotor aircraft in hover is given by,

$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = u + \xi(g, x) \tag{8}$$

where $\{x_1, x_2\} = \{(r_{z,des} - r_z), (\dot{r}_{z,des} - \dot{r}_z)\}$ are the altitude error states, $u = \sum_{i=1}^{4} u_i$ is the control input, and $\xi(\cdot)$ is a bounded model of disturbances and dynamic uncertainty. It is assumed that $\xi(\cdot)$ satisfies $\|\xi\| \leq \gamma$, where $\gamma$ is the upper bound on the norm of $\xi(\cdot)$.

In early attempts to stabilize this system, it was observed that LQR control was not able to address the instability and performance degradation due to $\xi(g, x)$. Sliding Mode Control (SMC) was adapted to provide a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision and disturbances. However, these robustness properties of SMC are not assured until the system dynamics reach the sliding manifold. In order to provide robust control throughout the flight envelope, the Integral Sliding Mode (ISM) technique is applied.

The ISM control is designed in two parts. First, a standard successive loop closure is applied to the linear plant. Second, integral sliding mode techniques are applied to guarantee disturbance rejection. Let

$$u = u_p + u_d, \qquad u_p = -K_p x_1 - K_d x_2 \tag{9}$$

where $K_p$ and $K_d$ are proportional and derivative loop gains that stabilize the linear dynamics without disturbances. For disturbance rejection, a sliding surface, $s$, is designed,

$$s = s_0(x_1, x_2) + z, \qquad s_0 = \alpha(x_2 + k x_1) \tag{10}$$

such that state trajectories are forced towards the manifold $s = 0$. Here, $s_0$ is a conventional sliding mode design, $z$ is an additional term that enables integral control to be included, and $\alpha, k \in \mathbb{R}$ are positive constants. Based on the Lyapunov function candidate $V = \frac{1}{2} s^2$, the control component, $u_d$, can be determined such that $\dot{V} < 0$, guaranteeing convergence to the sliding manifold.

$$\dot{V} = s \dot{s} = s \left( \alpha(\dot{x}_2 + k \dot{x}_1) + \dot{z} \right) = s \left( \alpha(u_p + u_d + \xi(g, x) + k x_2) + \dot{z} \right) < 0 \tag{11}$$

The above condition holds if $\dot{z} = -\alpha(u_p + k x_2)$ and $u_d$ can be guaranteed to satisfy,

$$s \left( u_d + \xi(g, x) \right) < 0, \qquad \alpha > 0 \tag{12}$$

Since the disturbances, $\xi(g, x)$, are bounded by $\gamma$, define $u_d$ to be $u_d = -\lambda s$ with $\lambda \in \mathbb{R}$. Equation (11) becomes,

$$\dot{V} = s \, \alpha \left( -\lambda s + \xi(g, x) \right) \leq \alpha \left( -\lambda |s|^2 + \gamma |s| \right) < 0 \tag{13}$$

from which it can be seen that the condition requires $\lambda |s| - \gamma > 0$. As a result, for $u_p$ and $u_d$ as above, the sliding mode condition holds when,

$$|s| > \frac{\gamma}{\lambda} \tag{14}$$

With the input derived above, the dynamics are guaranteed to evolve such that $s$ decays to within the boundary layer, $\gamma / \lambda$, of the sliding manifold. Additionally, the system does not suffer from the input chatter of conventional sliding mode controllers, as the control law does not include a switching function along the sliding mode.
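The complete ISM law, combining Equations (9) and (10) with the choices $\dot{z} = -\alpha(u_p + k x_2)$ and $u_d = -\lambda s$, collapses into a few lines. The gains below are arbitrary illustrative values, not the flight-tested ones, and a constant disturbance stands in for $\xi(g, x)$.

```python
def ism_step(x1, x2, z, kp=2.0, kd=1.5, alpha=1.0, k=1.0, lam=4.0):
    """One evaluation of the ISM altitude law u = u_p + u_d.

    x1, x2 are the altitude error states of Eq. (8); z is the integral
    term of the sliding surface. All gains are illustrative.
    """
    u_p = -kp * x1 - kd * x2          # linear successive-loop term, Eq. (9)
    s = alpha * (x2 + k * x1) + z     # sliding surface, Eq. (10)
    u_d = -lam * s                    # disturbance-rejection term
    z_dot = -alpha * (u_p + k * x2)   # integral dynamics enforcing V_dot < 0
    return u_p + u_d, z_dot
```

Simulating the error dynamics of Equation (8) with a constant disturbance inside the bound $\gamma$, the error states converge to zero while $s$ settles inside the boundary layer $\gamma / \lambda$, without chatter.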

V. REINFORCEMENT LEARNING CONTROL

An alternate approach is to implement a reinforcement learning controller. Much work has been done on continuous state-action space reinforcement learning methods [13], [14]. For this work, a nonlinear, nonparametric model of the system is first constructed using flight data, approximating the system as a stochastic Markov process [15], [16]. Then a model-based reinforcement learning algorithm uses the model in policy iteration to search for an optimal control policy that can be implemented on the embedded microprocessors.

In order to model the aircraft dynamics as a stochastic Markov process, a Locally Weighted Linear Regression (LWLR) approach is used to map the current state, $S(t) \in \mathbb{R}^{n_s}$, and input, $u(t) \in \mathbb{R}^{n_u}$, onto the subsequent state estimate, $\hat{S}(t+1)$. In this application, $S = [\, r_z \;\; \dot{r}_z \;\; \ddot{r}_z \;\; V \,]$, where $V$ is the battery level. In the altitude loop, the input, $u \in \mathbb{R}$, is the total motor power. The subsequent state mapping is the summation of the traditional LWLR estimate, using the current state and input, with the random vector, $v \in \mathbb{R}^{n_s}$, representing unmodeled noise. The value for $v$ is drawn from the distribution of output error as determined by a maximum likelihood estimate [16] of the Gaussian noise in the LWLR estimate. Although the true distribution is not perfectly Gaussian, this model is found to be adequate.

The LWLR method [17] is well suited to this problem, as it fits a non-parametric curve to the local structure of the data. The scheme extends least squares by assigning weights to each training data point according to its proximity to the input value for which the output is to be computed. The technique requires a sizable set of training data in order to reflect the full dynamics of the system, which is captured from flights flown under both automatic and manually controlled thrust, with the attitude states under automatic control.

For $m$ training data points, the input training samples are stored in $X \in \mathbb{R}^{m \times (n_s + n_u + 1)}$, and the outputs corresponding to those inputs are stored in $Y \in \mathbb{R}^{m \times n_s}$. These matrices are defined as

$$X = \begin{bmatrix} 1 & S(t_1)^T & u(t_1)^T \\ \vdots & \vdots & \vdots \\ 1 & S(t_m)^T & u(t_m)^T \end{bmatrix}, \qquad Y = \begin{bmatrix} S(t_1 + 1)^T \\ \vdots \\ S(t_m + 1)^T \end{bmatrix} \tag{15}$$

The column of ones in $X$ enables the inclusion of a constant offset in the solution, as used in linear regression.

The diagonal weighting matrix $W \in \mathbb{R}^{m \times m}$, which acts on $X$, has one diagonal entry for each training data point. That entry gives more weight to training data points that are close to the $S(t)$ and $u(t)$ for which $\hat{S}(t+1)$ is to be computed. The distance measure used in this work is

$$W_{i,i} = \exp \left( \frac{-\| x^{(i)} - x \|^2}{2 \tau^2} \right) \tag{16}$$

where $x^{(i)}$ is the $i$th row of $X$, $x$ is the vector $[\, 1 \;\; S(t)^T \;\; u(t)^T \,]$, and the fit parameter $\tau$ is used to adjust the range of influence of training points. The value for $\tau$ can be tuned by cross validation to prevent over- or under-fitting the data. Note that it may be necessary to scale the columns before taking the Euclidean norm to prevent undue influence of one state on the $W$ matrix.

The subsequent state estimate is computed by summing the LWLR estimate with $v$,

$$\hat{S}(t+1) = \left( \left( X^T W X \right)^{-1} X^T W Y \right)^T x + v \tag{17}$$

Because $W$ is a continuous function of $x$ and $X$, as $x$ is varied, the resulting estimate is a continuous non-parametric curve capturing the local structure of the data. The matrix computations, in code, exploit the large diagonal matrix $W$; as each $W_{i,i}$ is computed, it is multiplied by row $x^{(i)}$, and stored in $WX$.

The matrix being inverted is poorly conditioned, because weakly related data points have little influence, so their contribution cannot be accurately numerically inverted. To compute the inversion more accurately, one can perform a singular value decomposition, $(X^T W X) = U \Sigma V^T$. Numerical error during inversion can then be avoided by using only the $n$ singular values $\sigma_i$ satisfying $\frac{\sigma_{max}}{\sigma_i} < C_{max}$, where the value of $C_{max}$ is chosen by cross validation. In this work, $C_{max} \approx 10$ was found to minimize numerical error, and was typically satisfied by $n = 1$. The inverse can be directly computed using the $n$ upper singular values in the diagonal matrix $\Sigma_n \in \mathbb{R}^{n \times n}$, and the corresponding singular vectors, in $U_n \in \mathbb{R}^{(n_s + n_u + 1) \times n}$ and $V_n \in \mathbb{R}^{(n_s + n_u + 1) \times n}$. Thus, the stochastic Markov model becomes

$$\hat{S}(t+1) = \left( V_n \Sigma_n^{-1} U_n^T X^T W Y \right)^T x + v \tag{18}$$
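The LWLR prediction with truncated-SVD inversion can be sketched compactly. The structure follows Equations (15)–(18); the values of `tau` and `c_max` are illustrative, and the deterministic part of the estimate is returned (the noise vector $v$ would be added separately).

```python
import numpy as np

def lwlr_predict(X, Y, x, tau=0.5, c_max=10.0):
    """Locally weighted prediction with truncated-SVD inversion.

    X : (m, d) rows [1, S(t)^T, u(t)^T];  Y : (m, n_s) subsequent states;
    x : query vector [1, S(t)^T, u(t)^T]. tau and c_max mirror the fit and
    conditioning parameters in the text; their defaults are illustrative.
    """
    d2 = np.sum((X - x) ** 2, axis=1)        # squared distances to the query
    w = np.exp(-d2 / (2.0 * tau ** 2))       # Gaussian weights, Eq. (16)
    WX = X * w[:, None]                      # exploit diagonal W: scale rows
    A = X.T @ WX                             # X^T W X
    U, sig, Vt = np.linalg.svd(A)
    keep = sig > sig[0] / c_max              # keep sigma_max/sigma_i < C_max
    A_inv = (Vt[keep].T / sig[keep]) @ U[:, keep].T
    theta = A_inv @ X.T @ (Y * w[:, None])   # (X^T W X)^+ X^T W Y
    return theta.T @ x                       # deterministic part of S_hat(t+1)
```

For data generated by an exactly linear map, the weighted fit reproduces that map regardless of the weights, which makes the routine easy to check.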

Next, model-based reinforcement learning is implemented, incorporating the stochastic Markov model, to design a controller. A quadratic reward function is used,

$$R(S, S_{ref}) = -c_1 (r_z - r_{z,ref})^2 - c_2 \dot{r}_z^2 \tag{19}$$

where $R : \mathbb{R}^{2 n_s} \to \mathbb{R}$, $c_1 > 0$ and $c_2 > 0$ are constants giving reward for accurate tracking and good damping, respectively, and $S_{ref} = [\, r_{z,ref} \;\; \dot{r}_{z,ref} \;\; \ddot{r}_{z,ref} \;\; V_{ref} \,]$ is the reference state desired for the system.

The control policy maps the observed state $S$ onto the input command $u$. In this work, the state space has the constraint $r_z \geq 0$, and the input command has the constraint $0 \leq u \leq u_{max}$. The control policy is chosen to be

$$\pi(S, w) = w_1 + w_2 (r_z - r_{z,ref}) + w_3 \dot{r}_z + w_4 \ddot{r}_z \tag{20}$$


where $w \in \mathbb{R}^{n_c}$ is the vector of policy coefficients $w_1, \ldots, w_{n_c}$. Linear functions were sufficient to achieve good stability and performance. Additional terms, such as battery level and the integral of altitude error, could be included to make the policy more resilient to differing flight conditions.
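The reward of Equation (19) and the saturated linear policy of Equation (20) are direct to express in code; the constants `C1`, `C2`, and `U_MAX` below are assumed for illustration only, and the state ordering follows $S = [\, r_z \;\; \dot{r}_z \;\; \ddot{r}_z \;\; V \,]$.

```python
# Reward (Eq. 19) and linear control policy (Eq. 20).
# C1, C2, and U_MAX are illustrative values, not the tuned constants.
C1, C2, U_MAX = 1.0, 0.1, 100.0

def reward(S, S_ref):
    """Quadratic reward penalizing tracking error and vertical rate."""
    return -C1 * (S[0] - S_ref[0]) ** 2 - C2 * S[1] ** 2

def policy(S, w, r_z_ref):
    """Linear policy pi(S, w), saturated to the input constraint."""
    u = w[0] + w[1] * (S[0] - r_z_ref) + w[2] * S[1] + w[3] * S[2]
    return min(max(u, 0.0), U_MAX)   # enforce 0 <= u <= u_max
```

The saturation mirrors the physical input constraint: motor power cannot be negative and is capped by the available battery voltage.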

Policy iteration is performed as explained in Algorithm 1. The algorithm aims to find the value of $w$ that yields the greatest total reward $R_{total}$, as determined by simulating the system over a finite horizon from a set of random initial conditions, and summing the values of $R(S, S_{ref})$ at each state encountered.

Algorithm 1 Model-Based Reinforcement Learning

1: Generate set S_0 of random initial states
2: Generate set T of random reference trajectories
3: Initialize w to reasonable values
4: R_best ← −∞, w_best ← w
5: repeat
6:   R_total ← 0
7:   for s_0 ∈ S_0, t ∈ T do
8:     S(0) ← s_0
9:     for t = 0 to t_max − 1 do
10:      u(t) ← π(S(t), w)
11:      S(t + 1) ← LWLR(S(t), u(t)) + v
12:      R_total ← R_total + R(S(t + 1))
13:    end for
14:  end for
15:  if R_total > R_best then
16:    R_best ← R_total, w_best ← w
17:  end if
18:  Add Gaussian random vector to w_best, store as w
19: until w_best converges

In policy iteration, a fixed set of random initial conditions and reference trajectories is used to simulate flights at each iteration, with a given policy parameterized by $w$. It is necessary to use the same random set at each iteration in order for convergence to be possible [15]. After each iteration, the new value of $w$ is stored as $w_{best}$ if it outperforms the previous best policy, as determined by comparing $R_{total}$ to $R_{best}$, the previous best reward encountered. Then, a Gaussian random vector is added to $w_{best}$. The result is stored as $w$, and the simulation is performed again. This is iterated until the value of $w_{best}$ remains fixed for an appropriate number of iterations, as determined by the particular application. The simulation results must be examined to predict the likely performance of the resulting control policy.

By using a Gaussian update rule for the policy weights, $w$, it is possible to escape local maxima of $R_{total}$. The highest probability steps are small, and result in refinement of a solution near a local maximum of $R_{total}$. However, if the algorithm is not at the global maximum, and is allowed to continue, there exists a finite probability that a sufficiently large Gaussian step will be taken such that the algorithm can keep ascending.
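The accept-if-better Gaussian search at the heart of Algorithm 1 can be sketched with the rollout abstracted into a user-supplied simulator; the function name, step size, and iteration count below are illustrative choices, not the paper's implementation.

```python
import numpy as np

def policy_search(simulate, w0, n_iter=200, step=0.1, seed=0):
    """Gaussian-perturbation policy search in the style of Algorithm 1.

    `simulate(w)` must roll out the learned model under the policy with
    weights w from the FIXED set of initial states and reference
    trajectories, returning the summed reward; holding that set fixed
    across iterations is what makes convergence possible [15].
    """
    rng = np.random.default_rng(seed)
    w_best = np.asarray(w0, dtype=float)
    r_best = -np.inf
    w = w_best.copy()
    for _ in range(n_iter):
        r_total = simulate(w)
        if r_total > r_best:                  # keep strictly better policies
            r_best, w_best = r_total, w.copy()
        w = w_best + step * rng.normal(size=w_best.shape)  # Gaussian perturb
    return w_best, r_best
```

Because the perturbation is Gaussian, small refining steps are most likely, but arbitrarily large escapes from a local maximum retain nonzero probability, matching the discussion above.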

VI. FLIGHT TEST RESULTS

A. Integral Sliding Mode

The results of an outdoor flight test with ISM control can be seen in Figure 4. The response time is on the order of 1–2 seconds, with a 5 second settling time and little to no steady state offset. An oscillatory character can also be seen in the response, most likely triggered by the nonlinear aerodynamic effects and sensor data spikes described earlier.


Fig. 4. Integral sliding mode step response in outdoor ﬂight test.

Compared to linear control design techniques implemented on the aircraft, the ISM control proves a significant enhancement. By explicitly incorporating bounds on the unknown disturbance forces in the derivation of the control law, it is possible to maintain stable altitude on a system that has evaded standard approaches.

B. Reinforcement Learning Control

One of the most exciting aspects of RL control design is its ease of implementation. The policy iteration algorithm arrived at the implemented control law after only 3 hours on a Pentium IV computer. Figure 5 presents flight test results for the controller. The high fidelity model of the system, used for RL control design, provides a useful tool for comparing the RL control law with other controllers. In fact, in simulation with linear controllers that proved unstable on the quadrotor, flight paths with growing oscillations were predicted that closely matched real flight data.

The locally weighted linear regression model revealed many relations that were not reflected in the linear model, but that reflect the physics of the system well. For instance, with all other states held fixed, an upward velocity results in more acceleration at the subsequent time step for a given throttle level, and a downward velocity yields the opposite effect. This is essentially negative damping. The model also shows a strong ground effect: with all other states held fixed, the closer the vehicle is to the ground, the more acceleration it will have at the subsequent time step for a given throttle level.



Fig. 5. Reinforcement learning controller response to manually applied step input, in outdoor flight test. Spikes in state estimates are from sensor noise passing through the Kalman filter.

The reinforcement learning control law is susceptible to system disturbances for which it is not trained. In particular, varying battery levels and blade degradation may cause a reduction in stability or a steady state offset. Addition of an integral error term to the control policy may prove an effective means of mitigating steady state disturbances, as was seen in the ISM control law.

Comparison of the step responses for ISM and RL control reveals both stable performance and similar response times, although the transient dynamics of the ISM control are more pronounced. RL does, however, have the advantage that it incorporates accelerometer measurements into its control, and as such uses a more direct measurement of the disturbances imposed on the aircraft.

C. Autonomous Hover

Applying ISM altitude control and integral LQR position control techniques, flight tests were performed to achieve the goal of autonomous hover. Position was maintained within a 3m circle for the duration of a two minute flight (see Figure 6), which is well within the expected error bound of the L1 band differential GPS used.


Fig. 6. Autonomous hover ﬂight recorded position, with 3m error circle.

VII. CONCLUSION

This paper summarizes the development of an autonomous

quadrotor capable of extended outdoor trajectory tracking

control. This is the ﬁrst demonstration of such capabilities

on a quadrotor known to the authors, and represents a

critical step in developing a novel, easy to use, multi-vehicle

testbed for validation of multi-agent control strategies for au-

tonomous aerial robots. Speciﬁcally, two design approaches

were presented for the altitude control loop, which proved

a challenging hurdle. Both techniques resulted in stable

controllers with similar response times, and were a signiﬁcant

improvement over linear controllers that failed to stabilize the

system adequately.

Acknowledgments

The authors would like to thank Dev Gorur Rajnarayan and David Dostal for their many contributions to STARMAC development and testing, as well as Prof. Andrew Ng of Stanford University for his advice and guidance in developing the Reinforcement Learning control.

REFERENCES

[1] Hoffmann, G., Rajnarayan, D. G., Waslander, S. L., Dostal, D., Jang, J. S., and Tomlin, C. J., "The Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC)," 23rd Digital Avionics Systems Conference, Salt Lake City, UT, November 2004.
[2] Lambermont, P., Helicopters and Autogyros of the World, 1958.
[3] DraganFly-Innovations, www.rctoys.com, 2003.
[4] Pounds, P., Mahony, R., Hynes, P., and Roberts, J., "Design of a Four-Rotor Aerial Robot," Australasian Conference on Robotics and Automation, Auckland, November 2002.
[5] Altug, E., Ostrowski, J. P., and Taylor, C. J., "Quadrotor Control Using Dual Camera Visual Feedback," ICRA, Taipei, September 2003.
[6] Bouabdallah, S., Murrieri, P., and Siegwart, R., "Design and Control of an Indoor Micro Quadrotor," ICRA, New Orleans, April 2004.
[7] Castillo, P., Dzul, A., and Lozano, R., "Real-Time Stabilization and Tracking of a Four-Rotor Mini Rotorcraft," IEEE Transactions on Control Systems Technology, Vol. 12, No. 4, 2004.
[8] Bramwell, A., Done, G., and Balmford, D., Bramwell's Helicopter Dynamics, Butterworth-Heinemann, 2nd ed., 2001.
[9] Devantech, SRF08 Ultrasonic Ranger, http://www.robot-electronics.co.uk/htm/srf08tech.shtml.
[10] Utkin, V., Guldner, J., and Shi, J., Sliding Mode Control in Electromechanical Systems, Taylor & Francis, 1999.
[11] Khalil, H. K., Nonlinear Systems, Prentice Hall, 1996.
[12] Jang, J. S., Nonlinear Control Using Discrete-Time Dynamic Inversion Under Input Saturation: Theory and Experiment on the Stanford DragonFly UAVs, Ph.D. thesis, Stanford University, 2004.
[13] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[14] Doya, K., Samejima, K., Katagiri, K., and Kawato, M., "Multiple Model-based Reinforcement Learning," Kawato Dynamic Brain Project Technical Report KDB-TR-08, Japan Science and Technology Corporation, June 2000.
[15] Ng, A. Y. and Jordan, M. I., "PEGASUS: A policy search method for large MDPs and POMDPs," Uncertainty in Artificial Intelligence, 2000.
[16] Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E., "Autonomous inverted helicopter flight via reinforcement learning," International Symposium on Experimental Robotics, 2004.
[17] Atkeson, C. G., Moore, A. W., and Schaal, S., "Locally Weighted Learning," Artificial Intelligence Review, Vol. 11, No. 1-5, 1997, pp. 11–73.


Integral sliding mode control yields a control law that is guaranteed to be robust to disturbances as long as they do not exceed a certain magnitude. Model-based reinforcement learning [13] instead creates a dynamic model based on recorded inputs and responses, without any knowledge of the underlying dynamics, and then seeks an optimal control law using an optimization technique based on the learned model. This paper presents an exposition of both methods and contrasts the techniques from both a design and an implementation point of view.

II. QUADROTOR DYNAMICS

The derivation of the nonlinear dynamics is performed in North-East-Down (NED) inertial and body fixed coordinates. Let {eN, eE, eD} denote the inertial axes and {xB, yB, zB} denote the body axes. Euler angles of the body axes are {φ, θ, ψ} with respect to the eN, eE and eD axes, respectively, and are referred to as roll, pitch and yaw. Let r be defined as the position vector from the inertial origin to the vehicle center of gravity (CG), let ωB be defined as the angular velocity in the body frame, let m be the vehicle mass, g the acceleration due to gravity, and I ∈ R^(3×3) the inertia matrix. A free body diagram is depicted in Figure 2.

Fig. 2. Free body diagram of a quadrotor aircraft.

The rotors, numbered 1−4, are mounted outboard on the xB, yB, −xB and −yB axes, respectively, with position vectors ri with respect to the CG. Rotors 1 and 3 rotate in the opposite direction from rotors 2 and 4, so that the counteracting aerodynamic torques can be used independently for yaw control. Each rotor produces an aerodynamic torque, Qi, and a thrust, Ti, both parallel to the rotor's axis of rotation and both used for vehicle control. The torques, Qi, are proportional to the rotor thrusts, Qi = Kr Ti. Near hover, the thrust response to the voltage ui applied to motor i is well approximated by Ti ≈ kt ui / (1 + 0.1s), as determined from a load cell test; in flight, Ti can vary greatly from this approximation, due to the complex airflow induced by the four interacting rotors and proximity to the ground. The current velocity direction in inertial coordinates is referred to as ev. Horizontal velocity results in a rolling moment on each rotor, Ri, and a drag force, Di, in the direction −ev; the body drag force is defined as DB.

The total force, F, and moment, M, can be summed as,

F = −DB ev + mg eD + Σ_{i=1..4} (−Ti zB − Di ev)        (1)

M = Σ_{i=1..4} ( Qi zB − Ri ev − Di (ri × ev) + Ti (ri × zB) )        (2)

The full nonlinear dynamics can be described as,

m r̈ = F
I ω̇B + ωB × I ωB = M        (3)

where the total angular momentum of the rotors is assumed to be near zero, because they are counter-rotating.

The translational motion is defined by,

m r̈ = −Rψ Rθ Rφ T zB + mg eD        (4)

where Rφ, Rθ and Rψ are the rotation matrices for roll, pitch and yaw, respectively, and the total thrust is defined as T = Σ_{i=1..4} Ti. Near hover conditions, the contributions of the rolling moments and drag can be neglected in Equations (1) and (2), except in the eD axis. Applying the small angle approximation to the rotation matrices,

m [r̈x; r̈y; r̈z] = [1, −ψ, θ; ψ, 1, −φ; −θ, φ, 1] [0; 0; −T] + [0; 0; mg]        (5)

Assuming total thrust approximately counteracts gravity, T ≈ T̄ = mg, except in the eD axis, Equation (5) reduces to,

m [r̈x; r̈y; r̈z] = [0, −T̄, 0; T̄, 0, 0; 0, 0, −1] [φ; θ; T] + [0; 0; mg]        (6)

Finally, for small angular velocities, the Euler angle accelerations are determined from Equation (3) by dropping the second order term, ω × Iω, and expanding the thrust into its four constituents,

[Ix φ̈; Iy θ̈; Iz ψ̈] = [0, l, 0, −l; l, 0, −l, 0; Kr, −Kr, Kr, −Kr] [T1; T2; T3; T4]        (7)

where the moment arm length l = ||ri × zB|| is identical for all rotors due to symmetry.

III. SYSTEM DESCRIPTION

STARMAC consists of a fleet of quadrotors and a ground station. The aircraft are capable of sensing position, attitude and proximity to the ground. The core of each aircraft is a set of microcontroller circuit boards designed and assembled at Stanford; the microcontrollers run real-time control code, interface with the sensors and the ground station, and supervise the system. The IMU is the MicroStrain 3DM-G, a low cost, light weight unit that delivers attitude, attitude rate and acceleration readings at 76 Hz. The GPS receiver is the Trimble Lassen LP, operating on the L1 band and providing 1 Hz updates. The distance from the ground is found using ultrasonic ranging at 12 Hz. The system communicates over a Bluetooth Class 1 network. The ground station consists of a laptop computer to interface with the aircraft, joysticks for control-augmented manual flight when desired, a GPS receiver to provide differential corrections, and a battery charger.
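The mixing in Equation (7) maps the four rotor thrusts onto the three Euler angle accelerations. A minimal numerical sketch follows; the arm length, inertias, and torque ratio are placeholder values for illustration, not the STARMAC parameters:

```python
import numpy as np

# Placeholder physical parameters (illustrative, not the STARMAC values).
l, Kr = 0.3, 0.02              # moment arm [m], torque-to-thrust ratio
Ix, Iy, Iz = 0.01, 0.01, 0.02  # principal inertias [kg m^2]

# Thrust-to-moment mixing matrix from Equation (7).
B = np.array([[0.0,  l,   0.0, -l ],
              [l,    0.0, -l,  0.0],
              [Kr,  -Kr,  Kr,  -Kr]])

def euler_accel(T):
    """Euler angle accelerations [roll, pitch, yaw] for rotor thrusts T1..T4."""
    M = B @ T                          # roll, pitch, yaw moments
    return M / np.array([Ix, Iy, Iz])

# Equal thrusts produce no angular acceleration; raising rotor 2 relative to
# rotor 4 commands pure roll, with no pitch or yaw coupling.
print(euler_accel(np.array([1.0, 1.0, 1.0, 1.0])))   # -> [0. 0. 0.]
print(euler_accel(np.array([1.0, 1.2, 1.0, 0.8])))
```

Note how the counter-rotating rotor pairs cancel in the yaw row unless the thrusts are differenced diagonally, which is the mechanism the text describes for yaw control.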

IV. ESTIMATION AND CONTROL DESIGN

Applying the concept of spectral separation, inner loop control of attitude and altitude is performed by commanding motor voltages, and outer loop position control is performed by commanding attitude requests to the inner loop. Position estimation is performed using a navigation filter that combines horizontal position and velocity information from GPS, vertical position information from the ultrasonic ranger, and acceleration and angular rate readings from the IMU in a Kalman filter that includes bias estimates.

Additional complications arise from the ultrasonic ranging sensor, which has frequent erroneous readings, displaying spikes, false echoes and dropouts, as seen in Figure 3. To alleviate the effect of this noise, rejection of infeasible measurements is used to remove much of the non-Gaussian noise component. This is followed by altitude and altitude rate estimation by Kalman filtering, which adds lag to the estimate.

Fig. 3. Characteristic unprocessed ultrasonic ranging data, displaying spikes, false echoes and dropouts. Powered flight commences at 185 seconds.

The resulting linear models can now be used for control design, and a standard successive loop closure is applied to the linear plant. Integral LQR techniques are applied to the horizontal components of the linear position plant described in Equation (6), and accurate attitude control of the plant in Equation (7) is achieved with an integral LQR controller design to account for thrust biases. However, altitude control suffers exceedingly from unmodeled dynamics: in early attempts to stabilize this system, it was observed that LQR control was not able to address the instability and performance degradation due to the disturbances ξ(g, x). In fact, manual command of the throttle for altitude control remains a challenge for the authors to this day. This section proceeds with a derivation of two control techniques that can be used to overcome the unmodeled dynamics and the remaining noise.

A. Integral Sliding Mode Control

A linear approximation to the altitude error dynamics of a quadrotor aircraft in hover is given by,

ẋ1 = x2
ẋ2 = u + ξ(g, x)        (8)

where {x1, x2} = {(rz,des − rz), (ṙz,des − ṙz)} are the altitude error states, u = Σ_{i=1..4} ui is the control input, and ξ(·) is a bounded model of disturbances and dynamic uncertainty. It is assumed that ξ(·) satisfies ||ξ|| ≤ γ, where γ is the upper bound on the norm of ξ(·).

Sliding Mode Control (SMC) was adapted to provide a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision and disturbances. However, until the system dynamics reach the sliding manifold, such nice properties of SMC are not assured. In order to provide robust control throughout the flight envelope, the Integral Sliding Mode (ISM) technique is applied.

The ISM control is designed in two parts. First, let

u = up + ud,    up = −Kp x1 − Kd x2        (9)

where Kp and Kd are proportional and derivative loop gains that stabilize the linear dynamics without disturbances, and ud is the control component used for disturbance rejection. Second, a sliding surface, s, is designed,

s = s0(x1, x2) + z,    s0 = α(x2 + k x1)        (10)

such that state trajectories are forced towards the manifold s = 0. Here, s0 is a conventional sliding mode design, z is an additional term that enables integral control to be included, and α, k ∈ R are positive constants. Based on the Lyapunov function candidate V = (1/2)s², the sliding mode condition is,

V̇ = s ṡ = s( α(ẋ2 + k ẋ1) + ż ) = s( α(up + ud + ξ(g, x) + k x2) + ż ) < 0        (11)

The above condition holds if ż = −α(up + k x2) and ud can be guaranteed to satisfy,

s α( ud + ξ(g, x) ) < 0,    α > 0        (12)

For disturbance rejection, define ud to be ud = −λs with λ ∈ R. Since the disturbances, ξ(g, x), are bounded by γ, the control component, ud, can be determined such that V̇ < 0. For up and ud as above, Equation (11) becomes,

V̇ = s α( −λs + ξ(g, x) ) ≤ α( −λ|s|² + γ|s| ) < 0        (13)

and it can be seen that V̇ < 0 whenever λ|s| − γ > 0, so the sliding mode condition holds when,

|s| > γ/λ        (14)

With the input derived above, the dynamics are guaranteed to evolve such that s decays to within the boundary layer |s| ≤ γ/λ, guaranteeing convergence to a neighborhood of the sliding manifold.

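The derivation above can be exercised on the double-integrator error dynamics of Equation (8). The gains, the disturbance, and the time step below are illustrative values chosen for the sketch, not the flight gains; the check is that |s| decays into the γ/λ boundary layer of Equation (14) despite a disturbance bounded by γ:

```python
import math

# Illustrative gains and bounds (not the flight values).
Kp, Kd = 4.0, 2.8       # linear loop gains of Eq. (9)
alpha, k = 1.0, 1.0     # sliding surface parameters of Eq. (10)
lam, gamma = 20.0, 1.0  # u_d = -lam * s; disturbance bound ||xi|| <= gamma

x1, x2, z = 1.0, 0.0, 0.0   # altitude error, error rate, integral term
dt = 0.001
for i in range(20000):      # 20 s of simulated hover
    xi = gamma * math.sin(10.0 * i * dt)   # bounded disturbance in Eq. (8)
    up = -Kp * x1 - Kd * x2                # Eq. (9)
    s = alpha * (x2 + k * x1) + z          # Eq. (10)
    ud = -lam * s                          # disturbance rejection term
    zdot = -alpha * (up + k * x2)          # the choice that makes Eq. (11) hold
    x1 += x2 * dt                          # forward-Euler integration of Eq. (8)
    x2 += (up + ud + xi) * dt
    z += zdot * dt

s_final = abs(alpha * (x2 + k * x1) + z)
print(s_final <= gamma / lam)   # s has decayed into the boundary layer
```

With ż chosen as above, ṡ = α(−λs + ξ), so s behaves as a first-order lag driven by the bounded disturbance and settles below γ/λ, while the altitude error itself is regulated by the linear gains.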
In this application, the control law does not include a switching function along the sliding manifold, so the system does not suffer from the input chatter of conventional sliding mode controllers.

V. REINFORCEMENT LEARNING CONTROL

An alternate approach is to implement a reinforcement learning (RL) controller. Much work has been done on continuous state-action space reinforcement learning methods [13], [14]. In this work, model-based reinforcement learning is implemented. First, a nonlinear, nonparametric model of the system is constructed using flight data, approximating the system as a stochastic Markov process [15]. Then a model-based reinforcement learning algorithm uses the model in policy-iteration to search for an optimal control policy that can be implemented on the embedded microprocessors.

In order to model the aircraft dynamics as a stochastic Markov process, a Locally Weighted Linear Regression (LWLR) approach is used to map the current state, S(t) ∈ R^ns, and input, u(t) ∈ R^nu, onto the subsequent state estimate, Ŝ(t+1). In this application, S = [rz ṙz r̈z V]^T, where V is the battery level, and the input, u ∈ R, is the total motor power. The technique requires a sizable set of training data in order to reflect the full dynamics of the system; the data is captured from flights flown under both automatic and manually controlled thrust.

The LWLR method [17] is well suited to this problem, as it fits a non-parametric curve to the local structure of the data. The scheme extends least squares by assigning weights to each training data point according to its proximity to the input value, x, for which the output is to be computed. For m training data points, the input training samples are stored in X ∈ R^(m×(ns+nu+1)), and the outputs corresponding to those inputs are stored in Y ∈ R^(m×ns). These matrices are defined as,

X = [1 S(t1)^T u(t1)^T; … ; 1 S(tm)^T u(tm)^T],    Y = [S(t1+1)^T; … ; S(tm+1)^T]        (15)

The column of ones in X enables the inclusion of a constant offset in the solution. The diagonal weighting matrix W ∈ R^(m×m), as used in weighted linear regression, has one diagonal entry for each training data point; that entry gives more weight to training data points that are close to the S(t) and u(t) for which Ŝ(t+1) is to be computed. The distance measure used in this work is,

Wi,i = exp( −||x(i) − x||² / (2τ²) )        (16)

where x(i) is the ith row of X, x is the vector [1 S(t)^T u(t)^T]^T for which the output is to be computed, and the fit parameter τ adjusts the range of influence of training points. The value for τ can be tuned by cross validation to prevent over- or under-fitting the data. Note that it may be necessary to scale the columns before taking the Euclidean norm, to prevent undue influence of one state on the W matrix.

The subsequent state estimate is the summation of the traditional LWLR estimate with a random vector, v ∈ R^ns, representing unmodeled noise,

Ŝ(t+1) = ( (X^T W X)^(-1) X^T W Y )^T x + v        (17)

The value for v is drawn from the distribution of output error as determined by a maximum likelihood estimate [16] of the Gaussian noise in the LWLR estimate. Although the true distribution is not perfectly Gaussian, this model is found to be adequate. Because W is a continuous function of x, as x is varied the resulting estimate is a continuous non-parametric curve capturing the local structure of the data, because weakly related data points have little influence.

The matrix computations, in code, exploit the large diagonal matrix W: as each Wi,i is computed, it is multiplied by row x(i) and stored directly in WX. The matrix being inverted is poorly conditioned, so its small singular values cannot be accurately numerically inverted. To more accurately compute the numerical inversion, one can perform a singular value decomposition, X^T W X = U Σ V^T. Numerical error during inversion can then be avoided by keeping only the n singular values σi satisfying σmax/σi < Cmax, where the value of Cmax is chosen by cross validation; Cmax ≈ 10 was found to minimize numerical error, and was typically satisfied by n = 1. The inverse is computed using the n upper singular values in the diagonal matrix Σn ∈ R^(n×n) and the corresponding singular vectors Un and Vn. With this substitution, the stochastic Markov model becomes,

Ŝ(t+1) = ( Vn Σn^(-1) Un^T X^T W Y )^T x + v        (18)

Next, a control policy, incorporating the stochastic Markov model, is designed. The control policy maps the observed state S onto the input command u. A quadratic reward function is used,

R(S, Sref) = −c1 (rz − rz,ref)² − c2 ṙz²        (19)

where R: R^(2ns) → R, c1 > 0 and c2 > 0 are constants giving reward for accurate tracking and good damping, respectively, and Sref = [rz,ref ṙz,ref r̈z,ref Vref]^T is the reference state desired for the system. Additionally, the state space has the constraint rz ≥ 0, and the input command has the constraint 0 ≤ u ≤ umax. The control policy is chosen to be,

π(S, w) = w1 + w2 (rz − rz,ref) + w3 ṙz + w4 r̈z        (20)
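The LWLR predictor of Equations (15)–(18) can be sketched compactly on synthetic one-dimensional data standing in for flight data. The bandwidth τ and truncation threshold below are illustrative for this toy problem (the flight implementation used Cmax ≈ 10 on its own data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: scalar "state" s with a mildly nonlinear response y.
m = 200
s = rng.uniform(-1.0, 1.0, m)
y = np.sin(2.0 * s) + 0.01 * rng.standard_normal(m)
X = np.column_stack([np.ones(m), s])   # leading column of ones: constant offset, Eq. (15)
Y = y[:, None]

def lwlr_predict(s_query, tau=0.1, C_max=1e3):
    """Deterministic part of Eqs. (17)-(18): truncated-SVD LWLR prediction."""
    x = np.array([1.0, s_query])
    # Diagonal weights of Eq. (16), kept as a vector; W X is formed row by row.
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * tau ** 2))
    WX = w[:, None] * X
    A = X.T @ WX                            # X^T W X (poorly conditioned in general)
    U, sig, Vt = np.linalg.svd(A)
    n = int(np.sum(sig[0] / sig < C_max))   # keep sigma_i with sigma_max/sigma_i < C_max
    A_inv = Vt[:n].T @ np.diag(1.0 / sig[:n]) @ U[:, :n].T
    theta = A_inv @ (WX.T @ Y)              # (X^T W X)^-1 X^T W Y
    return float(x @ theta[:, 0])

print(lwlr_predict(0.3))   # close to sin(0.6)
```

Sampling the stochastic term v of Equation (17) would simply add Gaussian noise, with covariance fitted to the residuals, on top of this deterministic estimate.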

where w ∈ R^nc is the vector of policy coefficients w1, ..., wnc. Linear functions were sufficient to achieve good stability and performance. Additional terms, such as battery level and integral of altitude error, could be included to make the policy more resilient to differing flight conditions.

In policy iteration, a fixed set of random initial conditions and reference trajectories is used to simulate flights at each iteration, with a given policy parameterized by w, summing the values of R(S, Sref) at each state encountered. It is necessary to use the same random set at each iteration in order for convergence to be possible [15]. The algorithm aims to find the value of w that yields the greatest total reward, Rtotal, as determined by simulating the system over a finite horizon from the set of random initial conditions. After each iteration, the new value of w is stored as wbest if it outperforms the previous best policy, as determined by comparing Rtotal to Rbest, the previous best reward encountered. Then, a Gaussian random vector is added to wbest, the result is stored as w, and the simulation is performed again. This is iterated until the value of wbest remains fixed for an appropriate number of iterations, as determined by the particular application. Policy iteration is performed as explained in Algorithm 1.

Algorithm 1 Model-Based Reinforcement Learning
1:  Generate set S0 of random initial states
2:  Generate set T of random reference trajectories
3:  Initialize w to reasonable values
4:  Rbest ← −∞, wbest ← w
5:  repeat
6:    Rtotal ← 0
7:    for s0 ∈ S0, t ∈ T do
8:      S(0) ← s0
9:      for t = 0 to tmax − 1 do
10:       u(t) ← π(S(t), w)
11:       S(t+1) ← LWLR(S(t), u(t)) + v
12:       Rtotal ← Rtotal + R(S(t+1))
13:     end for
14:   end for
15:   if Rtotal > Rbest then
16:     Rbest ← Rtotal, wbest ← w
17:   end if
18:   Add Gaussian random vector to wbest, store as w
19: until wbest converges

By using a Gaussian update rule for the policy weights, it is possible to escape local maxima of Rtotal. The highest probability steps are small, and result in refinement of a solution near a local maximum of Rtotal. However, if the algorithm is not at the global maximum, there exists a finite probability that a sufficiently large Gaussian step will be performed such that the algorithm can keep ascending.

VI. FLIGHT TEST RESULTS

A. Integral Sliding Mode

The results of an outdoor flight test with ISM control can be seen in Figure 4. Compared to linear control design techniques implemented on the aircraft, the ISM control proves a significant enhancement. The response time is on the order of 1-2 seconds, with 5 seconds settling time, and little to no steady state offset. However, an oscillatory character can be seen in the response, which is most likely being triggered by the nonlinear aerodynamic effects and sensor data spikes described earlier. By explicitly incorporating bounds on the unknown disturbance forces in the derivation of the control law, it is possible to maintain stable altitude on a system that has evaded standard approaches.

Fig. 4. Integral sliding mode step response in outdoor flight test.

B. Reinforcement Learning Control

One of the most exciting aspects of RL control design is its ease of implementation. The policy iteration algorithm arrived at the implemented control law after only 3 hours on a Pentium IV computer. The locally weighted linear regression model revealed many relations that were not reflected in the linear model, but that reflect the physics of the system well. For instance, the closer the vehicle is to the ground, the more acceleration it will have at the subsequent time step for a given throttle level; that is, the model shows a strong ground effect. Also, an upward velocity results in more acceleration at the subsequent time step for a given throttle level, and a downward velocity yields the opposite effect. This is essentially negative damping.

The high fidelity model of the system, used for RL control design, also provides a useful tool for comparison of the RL control law with other controllers. The simulation results must be examined to predict the likely performance of the resulting control policy. In fact, in simulation with linear controllers that proved unstable on the quadrotor, flight paths with growing oscillations were predicted that closely matched real flight data. Figure 5 presents flight test results for the RL controller.
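The search in Algorithm 1 amounts to hill-climbing on the policy weights with Gaussian perturbations, evaluated on the same fixed set of random scenarios at every iteration [15]. The sketch below runs that loop against a toy deterministic double-integrator standing in for the learned LWLR model; all constants are illustrative:

```python
import random

random.seed(1)
dt, t_max = 0.1, 50
S0 = [(random.uniform(-1, 1), 0.0) for _ in range(5)]   # fixed random initial states

def step(state, u):
    """Toy double-integrator stand-in for the learned LWLR model."""
    z, vz = state
    return (z + vz * dt, vz + u * dt)

def reward(state):
    z, vz = state
    return -1.0 * z ** 2 - 0.1 * vz ** 2        # quadratic reward, as in Eq. (19)

def policy(state, w):
    z, vz = state
    return w[0] + w[1] * z + w[2] * vz          # linear policy, as in Eq. (20)

def total_reward(w):
    R = 0.0
    for s0 in S0:                               # same fixed scenario set each iteration
        state = s0
        for _ in range(t_max):
            state = step(state, policy(state, w))
            R += reward(state)
    return R

w_best = [0.0, 0.0, 0.0]
R_best = total_reward(w_best)
for _ in range(500):
    w = [wi + random.gauss(0.0, 0.3) for wi in w_best]  # Gaussian update rule
    R = total_reward(w)
    if R > R_best:                              # keep only improving policies
        R_best, w_best = R, w

print(R_best > total_reward([0.0, 0.0, 0.0]))   # hill-climbing improved the policy
```

The small Gaussian steps refine the current best policy, while the occasional large step gives the finite escape probability from local maxima that the text describes; substituting the stochastic LWLR model for `step` recovers Algorithm 1.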

Fig. 5. Reinforcement learning controller response to manually applied step input.

C. Autonomous Hover

Applying ISM altitude control and integral LQR position control techniques, with the attitude states under automatic control, flight tests were performed to achieve the goal of autonomous hover. Position response was maintained within a 3m circle for the duration of a two minute flight (see Figure 6), which is well within the expected error bound for the L1 band differential GPS used. This is the first demonstration of such capabilities on a quadrotor known to the authors.

Fig. 6. Autonomous hover flight recorded position, in outdoor flight test, with 3m error circle. Spikes in state estimates are from sensor noise passing through the Kalman filter.

VII. CONCLUSION

This paper summarizes the development of an autonomous quadrotor capable of extended outdoor trajectory tracking control, and represents a critical step in developing a novel, easy to use, multi-vehicle testbed for validation of multi-agent control strategies for autonomous aerial robots. Specifically, two design approaches were presented for the altitude control loop, which proved a challenging hurdle. Both techniques resulted in stable controllers with similar response times, and were a significant improvement over linear controllers that failed to stabilize the system adequately. Comparison of the step responses for ISM and RL control reveals both stable performance and similar response times, although the transient dynamics of the ISM control are more pronounced.

The reinforcement learning control law is susceptible to system disturbances for which it is not trained. In particular, varying battery levels and blade degradation may cause a reduction in stability or a steady state offset. Addition of an integral error term to the control policy may prove an effective means of mitigating steady state disturbances. RL does, however, have the advantage that it incorporates the accelerometer measurement into its control, and as such uses a more direct measurement of the disturbances imposed on the aircraft.

ACKNOWLEDGMENTS

The authors would like to thank Dev Gorur Rajnarayan and David Dostal for their many contributions to STARMAC development and testing, as well as Prof. Andrew Ng of Stanford University for his advice and guidance in developing the Reinforcement Learning control.

REFERENCES

[1] Hoffmann, G., Rajnarayan, D. G., Waslander, S. L., Dostal, D., Jang, J. S., and Tomlin, C. J., "The Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC)," 23rd Digital Avionics Systems Conference, Salt Lake City, UT, November 2004.
[2] Lambermont, P., Helicopters and Autogyros of the World, 1958.
[3] DraganFly-Innovations, http://www.rctoys.com, 2003.
[4] Pounds, P., Mahony, R., Hynes, P., and Roberts, J., "Design of a Four-Rotor Aerial Robot," Australian Conference on Robotics and Automation, Auckland, November 2002.
[5] Altug, E., Ostrowski, J., and Taylor, C., "Quadrotor Control Using Dual Camera Visual Feedback," ICRA, Taipei, September 2003.
[6] Bouabdallah, S., Murrieri, P., and Siegwart, R., "Design and Control of an Indoor Micro Quadrotor," ICRA, New Orleans, April 2004.
[7] Castillo, P., Dzul, A., and Lozano, R., "Real-Time Stabilization and Tracking of a Four-Rotor Mini Rotorcraft," IEEE Transactions on Control Systems Technology, Vol. 12, No. 4, 2004.
[8] Bramwell, A., Done, G., and Balmford, D., Bramwell's Helicopter Dynamics, 2nd ed., Butterworth-Heinemann, 2001.
[9] Devantech, "SRF08 Ultrasonic Ranger," http://www.robot-electronics.co.uk/htm/srf08tech.shtml.
[10] Utkin, V., Guldner, J., and Shi, J., Sliding Mode Control in Electromechanical Systems, Taylor & Francis, 1999.
[11] Khalil, H., Nonlinear Systems, 2nd ed., Prentice Hall, 1996.
[12] Jang, J. S., Nonlinear Control Using Discrete-Time Dynamic Inversion Under Input Saturation: Theory and Experiment on the Stanford DragonFly UAVs, Ph.D. thesis, Stanford University.
[13] Sutton, R., and Barto, A., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[14] Doya, K., Samejima, K., ichi Katagiri, K., and Kawato, M., "Multiple Model-based Reinforcement Learning," Kawato Dynamic Brain Project Technical Report KDB-TR-08, Japan Science and Technology Corporation, June 2000.
[15] Ng, A. Y., and Jordan, M., "PEGASUS: A policy search method for large MDPs and POMDPs," Uncertainty in Artificial Intelligence, 2000.
[16] Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E., "Autonomous inverted helicopter flight via reinforcement learning," International Symposium on Experimental Robotics, 2004.
[17] Atkeson, C., Moore, A., and Schaal, S., "Locally Weighted Learning," Artificial Intelligence Review, Vol. 11, No. 1-5, pp. 11-73, 1997.