
Simultaneous Communication and Control
Author(s): P. Whittle and J. F. Rudge
Source: Advances in Applied Probability, Vol. 8, No. 2 (Jun., 1976), pp. 365-384
Published by: Applied Probability Trust
Stable URL: http://www.jstor.org/stable/1425909
Accessed: 13/07/2013 02:02

This content downloaded from 14.139.235.3 on Sat, 13 Jul 2013 02:02:26 AM All use subject to JSTOR Terms and Conditions

Adv. Appl. Prob. 8, 365-384 (1976)
Printed in Israel
© Applied Probability Trust 1976

SIMULTANEOUS COMMUNICATION AND CONTROL

P. WHITTLE AND J. F. RUDGE, University of Cambridge

Abstract

A dynamic co-operative two-person game with imperfect information is considered, in which the conflicting aspects of an optimal policy (choice of actions to minimise 'own cost', and choice of actions to optimise communication of information) can to a large extent be decoupled and solved separately. In deriving a strategy which is asymptotically optimal (in a sense explained) we derive formulae for the capacity of a vector Gaussian channel and the rate-distortion of a vector Gaussian source. It is shown that the optimal strategy is in general non-linear in observables, but a numerical example shows that, at least in one rather special but non-trivial case, the best linear strategy may be very nearly optimal. This conclusion is made explicit by the analysis of Section 7.

TEAM GAME; COMMUNICATION; LQG PROBLEMS

1. Introduction

It is well known (see e.g. Aoki (1967)) that, for one-person LQG optimal control problems, the optimal control is a linear function of the state estimate, and a linear function of noisy observations on the state, under a wide range of conditions. (The description LQG is now standard: it implies linear dynamics, quadratic costs and Gaussian noise disturbing dynamics and observations.) It is by now also well recognised that this linearity of optimal control does not in general extend to the multi-person case, where several agents have to choose decision functions on a common optimisation criterion, but based upon different information. Ho and Chu (1972) have given conditions which are sufficient to ensure linearity of optimal control.

For convenience of exposition consider a situation with just two agents or players, PI and PII. If a control by PI affects the information available to PII, and furthermore PII's information does not include all the information on which that control was based, then optimal controls will not in general be linear. Thus we see that the non-linearity arises when there is a possibility of communication between the players. PI will try to transmit information to PII in situations where this is appropriate, and in general there will be an exchange of (imperfect) information.
Received in revised form 18 November 1975.


The first and simplest example of such behaviour is that due to Witsenhausen (1968). This is a two-player situation in which PI observes a normal random variable $x_0$ and forms $x_1 = x_0 + u_1$, where $u_1$ is his control variable, an arbitrary Borel function of his observable $x_0$. PII then observes $y = x_1 + \varepsilon = x_0 + u_1 + \varepsilon$, where $\varepsilon$ is a normal random variable independent of $x_0$, and forms his control $u_2$, an arbitrary Borel function of his observable $y$. The aim is to choose $u_1$ and $u_2$ as functions of the relevant observables so as to minimise $E[k^2 u_1^2 + (x_1 - u_2)^2]$. That is, $u_1$ and $u_2$ between them are to 'cancel' $x_0$ as well as possible.

Witsenhausen shows that the best linear strategy (i.e. with $u_1$ and $u_2$ linear functions of $x_0$ and $y$ respectively) is not optimal. For some values of the parameters, a strategy which does better is one in which PI 'signals' to PII (at relatively low cost) the most likely value of $x_1$, upon which PII can take appropriate cancelling action. The functional to be minimised is not convex and this makes the optimisation very difficult. The optimal strategy has not yet been determined, and in fact no substantial work has been done on investigating the different aspects of control which cause this difficulty.

In the next section we shall formulate a problem in which control and communication are both required but the controls operate on separate parts of the system, so that joint control is not required. In this way the communications aspect can be isolated and, under conditions to be stated, the problem is solved. The controls which are optimal within the class of linear rules are found in Section 6, and it is shown that, except in special cases, these are not overall optimal. Because of the communications aspect, the setting of the problem is that of conventional communications theory, i.e., arbitrarily long sequences from stationary processes are observed, and the solutions are asymptotic in nature.

The situation we envisage is that in which PI can observe a time-sequence $\xi$, and on the basis of this chooses actions $u$ which affect the variable $x$ describing his state (so that $u$ and $x$ are also time-sequences). PII observes a noise-corrupted version $y$ of $x$, and on the basis of this must infer $\xi$. PI incurs costs corresponding to the departure of $u$ and $x$ from appropriate rest-values; PII incurs costs corresponding to errors in his estimate of $\xi$. PI can communicate with PII only through his state-variable $x$; his aim then must be to vary $u$ and $x$ so little from their rest-values that he himself incurs acceptable cost, while at the same time varying them sufficiently that he can communicate to PII enough information that PII can construct a reasonable estimate of $\xi$. He has then a combined control and communication problem.

It is hard to find an example of such a situation which is not artificial in one way or another. However, one might imagine PI to be a helicopter pilot, flying his craft along an irregular line of magnetic anomaly. PII is a survey party on the ground, trying to determine the course of the line from the helicopter's path. Here $\xi$, $u$, $x$, $y$ represent respectively the co-ordinate of the currently-traversed spot on the line, the helicopter controls, the helicopter path and the ground-party's observations on the helicopter path. The helicopter dynamics will favour a straight-line path; the pilot will vary from this, not so much as to follow the line of anomaly exactly, but enough that the survey party can reasonably estimate the line from his flight-path (everybody's action and estimation rules being mutually agreed). One can imagine this happening in real time: the survey party following the helicopter in some ground vehicle. Alternatively, one might imagine that the survey party observes the complete helicopter run (by radar fixes, say) before attempting to estimate the course $\xi$ of the line. This latter case is the one we shall consider, as it allows us to appeal to the concepts and results of statistical communication theory. We formulate the problem more exactly in the next section.

2. Formulation of the problem

We suppose then that PI observes virtually a complete realisation of a stationary ergodic Gaussian process $\{\xi_t\}$, where the variates $\xi_t$ are, say, $r$-vectors. On the basis of this he produces a control sequence $\{u_t\}$ by a transformation which is time-invariant but not necessarily linear. The control sequence when injected into his own linear dynamics produces a sequence $\{x_t\}$ related to $\{u_t\}$ by

(1)  $x_t = \sum_{s=0}^{\infty} \Phi_s u_{t-s} = \Phi(U) u_t$

where $\Phi(U) = \sum_s \Phi_s U^s$ and $U$ is the backward translation operator. We suppose the elements $u_t$ and $x_t$ to be $m$-vectors and $n$-vectors respectively, so that $\Phi_s$ is an $n \times m$ matrix. We imagine that it is not in PI's interest that either $u_t$ or $x_t$ should depart from zero, and so attach to his actions a cost

(2)  $L_1 = E\Big[\sum_{s=-\infty}^{\infty} \{x_{t-s}' R_s x_t + u_{t-s}' Q_s u_t\}\Big]$

where the matrices

(3)  $R(\omega) = \sum_{s=-\infty}^{\infty} R_s e^{-i\omega s}, \qquad Q(\omega) = \sum_{s=-\infty}^{\infty} Q_s e^{-i\omega s}$

are defined and non-negative definite (n.n.d.) for real $\omega$. The cost in (2) is independent of $t$ since $\{u_t\}$ and $\{x_t\}$ are, like $\{\xi_t\}$, stationary. We suppose that PII observes the entire realisation $\{y_t\}$ where

(4)  $y_t = x_t + \varepsilon_t$

and $\{\varepsilon_t\}$ is a stationary ergodic Gaussian disturbance process, independent of variables previously defined. On the basis of $\{y_t\}$ PII must form an estimate $\hat{\xi}_t$ of $\xi_t$. Of course, $\xi_t$ is a random variable, but one can still speak of 'estimation' (and PII has an inference problem which is naturally Bayesian). We shall suppose his cost function to be
(5)  $L_2 = E\Big[\sum_{s=-\infty}^{\infty} (\hat{\xi}_{t-s} - \xi_{t-s})' M_s (\hat{\xi}_t - \xi_t)\Big]$

where $M(\omega)$, defined as in (3), is also well defined and n.n.d. We suppose now that the two players must agree on decision rules (for the formation of $\{u_t\}$ from $\{\xi_t\}$ and of $\{\hat{\xi}_t\}$ from $\{y_t\}$) which will ensure the minimisation (or $\epsilon$-minimisation, i.e. reduction of the cost to within $\epsilon$ of its infimum, if no true minimum exists) of

$L = k^2 L_1 + L_2.$

Thus PI has a control problem; he must choose his actions so that the cost $L_1$ is not too large. He also has a communication problem; he must choose his actions so that they convey information on $\{\xi_t\}$ to PII, in order that $L_2$ is not too large. These requirements are in conflict, and may be seriously so.

We can use the concepts of statistical communication theory to solve the problem. These are effective because they enable us virtually to decouple the different aspects of the optimisation problem. The transformation $u \to y$, which is prescribed, can be regarded as defining a statistical communication channel with input $\{u_t\}$ and output $\{y_t\}$. The operation of the channel is constrained by the fact that $L_1$ must not be too large, and we can calculate the capacity (see Section 3 for definition and references) of the channel, $C$ say, for a prescribed value of $L_1$. In calculating $C$, we determine the statistics of the input $\{u_t\}$ which would realise this constrained capacity.

One has then the problem of transmitting a signal $\{\xi_t\}$ with known probability structure through a channel of capacity $C$ in such a way that the processed output $\{\hat{\xi}_t\}$ is a minimally distorted version of $\{\xi_t\}$, in that $L_2$ is minimised. This is a problem in source encoding, soluble by methods now well established (see Berger (1971)). By choosing various values of $L_1$, and so of $C$, one determines corresponding minimal values of $L_2$, and so obtains a graph of minimal $L_1/L_2$ values of the type of Figure 1. One can then readily determine the point on this graph for which $L = k^2 L_1 + L_2$ is minimal. In fact, such points generate the graph as $k^2$ is varied.

In order to apply the results of statistical communication theory, it is necessary that realisations of arbitrarily large signal blocks are available at all stages, and so, effectively, complete realisations. We emphasise that this means that, in control theory terms, the solutions are unrealisable, in that actions taken at a

[Figure 1: plot of $L_2$ (vertical axis, ticks 0.5 and 1.0) against $L_1$ (horizontal axis, ticks 50 and 100).]

Figure 1. Evaluations of the $L_1/L_2$ curve for the special case of Section 7, when 4 has the form (51). The lowest curve corresponds to the case p = 0, when the optimal linear rule is also optimal. The middle and uppermost curves correspond respectively to the optimal and optimal linear rules in the case p = 0.99.

given time depend not only on past observations, but also on observations which lie in the future.

3. Determination of the channel capacity

The first task is to determine the capacity of the information channel $u \to y$, under the constraint on $u$ that $L_1$, as given by (2), has a prescribed value. Brandenburg and Wyner (1974) have just given the solution to a problem which is identical except for a slightly different form of constraint on $u$. They obtained the result by establishing a coding theorem and its converse. We shall proceed along different lines.

An upper limit to the capacity is given by $\max i(u, y)$, where $i(u, y)$ is the mutual information rate between the two stationary series $u$ and $y$, and the maximisation is taken over $u$-statistics. In the present ergodic case this upper bound will in fact give the capacity exactly (Shannon (1957)). Furthermore, given that $\{\varepsilon_t\}$ is Gaussian and the constrained functional $L_1$ is mean-quadratic, the


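maximising $\{u_t\}$ will be Gaussian (Gallager (1968)) and the mutual information rate will equal (Pinsker (1964))

(6)  $i(u, y) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \log \frac{|f_{uu}|\,|f_{yy}|}{\begin{vmatrix} f_{uu} & f_{uy} \\ f_{yu} & f_{yy} \end{vmatrix}} \, d\omega.$

Here $f_{uy}$, for example, denotes the spectral density matrix

$f_{uy}(\omega) = \sum_{s=-\infty}^{\infty} E(u_t y_{t-s}')\, e^{-i\omega s}$

which we assume to exist. Now, given the relations (1) and (4) between $u$, $x$ and $y$ we deduce that

(7)  $f_{uy} = f_{uu} \Phi^\dagger$
(8)  $f_{yy} = \Phi f_{uu} \Phi^\dagger + f_{\varepsilon\varepsilon}$

where by $\Phi^\dagger = \Phi(\omega)^\dagger$ we mean the transpose of the matrix $\Phi(e^{i\omega})$, that is, the conjugate transpose of $\Phi(\omega) = \Phi(e^{-i\omega})$. It follows from (6), (7) and (8), that

(9)  $i(u, y) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \log \frac{|\Phi f_{uu} \Phi^\dagger + f_{\varepsilon\varepsilon}|}{|f_{\varepsilon\varepsilon}|} \, d\omega,$

and from (2), (7) and (8) that

(10)  $L_1 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \mathrm{tr}(f_{uu} Q + \Phi f_{uu} \Phi^\dagger R) \, d\omega,$

all the matrices in the integrands being functions of $\omega$.

We have now to maximise expression (9) with respect to the n.n.d. matrix $f_{uu}(\omega)$, subject to prescription of the value of the integral in (10). The maximisation can evidently be carried out separately for each value of $\omega$. Furthermore, since $f_{uu}$ takes values in a convex set (the set of n.n.d. matrices) and integrals (9) and (10) are respectively concave and linear functions of $f_{uu}$, we can tackle the constrained maximisation problem by maximising a Lagrangian form

(11)  $\mathcal{L}_1 = i(u, y) - \tfrac{1}{2} \lambda L_1.$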
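As a rough illustration of the frequency-by-frequency maximisation just described, the following Python sketch treats a single frequency in the scalar case. It assumes (this is a simplification, not the paper's general matrix setting) that the integrand of the Lagrangian reduces to $\log(1 + f/\gamma) - \lambda f$, where $f \ge 0$ is the normalised input spectral density and $\gamma > 0$ is the normalised noise level; the function names and the grid search are hypothetical. Under that assumption the maximiser is the 'water-filling' value $[1/\lambda - \gamma]^+$, which the sketch checks against a brute-force search.

```python
import math


def lagrangian_integrand(f: float, gamma: float, lam: float) -> float:
    """Assumed scalar integrand at one frequency: log(1 + f/gamma) - lam * f."""
    return math.log((f + gamma) / gamma) - lam * f


def closed_form_maximiser(gamma: float, lam: float) -> float:
    """Water-filling value [1/lam - gamma]^+ (zero when the noise is too high)."""
    return max(1.0 / lam - gamma, 0.0)


def grid_maximiser(gamma: float, lam: float,
                   f_max: float = 10.0, steps: int = 20000) -> float:
    """Brute-force maximiser of the integrand over a grid on [0, f_max]."""
    best_f = 0.0
    best_val = lagrangian_integrand(0.0, gamma, lam)
    for k in range(1, steps + 1):
        f = f_max * k / steps
        val = lagrangian_integrand(f, gamma, lam)
        if val > best_val:
            best_f, best_val = f, val
    return best_f


if __name__ == "__main__":
    for gamma in (0.5, 3.0):
        print(gamma, closed_form_maximiser(gamma, 0.4), grid_maximiser(gamma, 0.4))
```

With $\lambda = 0.4$, a quiet mode ($\gamma = 0.5$) receives positive power while a noisy mode ($\gamma = 3$) is left unused, which is the scalar counterpart of the positive-part operation that appears in the matrix solution.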

Note that the quotient of determinants in (9) is to be assigned the value $+\infty$ unless the null space of $f_{\varepsilon\varepsilon}$ is included in that of $\Phi f_{uu} \Phi^\dagger$, in which case the quotient can be expressed as a quotient of non-zero determinants in a reduced space.

We shall suppose that the matrix $Q + \Phi^\dagger R \Phi$ is non-singular for all $\omega$, so that deviations of $u$ from zero in any direction carry a positive cost. We shall also have occasion to factorise it as

(12)  $Q(\omega) + \Phi(\omega)^\dagger R(\omega) \Phi(\omega) = B(\omega)^\dagger B(\omega)$

where any square (and hence invertible) $B$ satisfying (12) is acceptable for our purposes.

If $d$ is a scalar then we shall use $d^+$ to denote the scalar which equals $d$ if $d > 0$, zero otherwise. Correspondingly, if $D$ is a symmetric matrix, we shall use $D^+$ to indicate the matrix obtained by retaining in the spectral representation of $D$ only those components associated with positive eigenvalues. We shall also appeal to the fact that a symmetric matrix $D$ can be uniquely represented as the difference of two non-negative definite mutually orthogonal matrices, and that this representation is just

(13)  $D = D^+ - [-D]^+.$
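The positive-part operation can be made concrete with a small Python sketch. It handles only real symmetric $2 \times 2$ matrices, using the closed-form eigen-decomposition, and the helper names are hypothetical. It computes $D^+$ and $[-D]^+$, then checks that their difference recovers $D$ and that the two parts are mutually orthogonal.

```python
import math

Matrix = list  # a 2x2 matrix as a nested list [[a, b], [b, c]]


def positive_part(d: Matrix) -> Matrix:
    """Keep only the positive-eigenvalue components of a real symmetric 2x2 matrix."""
    a, b, c = d[0][0], d[0][1], d[1][1]
    if b == 0.0:  # already diagonal: clip each diagonal entry at zero
        return [[max(a, 0.0), 0.0], [0.0, max(c, 0.0)]]
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    out = [[0.0, 0.0], [0.0, 0.0]]
    for eig in (mean + radius, mean - radius):
        if eig > 0.0:
            v = (b, eig - a)  # eigenvector of d for eigenvalue eig (non-zero since b != 0)
            norm2 = v[0] * v[0] + v[1] * v[1]
            for i in range(2):
                for j in range(2):
                    out[i][j] += eig * v[i] * v[j] / norm2
    return out


def negate(d: Matrix) -> Matrix:
    return [[-x for x in row] for row in d]


def subtract(p: Matrix, q: Matrix) -> Matrix:
    return [[p[i][j] - q[i][j] for j in range(2)] for i in range(2)]


def matmul(p: Matrix, q: Matrix) -> Matrix:
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


if __name__ == "__main__":
    d = [[1.0, 2.0], [2.0, 1.0]]       # eigenvalues 3 and -1
    plus = positive_part(d)            # D+
    minus = positive_part(negate(d))   # [-D]+
    print(plus, minus, subtract(plus, minus), matmul(plus, minus))
```

For larger matrices one would normally use a numerical eigen-solver rather than this closed form; the $2 \times 2$ case is chosen only to keep the sketch dependency-free.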

Representation (13) obviously has the desired property, and uniqueness is easily established. Finally, relations $D > 0$ and $D \ge 0$ will indicate that $D$ is symmetric, and is positive definite (p.d.) and n.n.d. respectively.

Theorem 1. Suppose that the matrix $\Phi(\omega)$ is square and non-singular, and define the matrix

(14)  $A(\omega) = B(\omega)\,\Phi(\omega)^{-1} f_{\varepsilon\varepsilon}(\omega)\,(\Phi(\omega)^\dagger)^{-1} B(\omega)^\dagger.$

Then the unique n.n.d. matrix $f_{uu}$ maximising expression (11) is

(15)  $f_{uu} = B^{-1} [\lambda^{-1} I - A]^+ (B^\dagger)^{-1}.$

The corresponding values of loss $L_1$ and channel capacity $C$ are

(16)  $L_1 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_j (\lambda^{-1} - \gamma_j(\omega))^+ \, d\omega$

(17)  $C = \frac{1}{4\pi} \int_{-\pi}^{\pi} \sum_j [\log (\lambda \gamma_j(\omega))^{-1}]^+ \, d\omega$

where the $\gamma_j(\omega)$ are the eigenvalues of $A(\omega)$.

Proof. Note that when $\Phi(\omega)$ is square and non-singular, we require $f_{\varepsilon\varepsilon}(\omega)$ to be likewise, since otherwise the channel would have infinite capacity. Thus $A(\omega)$ is p.d. If we transform to $f = B f_{uu} B^\dagger$ then expression (11) becomes

$\mathcal{L}_1 = \frac{1}{4\pi} \int_{-\pi}^{\pi} \Big[\log \frac{|f + A|}{|A|} - \lambda\, \mathrm{tr}(f)\Big] d\omega$

and a maximisation with respect to n.n.d. $f_{uu}$ is equivalent to a maximisation with respect to n.n.d. $f$. At the maximising value, the functional $\mathcal{L}_1$ will be non-increasing with respect to permitted perturbations. Such perturbations are

$f \to f + \delta a a^\dagger$

where $a$ is an arbitrary complex $m$-vector and $\delta$ is a sufficiently small scalar, which must be non-negative if $fa = 0$, and can be of either sign otherwise. In the case of $\delta$ infinitesimal, this leads to the conclusion

(18)  $a^\dagger [(f + A)^{-1} - \lambda I] a \le 0$

with equality if $fa \ne 0$. Define $N_0$ by

(19)  $-\lambda N_0 = (f + A)^{-1} - \lambda I.$

Condition (18) and its qualification can be alternatively expressed

(20)  $N_0 \ge 0, \qquad N_0 f = f N_0 = 0.$

We see from (19) that

(21)  $N_0 < I$

and we can invert (19) to

(22)  $f = -A + (1/\lambda)(I - N_0)^{-1} = -A + (1/\lambda) I + N_1$

where $\lambda N_1 = (I - N_0)^{-1} - I = N_0 (I - N_0)^{-1} = (I - N_0)^{-1} N_0$. From (20)-(22) we see that, also for $N_1$,

(23)  $N_1 \ge 0, \qquad N_1 f = f N_1 = 0$

so that (22) effectively represents $\lambda^{-1} I - A$ as $f - N_1$, the difference of two n.n.d. mutually orthogonal matrices. By the uniqueness of the decomposition (13), we deduce that

(24)  $f = [\lambda^{-1} I - A]^+,$

whence (15) follows. For this evaluation of $f_{uu}$ one finds by standard calculations that integrals (10) and (9) have the evaluations (16) and (17) respectively. Integral (17) can actually be identified with the capacity of the channel on account of the ergodic character of the channel. By varying $\lambda$ in (16) and (17), one can obtain the channel capacity $C$ for various values of $L_1$.

We have now to consider the case where $\Phi$ is non-invertible, and, if possible, find an analogue for the matrix $B^{-1} A (B^\dagger)^{-1}$ of (14) for such a case. We shall sketch this generalisation without completing details, which are straightforward but tedious. In the invertible case the matrix $B^{-1} A (B^\dagger)^{-1}$ has the following interpretation.


We see from (1) and (4) that the channel relation is

$y_t = \Phi(U) u_t + \varepsilon_t.$

Suppose this is written in terms of an (invertibly) transformed output by

$\Phi(U)^{-1} y_t = u_t + \Phi(U)^{-1} \varepsilon_t$

or

$\tilde{y}_t = u_t + \tilde{\varepsilon}_t$

say. Then $B^{-1} A (B^\dagger)^{-1}$ is just the spectral density matrix of the noise process for this standardised expression of the channel. By varying different components of $u_t$, one utilises different channel modes.

Recall that $m$ and $n$ are the dimensions of input vector $u_t$ and output vector $y_t$ respectively, so that $\Phi$ is $n \times m$. Let $\Phi$ have rank $s$. If $s < m$, this is a sign that the channel can utilise only $s$ linearly independent functions of $u$, so that $m$ can be effectively reduced to $s$ by a redefinition of $u$. Correspondingly, if $s < n$, then only $s$ channel modes are excited for any input $u$, so that $n$ can be effectively reduced to $s$ by a redefinition of $y$. Performance of these two operations reduces both $m$ and $n$ to $s$ with no loss of channel capacity, and $\Phi$ to a non-singular $s \times s$ matrix, so that we are back in the case of Theorem 1.

4. Determination of the optimal source encoding
We have the sequence of transformations $\xi \to u \to x \to y \to \hat{\xi}$, omitting the $t$ subscript for simplicity in cases where this causes no confusion. The transformation $u \to y$ is via a channel of capacity $C$, where $C$ is fixed by the magnitude prescribed for $L_1$. The information rate along the whole sequence $\xi \to \hat{\xi}$ thus cannot exceed $C$, but may well be made to attain $C$, since the transformations $\xi \to u$ and $y \to \hat{\xi}$ can be chosen freely by the two players.

Let us for simplicity denote $\hat{\xi}$ by $w$. This variable must be regarded as a distorted or informationally degraded version of $\xi$, in that it is what $\xi$ becomes when transmitted through a channel of prescribed capacity $C$, with pre-channel and post-channel operations designed to ensure that the $w - \xi$ deviation, as measured by $L_2$ of (5), is minimal. This is a recognised type of problem in statistical communications theory: that of rate distortion, of minimising the distortion measure $L_2$, subject to constraints on the information rate of the transformation $\xi \to w$. Under the ergodic hypotheses, the problem becomes equivalent to that of choosing a conditional distribution $p(\{w_t\} \mid \{\xi_t\})$ to minimise $L_2$ subject to

(25)  $i(\xi, w) \le C.$

Indeed, since $\{\xi_t\}$ is Gaussian and the distortion measure quadratic, it is known


that this extremising conditional distribution will be linear-Gaussian in the sense that it is specified by a linear relation

(26)  $w_t = \Psi(U) \xi_t + \eta_t,$

where $\{\eta_t\}$ is a stationary Gaussian process independent of $\{\xi_t\}$ (see Berger (1971) for theorems of this kind and a reference to Tsybakov (1969) for a similar result). The measures $L_2$ and $i(\xi, w)$ are then easily expressed in terms of the spectral density matrices of $\{\xi_t\}$ and $\{\eta_t\}$, which we shall write for simplicity as

$f_{\xi\xi}(\omega) = g(\omega), \qquad f_{\eta\eta}(\omega) = h(\omega).$

These expressions are

$L_2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \mathrm{tr}\{[(\psi - I) g (\psi - I)^\dagger + h] M\} \, d\omega$

$i(\xi, w) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \log \frac{|\psi g \psi^\dagger + h|}{|h|} \, d\omega$

where

$\psi(\omega) = \Psi(e^{-i\omega})$

and the task can be regarded as one of minimising a Lagrangian form

(27)  $\mathcal{L}_2 = L_2 + 2\mu\, i(\xi, w)$

with respect to $\psi(\omega)$ and $h(\omega)$. The scalar $\mu$ is a Lagrangian multiplier chosen to enforce the capacity constraint (25). The matrix function $\psi(\omega)$ may be chosen freely, while $h(\omega)$ is constrained to be n.n.d.

For simplicity of treatment, we shall suppose the matrices $M$ and $g$ to be non-singular for all $\omega$; the solution in singular cases is generally an obvious limit version of the solution for the non-singular case. We shall require a factorisation of $M(\omega)$ into square factors

$M(\omega) = N(\omega)^\dagger N(\omega)$

and shall then define the non-singular matrix

$P(\omega) = N(\omega) g(\omega) N(\omega)^\dagger$

whose eigenvalues we shall denote by $\pi_j(\omega)$ $(j = 1, 2, \ldots, r)$.

Theorem 2. The admissible values of $h(\omega)$ and $\psi(\omega)$ minimising expression (27) are

(28)  $h = N^{-1} [\mu I - \mu^2 P^{-1}]^+ (N^\dagger)^{-1}$
(29)  $\psi = \mu^{-1} h M$

and the corresponding values of distortion and information rate are

(30)  $L_2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_j \min(\mu, \pi_j(\omega)) \, d\omega$

(31)  $i(\xi, w) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \sum_j [\log(\pi_j(\omega)/\mu)]^+ \, d\omega.$
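The evaluations (30) and (31) have the familiar 'reverse water-filling' form, which can be sketched numerically. The Python fragment below works at a single frequency, with the normalising constants in front of the integrals omitted (an assumption made purely for illustration), and takes the eigenvalues $\pi_j$ as given positive numbers. The function names and the bisection routine are hypothetical helpers, not part of the paper.

```python
import math
from typing import List, Tuple


def distortion_and_rate(pis: List[float], mu: float) -> Tuple[float, float]:
    """Per-frequency integrands of (30) and (31), constants omitted.

    Each mode contributes distortion min(mu, pi_j) and rate [log(pi_j / mu)]^+.
    """
    distortion = sum(min(mu, p) for p in pis)
    rate = sum(max(math.log(p / mu), 0.0) for p in pis)
    return distortion, rate


def mu_for_rate(pis: List[float], target_rate: float, iters: int = 200) -> float:
    """Find the multiplier mu whose rate matches target_rate.

    The rate is non-increasing in mu, so a geometric bisection suffices.
    """
    lo, hi = 1e-12, max(pis)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        _, rate = distortion_and_rate(pis, mid)
        if rate > target_rate:
            lo = mid  # rate too high: raise the distortion level
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    pis = [4.0, 1.0, 0.25]
    print(distortion_and_rate(pis, 1.0))
    print(mu_for_rate(pis, math.log(4.0)))
```

Modes with $\pi_j \le \mu$ contribute no rate at all and their full variance is accepted as distortion, which is the source-coding counterpart of the unused channel modes in Theorem 1.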

Proof. If we transform from $h$, $\psi$ to

$H = N h N^\dagger, \qquad \Psi = N \psi N^{-1},$

where this $\Psi$ is not to be confused with that in (26), then minimisation of expression (27) is equivalent to minimisation of

$\int_{-\pi}^{\pi} \Big[\mathrm{tr}\{(\Psi - I) P (\Psi - I)^\dagger + H\} + \mu \log \frac{|\Psi P \Psi^\dagger + H|}{|H|}\Big] d\omega$

with respect to $\Psi$ and n.n.d. $H$. As in the proof of Theorem 1, we obtain the stationarity conditions

(32)  $(\Psi - I) P + \mu (\Psi P \Psi^\dagger + H)^{-1} \Psi P = 0$

(33)  $I + \mu (\Psi P \Psi^\dagger + H)^{-1} - \mu H^{-1} = S_0$

where $S_0$ is a matrix determined only to the extent that

$S_0 \ge 0, \qquad H S_0 = S_0 H = 0.$

Note that, since $i(\xi, w) \le C < \infty$, then $|\psi g \psi^\dagger + h| / |h|$ must be finite, save on an $\omega$-set of Lebesgue measure zero. This implies that the matrices in both determinants must have the same null space and so (since $g$ is non-singular) that the columns of $\psi$ ($\Psi$) must belong to the orthogonal complement of the null space of $h$ ($H$). Denote this orthogonal complement for $H$ by $\mathcal{X}$. Then the inverses in (32), (33) are all to be construed as proper inverses on $\mathcal{X}$. We deduce from (32), (33) that

$I - S_0 = \mu H^{-1} \Psi$

and hence that

$\Psi = \mu^{-1} H + S_1$

where $S_1$ is a matrix whose columns lie in the null space of $H$. But, by what has just been said, $S_1 = 0$, so that

(34)  $\Psi = \mu^{-1} H.$


Substituting for $\Psi$ back into (33) we deduce that

$I - S_0 = \mu [H^{-1} - ((H P H / \mu^2) + H)^{-1}] = \mu (\mu^2 I + P H)^{-1} P$

so that

$\mu^2 I + P H = \mu P (I - S_0)^{-1}$

and

$H = \mu (I - S_0)^{-1} - \mu^2 P^{-1} = \mu I - \mu^2 P^{-1} + S_2$

where, as in (23),

$S_2 \ge 0, \qquad S_2 H = H S_2 = 0.$

As in (24), it again follows that

(35)  $H = [\mu I - \mu^2 P^{-1}]^+.$

Relations (28), (29) now follow from (34), (35), and evaluations (30), (31) follow by routine calculations.

5. Synthesis

We need now to collect the results of the last two sections and show how, together with classical results of statistical communication theory, they provide, in a sense to be explained, a solution of the problem formulated in Section 2. We assume that the Lagrangian multipliers $\lambda$, $\mu$ of Sections 3 and 4 are chosen so that

$i(u, y) = i(\xi, w) = C$

(where the information rates $i$ have those values given for the extremal solutions of Theorems 1 and 2). The value of capacity $C$ thus parametrises the problem; as we vary it, we vary the separately minimised values of the losses $L_1$ and $L_2$, and there will be a value of $C$ for which the solutions minimise the combined cost $L = k^2 L_1 + L_2$ for any given $k^2$.
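To make the role of $C$ as a parameter concrete, here is a minimal Python sketch for the simplest conceivable case: a single scalar mode with flat spectra, so that the eigenvalues $\gamma$ (channel) and $\pi$ (source) are constants. These are assumptions made for illustration only. In that case the formulae of Theorems 1 and 2 give $C \propto \log(1 + L_1/\gamma)$ and $C \propto \log(\pi / L_2)$ with the same constant of proportionality, so equating the two rates yields $L_2 = \pi\gamma/(\gamma + L_1)$. The sketch scans $L_1$ on a grid to find the point minimising $k^2 L_1 + L_2$; the function names and grid parameters are hypothetical.

```python
import math
from typing import Tuple


def min_distortion(l1: float, gamma: float, pi_: float) -> float:
    """L2 attainable for control cost l1 in the flat scalar case (rates equated)."""
    return pi_ * gamma / (gamma + l1)


def best_operating_point(k: float, gamma: float, pi_: float,
                         l1_max: float = 50.0, steps: int = 50000
                         ) -> Tuple[float, float, float]:
    """Grid search for the l1 minimising k^2 * l1 + L2(l1); returns (l1, l2, total)."""
    best_l1 = 0.0
    best_total = min_distortion(0.0, gamma, pi_)
    for i in range(1, steps + 1):
        l1 = l1_max * i / steps
        total = k * k * l1 + min_distortion(l1, gamma, pi_)
        if total < best_total:
            best_l1, best_total = l1, total
    return best_l1, min_distortion(best_l1, gamma, pi_), best_total


def closed_form_l1(k: float, gamma: float, pi_: float) -> float:
    """Stationary point of k^2 * l1 + pi*gamma/(gamma + l1), clipped at zero."""
    return max(math.sqrt(pi_ * gamma) / k - gamma, 0.0)


if __name__ == "__main__":
    print(best_operating_point(0.5, 1.0, 4.0))
    print(closed_form_l1(0.5, 1.0, 4.0))
```

Varying $k$ in this sketch traces out the trade-off curve of $L_2$ against $L_1$; for frequency-dependent or vector spectra the same idea applies, but the two multipliers $\lambda$ and $\mu$ must then be matched through the integrated rates rather than mode by mode.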

Consider operation of the channel for $T$ consecutive instants of time. The fact that the channel has capacity $C$ means that one can distinguish from the output without error between just $M = [e^{TC}]$ input values (in that the probability of error tends to zero as $T \to \infty$ if one attempts to transmit one of not more than $[e^{T(1-\epsilon)C}]$ values for any $\epsilon > 0$, and is positive, indeed tends to 1 as $T \to \infty$, if one attempts to transmit one of $[e^{T(1+\epsilon)C}]$ equiprobable values). This capacity can be realised by choosing as the input wave-terms $u^{(1)}, u^{(2)}, \ldots, u^{(M)}$ where these are independent realisations of a sequence $\{u_1, u_2, \ldots, u_T\}$ chosen using the optimal statistics calculated in Theorem 1.

The degradation which $\xi$ unavoidably suffers in the channel can be performed on it before transmission by a compression or source-encoding step in which the values of $\{\xi_1, \xi_2, \ldots, \xi_T\}$ are mapped on to $M$ values $w^{(1)}, w^{(2)}, \ldots, w^{(M)}$, where the $w$'s are $M$ independent realisations of $\{w_1, w_2, \ldots, w_T\}$ using the minimal distortion $w$-statistics calculated in Theorem 2. Each value of $\{\xi_1, \xi_2, \ldots, \xi_T\}$ is coded on to a value $w^{(j)}$ by choosing the $w^{(j)}$ of the randomly constructed code-list that best approximates it in a time-averaged version of distortion measure $L_2$. The $M$ values $w^{(j)}$ are then coded on to the $M$ values $u^{(j)}$ by an arbitrary one-to-one correspondence. This gives one an asymptotically ($T \to \infty$) optimal transformation $\xi \to u$.

The values $u^{(1)}, u^{(2)}, \ldots, u^{(M)}$ can be distinguished with probability one (in the limit $T \to \infty$) from the channel output block $\{y_1, y_2, \ldots, y_T\}$ using a maximum-likelihood decision rule. One can then infer back to the intended value of $w$, say $w(y)$, and take this as an estimate $\{w_1, w_2, \ldots, w_T\}$ of $\{\xi_1, \xi_2, \ldots, \xi_T\}$. This sequence of codings and decodings achieves asymptotically ($T \to \infty$) minimal distortion (or estimation error) consistent with the constraints on information capacity.

It is possible that all these codings could be realised as well asymptotically by some relatively simple transformations $\xi \to u$, $y \to w$, or even by linear transformations. In fact, however, one sees from the solutions of the two theorems that linear transformations cannot in general be optimal. To see this, introduce the notion of the spectral gap set $G_\xi$ of $\{\xi_t\}$: the values of $\omega$ for which $a^\dagger f_{\xi\xi}(\omega) a = 0$ for some $a \ne 0$. $G_\xi$ may be arbitrary (we assumed it empty in Theorem 2, but will not require this in general). $G_u$ and $G_w$ are determined by (15) and (28) respectively, and are in general unrelated to $G_\xi$ or to each other. Linear relations on the other hand for $\xi \to u$, $y \to w$ would imply at least that $G_\xi \subset G_u$ and $G_y \subset G_w$. The absence of necessity for such relations implies that in general the optimal codings are non-linear.

6. The optimal linear rules

Given that the optimal transformations $\xi \to u$, $y \to w$ are in general non-linear, there is some interest in determining the optimal linear transformations, so that results may be compared. Because one is dealing with two iterated linear operations, this optimisation problem turns out not to be of the usual LQG nature, for which an optimal linear operator is determined from a linear equation. The equations for the optimal linear operators in the present case are non-linear. A consequence of this is that the optimal linear operators share one characteristic with the optimal unrestricted operators. They show spectral gaps, in general, corresponding to the fact that there are certain channel modes and


frequencies which are not worth using, in that noise is too high, or effective signalling too costly.

We shall consider only the case where the matrix $\Phi(\omega)$ is square (i.e. $m = n$) and non-singular. Other cases can be reduced to this one, as indicated at the end of Section 3. We shall also suppose that $r = n$ (so that the dimensions of $\xi$, $x$ and $u$ are all identical). The analysis of this section generalises to the case $r \ne n$ in a fashion sufficiently evident that the extra discussion needed seems unwarranted.

Let the $\xi \to u$ transformation be

$u_t = \sum_{s=-\infty}^{\infty} \theta_s \xi_{t-s}$

denoted in short by

(36)  $u = \theta \xi$

and let the generating function

$\theta(\omega) = \sum_{s=-\infty}^{\infty} \theta_s e^{-i\omega s}$

also be denoted simply by $\theta$ in cases where there is no ambiguity. Let similarly

$w = \beta y$

so that

$x = \Phi \theta \xi, \qquad w = \beta(\Phi \theta \xi + \varepsilon).$

The quantity to be minimised with respect to the two operators $\theta$ and $\beta$ is then

$L_1 + L_2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \big[\mathrm{tr}(\theta g \theta^\dagger (Q + \Phi^\dagger R \Phi)) + \mathrm{tr}[M \{(\beta \Phi \theta - I) g (\beta \Phi \theta - I)^\dagger + \beta f_{\varepsilon\varepsilon} \beta^\dagger\}]\big] d\omega.$

We could just as well be minimising $k^2 L_1 + L_2$ with the $k^2$ absorbed into the definitions of $Q$ and $R$.

Consider the following normalisation of the problem. Let $g$ be factorised by $g = G G^\dagger$ into square (non-singular) factors. Define also the matrix

(37)  $V = (\Phi^\dagger)^{-1} Q \Phi^{-1} + R.$

Then, since $f_{\varepsilon\varepsilon}$ is taken to be p.d., there exists a matrix $K$ such that the matrices

(38)  $K^\dagger V K = \Gamma = \mathrm{diag}(\gamma_j), \qquad K^\dagger f_{\varepsilon\varepsilon}^{-1} K = I$


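are simultaneously diagonal. Note that the eigenvalues of $V f_{\varepsilon\varepsilon}$ are just those of $A$ in Section 3. We shall transform from $\theta$, $\beta$ and $M$ to

(39)  $\zeta = K^{-1} \Phi \theta G, \qquad \alpha = G^{-1} \beta K, \qquad S = G^\dagger M G$

in terms of which the quantity to be minimised is the integral of

(40)  $L(\omega) = \mathrm{tr}\{\Gamma \zeta \zeta^\dagger + S[(\alpha \zeta - I)(\alpha \zeta - I)^\dagger + \alpha \alpha^\dagger]\}.$

We shall suppose that the matrix $S$ has spectral resolution

$S = \sum_{j=1}^{n} \pi_j \sigma_j \sigma_j^\dagger$

where the $\pi_j$ are the same as in Section 4, i.e. the eigenvalues of $P$, and are supposed ordered so that $\pi_j$ increases as $\gamma_j$ decreases.

Theorem 3. The matrices $\alpha$ and $\zeta$ minimising expression (40) may be determined as follows:

(41)  $\alpha = \zeta^\dagger (I + \zeta \zeta^\dagger)^{-1},$

and the $j$th column $\zeta_j$ of $\zeta^\dagger$ can be taken as

$\zeta_j = [\gamma_j^{-1/2} (\pi_j^{1/2} - \gamma_j^{1/2})^+]^{1/2} \sigma_j.$

The separate components of minimal loss have the evaluations

(42)  $L_1 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_j \gamma_j^{1/2} (\pi_j^{1/2} - \gamma_j^{1/2})^+ \, d\omega$

(43)  $L_2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_j \pi_j^{1/2} \min(\pi_j^{1/2}, \gamma_j^{1/2}) \, d\omega.$

Proof. Stationarity of expression (40) with respect to $\alpha$ and $\zeta$ yields the conditions

(44)  $(\alpha \zeta - I) \zeta^\dagger + \alpha = 0$
(45)  $\Gamma \zeta + \alpha^\dagger S (\alpha \zeta - I) = 0.$

From (44) we deduce expression (41) for $\alpha$ in terms of $\zeta$, which we shall write alternatively as

(46)  $\alpha = \zeta^\dagger J^{-1}$

with $J = I + \zeta \zeta^\dagger$. Substituting this value of $\alpha$ into (40), we can write

(47)  $L(\omega) = \mathrm{tr}[\Gamma \zeta \zeta^\dagger + S(I - \zeta^\dagger J^{-1} \zeta)]$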


which is no longer a quadratic function of 4. The condition that this expression be stationary with respect to ; can be derived either by direct differentiation or by substituting solution (46) into (45). It is (48) r5F' = J-~'S'tJ-.

Now $\Gamma$ is diagonal and $\zeta\zeta^\dagger$ and the right-hand side are symmetric, so that, unless there is a particular relation between the $\gamma_j$, expression (48) implies that $\zeta\zeta^\dagger$ and $J^{-1}\zeta S\zeta^\dagger J^{-1}$ are both diagonal as well. Let

$$\zeta\zeta^\dagger = \operatorname{diag}(\delta_j) = D.$$

From (44) it follows that $J$ and $\zeta S\zeta^\dagger$ have the evaluations

$$J = I + D, \qquad \zeta S\zeta^\dagger = (I + D)\,\Gamma D\,(I + D) \tag{49}$$

and are also diagonal, and hence that, as claimed in the theorem, the columns of $\zeta^\dagger$ are proportional to right eigenvectors of $S$. We can set then

$$\zeta_j = \delta_j^{1/2}\sigma_j$$

where the $\delta_j$ are possibly zero, and $\sigma_j$ is a right eigenvector of $S$ (with the appropriate labelling $j$ as yet undetermined). It follows then from (45) that

$$\pi_j\delta_j = (\delta_j + 1)^2\gamma_j\delta_j$$

so that either $\delta_j = 0$ or

$$\delta_j = \gamma_j^{-1/2}(\pi_j^{1/2} - \gamma_j^{1/2}).$$

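Now from (48) and the diagonal nature of all the matrices, we deduce that expression (47) has the evaluation

$$L(\omega) = \sum_j (\pi_j - \gamma_j\delta_j^2).$$

Thus the optimal choice of $\delta_j$ is clearly

$$\delta_j = \gamma_j^{-1/2}(\pi_j^{1/2} - \gamma_j^{1/2})_+$$

for all $j$, when

$$L(\omega) = \sum_j [\pi_j - (\pi_j^{1/2} - \gamma_j^{1/2})_+^2].$$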
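As a numerical illustration of these evaluations at a single frequency, the following sketch computes the $\delta_j$ and the two loss components for given pairs $(\pi_j, \gamma_j)$. It is a sketch under stated assumptions: the pairing of $\pi_j$ with $\gamma_j$ is taken as given, the first component is read as $\sum_j \gamma_j\delta_j$ (the term $\operatorname{tr}\Gamma\zeta\zeta^\dagger$) and the second as $\sum_j \pi_j/(1+\delta_j)$, and the function name `linear_rule_losses` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def linear_rule_losses(
    pi: Sequence[float], gamma: Sequence[float]
) -> Tuple[List[float], float, float]:
    """Return (deltas, l1, l2) at one frequency for paired (pi_j, gamma_j).

    delta_j = gamma_j^(-1/2) * (pi_j^(1/2) - gamma_j^(1/2))_+
    l1      = sum_j gamma_j * delta_j          (input-power component)
    l2      = sum_j pi_j / (1 + delta_j)       (estimation component)
    """
    deltas: List[float] = []
    l1 = 0.0
    l2 = 0.0
    for p, g in zip(pi, gamma):
        d = max(math.sqrt(p) - math.sqrt(g), 0.0) / math.sqrt(g)
        deltas.append(d)
        l1 += g * d
        l2 += p / (1.0 + d)
    return deltas, l1, l2


def total_loss_closed_form(pi: Sequence[float], gamma: Sequence[float]) -> float:
    """Closed form sum_j [pi_j - (pi_j^(1/2) - gamma_j^(1/2))_+^2]."""
    return sum(
        p - max(math.sqrt(p) - math.sqrt(g), 0.0) ** 2 for p, g in zip(pi, gamma)
    )
```

For instance, with `pi = [4.0, 1.0]` and `gamma = [1.0, 4.0]` only the first component is active ($\delta_1 = 1$, $\delta_2 = 0$), and the sum of the two components agrees with the closed form.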

We have now to determine the labelling of the $\pi_j$ which will make this sum in its turn minimal. Because the function $(\cdot)_+$ is convex, this minimum is achieved when the $\pi_j$ and $\gamma_j$ are oppositely ordered, i.e.,

$$\gamma_j > \gamma_k \implies \pi_j \le \pi_k.$$
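The ordering claim can be illustrated by brute force on a small example. The sketch below is an illustration under assumptions, not the authors' argument: it takes the loss in the form $\sum_j [\pi_j - (\pi_j^{1/2} - \gamma_j^{1/2})_+^2]$, tries every labelling of the $\pi_j$ against fixed $\gamma_j$, and returns the labelling with the smallest sum. The names `paired_loss` and `best_labelling` are hypothetical.

```python
import math
from itertools import permutations
from typing import Sequence, Tuple


def paired_loss(pi: Sequence[float], gamma: Sequence[float]) -> float:
    """Sum over j of pi_j - (sqrt(pi_j) - sqrt(gamma_j))_+^2 for the given pairing."""
    return sum(
        p - max(math.sqrt(p) - math.sqrt(g), 0.0) ** 2 for p, g in zip(pi, gamma)
    )


def best_labelling(pi: Sequence[float], gamma: Sequence[float]) -> Tuple[float, ...]:
    """Exhaustively search the labellings of pi; feasible only for small n."""
    return min(permutations(pi), key=lambda perm: paired_loss(perm, gamma))
```

With `gamma = [1.0, 2.0, 5.0]` in increasing order and `pi = [1.0, 9.0, 4.0]`, the search returns the $\pi_j$ in decreasing order, so the largest $\pi_j$ is matched with the smallest $\gamma_j$.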


Evaluations (42), (43) of the separate components now follow by routine calculations.

7. The relationship between optimised costs and those obtained using the optimal linear rules

In this section (due to J.F.R.) we are interested in determining under what circumstances the linear laws found in Section 6 achieve optimal results, and in obtaining some idea of the degree of sub-optimality in cases where this does not occur. Denote by $\mathcal{O}_1$ and $\mathcal{O}_2$ the optimised values of $L_1$ and $L_2$ and retain the notation $L_1$, $L_2$ for the values using optimal linear rules.

Theorem 4. Linear rules achieve overall optimality over the whole range of the parametrising constant $k$, if and only if $r = n$ and for all $j$ ($1 \le j \le n$) and almost all $\omega$

$$\pi_j(\omega)\gamma_j(\omega) = a^2$$

for some positive constant $a$.

Proof. Suppose firstly that the conditions are satisfied. Then for all $k > 0$ we have from Theorem 3 with $k^2$ incorporated

$$L_1 = \frac{1}{2\pi}\int \sum_j [a/k - \gamma_j(\omega)]_+ \, d\omega, \qquad L_2 = \frac{1}{2\pi}\int \sum_j \min[\pi_j(\omega),\ ak] \, d\omega.$$

Note that if $\lambda = k/a$ and $\mu = ka$ then for all $j$,

$$(\lambda\gamma_j(\omega))^{-1} = (ka/\pi_j(\omega))^{-1} = \pi_j(\omega)/\mu$$

so that this pair of $\lambda$, $\mu$ satisfy the capacity constraint of Sections 3 and 4. They trivially give optimised values $\mathcal{O}_i = L_i$, $i = 1, 2$, completing the proof of sufficiency.

Now let $t = \min\{r, n\}$. For all $k$, there exists an optimal linear control which utilises only $t$ noise channels. However, if $t < n$ then for some values of $\mathcal{O}_1$, $\mathcal{O}_2$ the optimal coding needs to use all $n$ noise channels, giving a higher information rate between control and observation than is possible with just $t$ channels. If $t < r$ then the information degradation takes place over at most $t$ channels in the linear case, whereas to obtain overall optimality it is necessary in some situations to use source codewords that are $r$-vectors. Hence if linear rules achieve optimality for all $k$, then $t = r = n$.
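The sufficiency argument can be checked numerically in the memoryless case, where the integrals over $\omega$ reduce to plain sums. The sketch below rests on assumptions that are not stated in this section: the optimal costs are taken in the water-filling forms $\mathcal{O}_1 = \sum_j (1/\lambda - \gamma_j)_+$ and $\mathcal{O}_2 = \sum_j \min(\pi_j, \mu)$ (inferred from the proof of Theorem 5), and the linear costs are taken as $L_1 = \sum_j \gamma_j^{1/2}(\pi_j^{1/2}/k - \gamma_j^{1/2})_+$ and $L_2 = \sum_j \min(\pi_j, k\pi_j^{1/2}\gamma_j^{1/2})$. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def linear_costs(
    pi: Sequence[float], gamma: Sequence[float], k: float
) -> Tuple[float, float]:
    """Assumed memoryless linear-rule costs (L1, L2) with k incorporated."""
    l1 = sum(
        math.sqrt(g) * max(math.sqrt(p) / k - math.sqrt(g), 0.0)
        for p, g in zip(pi, gamma)
    )
    l2 = sum(min(p, k * math.sqrt(p * g)) for p, g in zip(pi, gamma))
    return l1, l2


def optimal_costs(
    pi: Sequence[float], gamma: Sequence[float], lam: float, mu: float
) -> Tuple[float, float]:
    """Assumed water-filling forms of the optimised costs (O1, O2)."""
    o1 = sum(max(1.0 / lam - g, 0.0) for g in gamma)
    o2 = sum(min(p, mu) for p in pi)
    return o1, o2


def costs_agree(gamma: Sequence[float], a: float, k: float) -> bool:
    """With pi_j * gamma_j = a^2, compare linear and optimal costs at lam=k/a, mu=k*a."""
    pi = [a * a / g for g in gamma]
    l1, l2 = linear_costs(pi, gamma, k)
    o1, o2 = optimal_costs(pi, gamma, k / a, k * a)
    return math.isclose(l1, o1, rel_tol=1e-9) and math.isclose(l2, o2, rel_tol=1e-9)
```

For example, `costs_agree([0.5, 1.0, 4.0], a=2.0, k=1.0)` compares the two pairs of costs when $\pi_j\gamma_j = 4$ for every $j$; under the assumed forms both give $L_1 = 2.5$ and $L_2 = 5$.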


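To prove necessity of the second condition, consider the spectral density matrix of $u$. For optimal $u$ we have from Theorem 1

$$K^{-1} f_{uu} (K^\dagger)^{-1} = [\lambda^{-1}K^{-1}B^{-1}(K^{-1}B^{-1})^\dagger - K^{-1} f_{\varepsilon\varepsilon} (K^\dagger)^{-1}]_+ = [\lambda^{-1}K^{-1}V^{-1}(K^\dagger)^{-1} - I]_+$$

using (12), (37) and (38)

$$= [\lambda^{-1}\Gamma^{-1} - I]_+.$$

For optimal linear $u$, we have from (36), (37) and Theorem 3

$$f_{uu} = \theta f_{\varepsilon\varepsilon}\theta^\dagger = \theta G G^\dagger\theta^\dagger = K\zeta\zeta^\dagger K^\dagger,$$

so that

$$K^{-1} f_{uu} (K^\dagger)^{-1} = \zeta\zeta^\dagger = \operatorname{diag}(\delta_j).$$

Thus for equality of the spectral density matrices in the two cases we require

$$\operatorname{diag}([(1/\lambda\gamma_j) - 1]_+) = \operatorname{diag}([(\pi_j/k^2\gamma_j)^{1/2} - 1]_+)$$

i.e.

$$\operatorname{diag}(\gamma_j^{-1}[(1/\lambda) - \gamma_j]_+) = \operatorname{diag}(\gamma_j^{-1}[(\pi_j\gamma_j/k^2)^{1/2} - \gamma_j]_+). \tag{50}$$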

Now if the condition of the theorem does not hold, then for some $\lambda$ it is not possible with any $k$ to achieve equality in (50) on any $\omega$ set of measure $2\pi$. For this $\lambda$ choose $k$ such that $L_2 = \mathcal{O}_2$, the latter quantity being defined by $\lambda$. Now the $u$ achieving optimality is unique and so the linear rule must necessarily give $L_1 > \mathcal{O}_1$. Hence the condition is necessary.

The special case of this theorem when $r = n = 1$ and $\gamma_j$, $\pi_j$ are $\omega$-independent for all $j$ (the scalar memoryless case) is well known and states that if a memoryless Gaussian source and a synchronous time-discrete memoryless additive Gaussian noise channel are given, then direct connection of the source to the channel, together with proper scaling of input and output, results in a communication system that is ideal with respect to the squared error criterion.

There are naturally other instances in which linear rules might be optimal and the following theorem gives one important example. Consider a memoryless situation and suppose without loss of generality that

$$\gamma = \gamma_1 \le \gamma_2 \le \gamma_3 \le \cdots \le \gamma_n,$$

$$\pi = \pi_1 \ge \pi_2 \ge \pi_3 \ge \cdots \ge \pi_r.$$

Theorem 5. For the case described above, let the input powers $L_1$ and $\mathcal{O}_1$ both equal $\delta$. Let the multiplicity of $\gamma$ be $p$ and the multiplicity of $\pi$ be $q$. Then for small enough $\delta$

(i) $\mathcal{O}_2 = L_2$ if $p = q$;

(ii) $L_2 < \mathcal{O}_2 + O(\delta^2)$ if $p \ne q$.

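Proof. Let $s = \min[p, q]$. By hypothesis we have

$$\gamma = \gamma_1 = \cdots = \gamma_p < \gamma_{p+1},$$

$$\pi = \pi_1 = \cdots = \pi_q > \pi_{q+1}$$

with an obvious re-definition if $p = n$ or $q = r$. Let

$$\delta < \min\{\,p(\gamma_{p+1} - \gamma),\ \ p\gamma[(\pi/\pi_{q+1})^{q/p} - 1],\ \ s\gamma\{1 - \max[(\pi_{q+1}/\pi)^{1/2},\ (\gamma/\gamma_{p+1})^{1/2}]\}\,\}.$$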

This condition ensures that only terms corresponding to $\pi$ and $\gamma$ occur in the various summations. Now $\mathcal{O}_1 = \delta = p(1/\lambda - \gamma)$ and the capacity constraint gives

$$p\log(1 + \delta/p\gamma) = q\log(\pi/\mu),$$

i.e.

$$\mu = \pi(1 + \delta/p\gamma)^{-p/q}.$$

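Thus

$$\mathcal{O}_2 = \sum_{j=1}^{r}\pi_j - q\pi[1 - (1 + (\delta/p\gamma))^{-p/q}]$$

$$= \sum_{j=1}^{r}\pi_j - (\pi\delta/\gamma) + (\pi\delta^2/2\gamma^2)((1/p) + (1/q)) + O(\delta^3).$$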

For the linear case we have

$$L_1 = \delta = s\gamma^{1/2}k^{-1}[\pi^{1/2} - k\gamma^{1/2}]$$

so that $k = s(\pi\gamma)^{1/2}/(\delta + s\gamma)$. Thus

$$L_2 = \sum_{j=1}^{r}\pi_j - s\pi + sk(\pi\gamma)^{1/2} = \sum_{j=1}^{r}\pi_j - s\pi[1 - (1 + (\delta/s\gamma))^{-1}].$$

We see that if $s = p = q$ then $L_2 = \mathcal{O}_2$, otherwise

$$L_2 = \sum_{j=1}^{r}\pi_j - (\pi\delta/\gamma) + (\pi\delta^2/s\gamma^2) + O(\delta^3) < \mathcal{O}_2 + O(\delta^2).$$
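The two closed forms in this proof are easy to compare numerically. The following sketch is an illustration under assumptions: it evaluates $\mathcal{O}_2$ and $L_2$ from the closed forms $\sum_j\pi_j - q\pi[1 - (1 + \delta/p\gamma)^{-p/q}]$ and $\sum_j\pi_j - s\pi[1 - (1 + \delta/s\gamma)^{-1}]$, takes the smallness condition on $\delta$ for granted rather than checking it, and uses hypothetical function names. The predicted leading-order gap $(\pi\delta^2/\gamma^2)[1/s - \tfrac12(1/p + 1/q)]$ comes from subtracting the two expansions.

```python
def optimal_o2(pi_sum: float, pi: float, gamma: float, p: int, q: int, delta: float) -> float:
    """Closed form for the optimised second cost at input power delta."""
    return pi_sum - q * pi * (1.0 - (1.0 + delta / (p * gamma)) ** (-p / q))


def linear_l2(pi_sum: float, pi: float, gamma: float, p: int, q: int, delta: float) -> float:
    """Closed form for the best linear rule's second cost at input power delta."""
    s = min(p, q)
    return pi_sum - s * pi * (1.0 - 1.0 / (1.0 + delta / (s * gamma)))


def predicted_gap(pi: float, gamma: float, p: int, q: int, delta: float) -> float:
    """Leading-order difference L2 - O2 obtained from the two expansions."""
    s = min(p, q)
    return (pi * delta ** 2 / gamma ** 2) * (1.0 / s - 0.5 * (1.0 / p + 1.0 / q))
```

With $p = q$ the two costs coincide, while with, say, $p = 1$, $q = 3$ and $\delta = 0.01$ the difference is positive and close to the predicted second-order term, which is the content of parts (i) and (ii).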

It is easy to see why Part (i) is true in that for small input power only part of the channel is used, and in the extreme case $p = q = 1$ we are effectively back to the scalar case. However, it is important to note that even where the linear rules are sub-optimal for all small $\delta$, the extent of sub-optimality has order $\delta^2$ rather than just $\delta$.

Finally we consider a simple scalar example with memory. Suppose that $R$, $Q$, $f_{\varepsilon\varepsilon}$, $g$ and $M$ are all constant with the value unity, and that $\phi$ has a first-order autoregressive character, i.e.

$$|\phi|^2 = \phi(e^{i\omega})\phi(e^{-i\omega}) = 1/(1 + \rho^2 - 2\rho\cos\omega) \tag{51}$$

for some scalar $\rho$. The case $\rho = 0$ reduces to the memoryless case with solution

$$L_1 = 2((1/L_2) - 1), \qquad 0 < L_2 \le 1, \quad L_1 \ge 0.$$

In the case when $\rho$ approaches unity it was conjectured by the authors that PI would find it much more difficult to pass information effectively to PII. The results, shown in Figure 1 in the form of a graph of $L_1$ against $L_2$, only partially bear this out. The right-hand tail of the graph for optimal values has the form

$$L_1 \sim 2.604((1/L_2) - 1.15), \qquad L_1 \ge 2$$

so that for small values of $L_2$ the ratio of optimal $L_1$ values in the two cases is about 1.3. Another surprising feature emerging from the graph is that the linear strategy, although non-optimal in the case of non-zero $\rho$, is nevertheless so very near optimal.