The difference between memory and prediction in linear recurrent networks
Sarah Marzen1, ∗
1 Physics of Living Systems, Department of Physics,
Massachusetts Institute of Technology, Cambridge, MA 02139
∗ semarzen@mit.edu
(Dated: August 15, 2017)
arXiv:1706.09382v2 [cs.LG] 14 Aug 2017
Recurrent networks are often trained to memorize their input, in the hopes that such
training will increase the ability of the network to predict. We show that networks designed to
memorize input can be arbitrarily bad at prediction. We also find, for several types of inputs,
that one-node networks optimized for prediction are nearly at upper bounds on predictive capacity
given by Wiener filters, and are roughly equivalent in performance to randomly generated five-node
networks. Our results suggest that maximizing memory capacity leads to very different networks
than maximizing predictive capacity, and that optimizing recurrent weights can decrease reservoir
size by half an order of magnitude.

Keywords: echo state networks, memory capacity

PACS numbers: 02.50.-r 89.70.+c 05.45.Tp 02.50.Ey

Often, we remember for the sake of prediction. Such is the case, it seems, in the field of echo state networks (ESNs) [1, 2]. ESNs are large input-dependent recurrent networks in which a "readout layer" is trained to match a desired output signal from the present network state. Sometimes, the desired output signal is the past or future of the input to the network.

If the recurrent networks are large enough, they should have enough information about the past of the input signal to reproduce a past input or predict a future input well, and only the readout layer need be trained. Still, the weights and structure of the recurrent network can greatly affect its predictive capabilities, and so many researchers are now interested in optimizing the network itself to maximize task performance [3].

Much of the theory surrounding echo state networks centers on memorizing white noise, an input for which memory is essentially useless for prediction [4]. This leads to a rather practical question: how much of the theory surrounding optimal reservoirs, based on maximizing memory capacity [5-9], is misleading if the ultimate goal is to maximize predictive power?

We study the difference between optimizing for memory and optimizing for prediction in linear recurrent networks subject to scalar, temporally-correlated input generated by countable Hidden Markov models. Ref. [10] gave closed-form expressions for the memory function of continuous-time linear recurrent networks in terms of the autocorrelation function of the input, and closely studied the case of an exponential autocorrelation function. Ref. [11] gave similar expressions for discrete-time linear recurrent networks. Ref. [12] gave closed-form expressions for the Fisher memory curve of discrete-time linear recurrent networks, which measures how much changes in the input signal perturb the network state; for linear recurrent networks, this curve is independent of the particular input signal.

We differ from these previous efforts mostly in that we study both memory capacity and a newly-defined "predictive capacity". We derive an upper bound for predictive capacity via Wiener filters in terms of the autocorrelation function of the input. Two surprising findings result. First, predictive capacity is not typically maximized at the "edge of criticality", unlike memory capacity [5, 7, 9]. Instead, maximizing memory capacity can lead to minimization of predictive capacity. Second, optimized one-node networks tend to achieve more than 99% of the possible predictive capacity, while (unoptimized) linear random networks need at least five nodes to reliably achieve similar memory and predictive capacities, and ten-node nonlinear random networks cannot match the optimized one-node linear network. The latter result suggests that optimizing reservoir weights can lead to at least half an order-of-magnitude reduction in the size of the reservoir with no loss in task performance.

I. MODEL

Let s(n) denote the input signal at time n, and let x(n) denote the network state at time n. The network state updates as

x(n + 1) = W x(n) + s(n) v,   (1)

where W and v are two reservoir properties that we wish to optimize.
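As a concrete illustration of Eq. 1, the following minimal NumPy sketch (not from the paper; the reservoir size, spectral-radius rescaling, input statistics, and seed are arbitrary illustrative choices) simply iterates the linear state update for a small reservoir driven by a scalar input sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5          # number of reservoir nodes (arbitrary choice)
T = 10_000     # number of time steps

# Reservoir matrix W and input weights v; W is rescaled so its spectral radius
# is below 1, which for this linear network guarantees the echo state property.
W = rng.uniform(-1.0, 1.0, size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
v = rng.uniform(0.0, 1.0, size=N)

# Scalar input; white noise here, but any stationary input could be substituted.
s = rng.standard_normal(T)

# Iterate Eq. (1): x(n + 1) = W x(n) + s(n) v, starting from x(0) = 0.
x = np.zeros((T, N))
for n in range(T - 1):
    x[n + 1] = W @ x[n] + s[n] * v
```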

We restrict our attention to the case that W is diagonalizable,

W = P \,\mathrm{diag}(\vec{d})\, P^{-1},   (2)

where P is the matrix of eigenvectors of W and d⃗ are the corresponding eigenvalues. For reasons that will become clear later, we define a vector

\omega = P^{-1} v.   (3)

We further assume that the input s(t) has been generated by a countable Hidden Markov model, so that its autocorrelation function can be expressed as

R_{ss}(t) = \sum_{\lambda \in \Lambda} A(\lambda)\, \lambda^{|t|},   (4)

where Λ is a set of numbers with magnitude less than 1. See Ref. [13] or the appendix. To avoid normalization factors, we assert that

R_{ss}(0) = \sum_{\lambda \in \Lambda} A(\lambda) = 1.   (5)

The power spectral density of this input process, with

R_{ss}(t) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(f)\, e^{ift}\, df,   (6)

is

S(f) = \sum_{k=-\infty}^{\infty} R_{ss}(k)\, e^{-ifk}   (7)
     = \sum_{\lambda \in \Lambda} A(\lambda) \sum_{k=-\infty}^{\infty} \lambda^{|k|} e^{-ifk}   (8)
     = \sum_{\lambda \in \Lambda} A(\lambda)\, \frac{1 - \lambda^2}{(1 - \lambda e^{-if})(1 - \lambda e^{if})}   (9)

by the Wiener–Khinchin theorem.

II. RESULTS

The memory function is classically defined by [5]

m(k) := p_k^{\top} C^{-1} p_k,   (10)

where

p_k = \langle s(n-k)\, x(n) \rangle_n   (11)

and

C = \langle x(n)\, x(n)^{\top} \rangle_n.   (12)

Due to Eq. 5, we need not divide p_k^⊤ C^{-1} p_k by the variance of the input. This memory function is also the squared correlation coefficient between our recollection or forecast of input s(n − k) from the network state x(n) and the true input s(n − k).

Memory capacity is usually defined as Σ_{k=0}^{∞} m(k), but since Eq. 1 updates x(n) with s(n − 1) instead of s(n), we have

MC = \sum_{k=1}^{\infty} m(k),   (13)

and we define the predictive capacity as

PC := \sum_{k=0}^{\infty} m(-k).   (14)

Intuitively, MC is higher when the present network state is better able to remember inputs, while PC is higher when the present network state is better able to forecast inputs based on what it remembers of input pasts.

We have made an effort here to find the most useful expressions for MC and PC, so that one might consider using the expressions here to calculate MC and PC instead of simulating the input and recurrent network. As shown in the appendix,

PC = 2\pi\, \omega^{\top} D_{PC} B^{-1} \omega,   (15)

where

B := \int_{-\pi}^{\pi} S(f) \left( \frac{\omega}{e^{-if} - \vec{d}} \right) \left( \frac{\omega}{e^{if} - \vec{d}} \right)^{\top} df,   (16)

which is related to 2πC by a similarity transform, and where

D_{PC} := \sum_{\lambda, \lambda' \in \Lambda} \frac{A(\lambda) A(\lambda')}{1 - \lambda \lambda'} \left( \frac{1}{\lambda^{-1} - \vec{d}} \right) \left( \frac{1}{(\lambda')^{-1} - \vec{d}} \right)^{\top}.   (17)
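To make Eqs. 15–17 concrete, the sketch below (not part of the paper; the particular Λ, A(λ), ω, and d⃗ are arbitrary illustrative choices, and the quadrature grid is a convenience) evaluates B by numerical quadrature over f, builds D_PC from the double sum over Λ, and combines them into PC. For the one-node example at the end, with d = 0 and a single λ = e^{−0.1}, the result should be close to λ²/(1 − λ²) = 1/(e^{0.2} − 1) ≈ 4.52.

```python
import numpy as np

def predictive_capacity(lams, A, omega, d, n_f=20_000):
    """Evaluate PC = 2*pi * omega^T D_PC B^{-1} omega (Eqs. 15-17) for a linear
    reservoir with (assumed real) eigenvalues d, mixed input weights omega, and
    input autocorrelation R_ss(t) = sum_lambda A(lambda) lambda^|t|."""
    lams, A = np.asarray(lams, float), np.asarray(A, float)
    omega, d = np.asarray(omega, float), np.asarray(d, float)

    # Power spectral density S(f), Eq. (9).
    f = np.linspace(-np.pi, np.pi, n_f, endpoint=False)
    S = sum(a * (1 - l**2) / np.abs(1 - l * np.exp(-1j * f))**2
            for a, l in zip(A, lams))

    # B, Eq. (16): B_ij = integral of S(f) omega_i omega_j / ((e^{-if}-d_i)(e^{if}-d_j)).
    left = omega[:, None] / (np.exp(-1j * f)[None, :] - d[:, None])
    right = omega[:, None] / (np.exp(+1j * f)[None, :] - d[:, None])
    df = f[1] - f[0]
    B = (S * left[:, None, :] * right[None, :, :]).sum(axis=-1).real * df

    # D_PC, Eq. (17).
    D = np.zeros((len(d), len(d)))
    for a, l in zip(A, lams):
        for ap, lp in zip(A, lams):
            D += (a * ap / (1 - l * lp)) * np.outer(1 / (1 / l - d), 1 / (1 / lp - d))

    return float(2 * np.pi * omega @ D @ np.linalg.solve(B, omega))

# One-node check: d = 0 (i.e., W = 0) and R_ss(t) = e^{-0.1 |t|}.
print(predictive_capacity(lams=[np.exp(-0.1)], A=[1.0], omega=[1.0], d=[0.0]))
```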
The expression for memory capacity is more involved:

MC = 2\pi\, \omega^{\top} D_{MC} B^{-1} \omega,   (18)

where the matrix D_{MC} has entries

(D_{MC})_{ij} = \sum_{\lambda, \lambda' \in \Lambda} A(\lambda) A(\lambda') \left[ \frac{1 + d_i d_j \lambda (\lambda')^3 + d_i d_j \lambda' \lambda^3 + d_i d_j (\lambda \lambda')^2 - d_i d_j \lambda \lambda' - d_i d_j (\lambda')^2 - d_i d_j \lambda^2 - d_i^2 d_j^2}{(1 - \lambda \lambda')(1 - d_i \lambda)(1 - d_i \lambda')(1 - d_j \lambda)(1 - d_j \lambda')(1 - d_i d_j)}
    - \frac{d_i (1 - d_i d_j) \lambda^2 \lambda' + d_j (1 - d_i d_j) \lambda (\lambda')^2}{(1 - \lambda \lambda')(1 - d_i \lambda)(1 - d_i \lambda')(1 - d_j \lambda)(1 - d_j \lambda')(1 - d_i d_j)} \right],   (19)

as shown in the appendix. Together, these expressions explain why simple linear ESNs [14] can perform just as well as non-simple linear ESNs on maximization of MC; from Eq. 18, the memory capacity of a linear ESN is the same as the memory capacity of a simple linear ESN with v = ω and W = diag(d⃗).

[Figure 1 and Figure 2 appear here; only the captions are reproduced below.]

FIG. 1. MC (top) and PC (bottom) as a function of W for R_ss(t) = e^{−0.1|t|} (blue) and R_ss(t) = (1/2)e^{−0.1|t|} + (1/2)e^{−|t|} (green), computed using Eqs. 15 and 18 in the main text. While PC is maximized for some intermediate W that depends on the input signal, MC is maximized in the limit W → 1. When |W| ≥ 1, the network no longer satisfies the echo state property, and so we only calculate PC and MC for |W| < 1.

FIG. 2. MC (top) and PC (bottom) as a function of N for R_ss(t) = (1/2)e^{−0.1|t|} + (1/2)e^{−|t|} and ω, d⃗ drawn randomly: ω_i ∼ U[0, 1], d_i ∼ U[−1, 1]. The signal-limited maximal PC is shown in green, whereas MC is network-limited. Both MC and PC were computed using Eqs. 15 and 18 in the main text.

It is unsurprising, but not often mentioned, that the reservoirs which maximize memory capacity are different from the reservoirs that maximize predictive capacity. To illustrate how different the two reservoirs might be, we consider the capacity of a one-node network subject to two types of input.

The first type of input considered has autocorrelation R_ss(t) = e^{−α|t|}. Some algebra reveals that

MC = \frac{e^{4\alpha} - 2 e^{\alpha} W + 2 e^{3\alpha} W - W^2}{(e^{2\alpha} - 1)(e^{2\alpha} - W^2)} \quad \text{and} \quad PC = \frac{e^{2\alpha}(1 - W^2)}{(e^{2\alpha} - 1)(e^{2\alpha} - W^2)}.
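The qualitative behavior in Fig. 1 (blue curves) can also be reproduced without the closed-form expressions above, by evaluating m(k) for a one-node network directly from truncated sums: with scalar weight W and v = 1, Eq. 1 gives p_k = Σ_{m≥1} W^{m−1} R_ss(k − m) and C = Σ_{m,m'≥1} W^{m−1} W^{m'−1} R_ss(m − m') (these are the scalar versions of Eqs. B3 and B5 in the appendix). The sketch below is not from the paper; the truncation depths and the W grid are arbitrary choices.

```python
import numpy as np

alpha = 0.1
Rss = lambda t: np.exp(-alpha * np.abs(t))        # R_ss(t) = e^{-alpha |t|}

M = 400   # truncation of the geometric sums over m (converges for |W| < 1)
K = 400   # number of lags k over which MC and PC are accumulated

def capacities(W):
    m = np.arange(1, M + 1)
    Wpow = W ** (m - 1)
    # C = sum_{m,m'} W^{m-1} W^{m'-1} R_ss(m - m'); p_k = sum_m W^{m-1} R_ss(k - m).
    C = np.sum(np.outer(Wpow, Wpow) * Rss(m[:, None] - m[None, :]))
    def mfun(k):
        p = np.sum(Wpow * Rss(k - m))
        return p * p / C                           # m(k), Eq. (10), scalar case
    MC = sum(mfun(k) for k in range(1, K + 1))     # Eq. (13)
    PC = sum(mfun(-k) for k in range(0, K + 1))    # Eq. (14)
    return MC, PC

Ws = np.linspace(-0.99, 0.99, 199)
MCs, PCs = np.array([capacities(W) for W in Ws]).T
print("MC largest near W =", Ws[np.argmax(MCs)])   # expected: near the edge, W -> 1
print("PC largest near W =", Ws[np.argmax(PCs)])   # expected: W = 0 for this input
```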
Inspection of these MC and PC formulae, or of their plots in Fig. 1 (blue lines) for α = 0.1, shows that MC is maximized at the "edge of criticality", W → 1, at which point x(n) is an average of observed s(n), i.e., x(n) = ⟨s(k)⟩_{k≤n}. Interestingly, at that point, PC is minimized, i.e., PC = 0. Instead, for this particular input, PC is maximized at W = 0, at which point x(n) = s(n − 1), i.e., x(n) is the last observed input symbol.

Both memory and predictive capacity can increase without bound by increasing the length of temporal correlations in this input: lim_{W→1} MC = coth(α/2), and lim_{W→0} PC = (1/2)(coth α − 1). These results mirror what was found in Ref. [10] for continuous-time networks: lim_{W→1} MC = 2/α plus corrections of O(α), and lim_{W→0} PC = 1/(2α) plus corrections of O(1).

It is a little strange to say that W = 0 can maximize the predictive capacity of a reservoir, as W = 0 implies that there essentially is no reservoir. But such an arg max_W PC is unusual. Consider input with R_ss(t) = (1/2)e^{−0.1|t|} + (1/2)e^{−|t|} to a one-node network. Memory capacity is still maximized as W → 1, but predictive capacity is now maximized at W ≈ 0.8. See Fig. 1 (green). Interestingly, we still minimize any error in memorization of previous inputs by storing (and implicitly guessing) their average value.

The scaling of capacity with network size is also very different for memory and prediction. Memory capacity MC famously scales linearly with the number of nodes for linear recurrent networks [5]. Unlike memory, predictive capacity PC is bounded by the signal itself. The Wiener filter k_τ(n) minimizes the mean-squared error ⟨(s(n + τ) − ŝ(n + τ))²⟩_n between future input s(n + τ) and a forecast of that future input from past input, ŝ(n + τ) := Σ_{m=0}^{∞} k_τ(m) s(n − m). Recall that minimizing mean-squared error is equivalent to maximizing the correlation coefficient between future input and the forecast of this future input. Hence, we can place an upper bound on predictive capacity PC in terms of Wiener filters, which after some straightforward simplification shown in the appendix takes the form

PC \le \sum_{\tau=0}^{\infty} \vec{r}_{\tau}^{\top} R^{-1} \vec{r}_{\tau},   (20)

where (\vec{r}_{\tau})_i = R_{ss}(\tau + i) and R_{ij} = R_{ss}(i - j).

As PC is at most finite, the scaling of PC with the number of nodes N of the network must eventually be o(1). See Fig. 2 (bottom). For instance, for R_ss(t) = (1/2)e^{−0.1|t|} + (1/2)e^{−|t|}, Eq. 20 gives PC ≤ 1.652, which is nearly attained by the optimal one-node network, for which max_W PC ≈ 1.65. And this is not a special property of a cherry-picked input signal; similar results hold for other, randomly chosen Λ, A(λ) combinations not shown here.
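The bound of Eq. 20 is easy to evaluate numerically by truncating R and r⃗_τ to a finite number of past lags and truncating the sum over τ. The sketch below (not from the paper; the truncation sizes are arbitrary) does this for R_ss(t) = ½e^{−0.1|t|} + ½e^{−|t|}; with generous truncations the result should approach the value of roughly 1.65 quoted above.

```python
import numpy as np

def Rss(t):
    # R_ss(t) = (1/2) e^{-0.1|t|} + (1/2) e^{-|t|}
    return 0.5 * np.exp(-0.1 * np.abs(t)) + 0.5 * np.exp(-np.abs(t))

M = 300    # number of past lags kept in R and r_tau (truncation, arbitrary)
T = 300    # number of terms kept in the sum over tau (truncation, arbitrary)

i = np.arange(1, M + 1)
R = Rss(i[:, None] - i[None, :])      # R_ij = R_ss(i - j), an M x M Toeplitz matrix
R_inv = np.linalg.inv(R)

bound = 0.0
for tau in range(T):
    r_tau = Rss(tau + i)              # (r_tau)_i = R_ss(tau + i)
    bound += r_tau @ R_inv @ r_tau    # summand of Eq. (20)

print("truncated upper bound on PC:", bound)
```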
The surprisingly good performance of optimized one-node networks leads us then to ask how big random (unoptimized) networks need be in order to achieve similar results. Unoptimized random networks need ≈ 5 nodes to reliably achieve results similar to the optimized one-node network for both memory and predictive capacity. See Fig. 2 in comparison to Fig. 1.

Finally, we ask whether any of the lessons learned here for linear recurrent networks extend to nonlinear recurrent networks, in which

x(n + 1) = f(W x(n) + s(n) v)   (21)

for some nonlinear function f. From Eq. B1, we see that linear recurrent networks forecast input via a linear combination of past input; therefore, as noted previously, their performance is bounded from above by the performance of Wiener filters. The performance of nonlinear recurrent networks is bounded above by a quantity that depends on the nonlinearity, which in principle might surpass the bound on predictive capacity given by Eq. 20.

However, optimizing the weights of nonlinear recurrent networks is far more difficult than for linear recurrent networks. This is illustrated by Fig. 3 (bottom), which shows the estimated PC of random nonlinear networks. We estimate the predictive capacity from simulations via Σ_{k=0}^{M} p̂_k^⊤ Ĉ^{−1} p̂_k, where p̂_k is the sample covariance of s(n + k) and x(n), Ĉ is the sample covariance matrix of x(n), and M is taken to be 100, as the correlation coefficient dies off relatively quickly. The reservoir properties W and v are chosen randomly, in that both matrix elements W_ij and vector elements v_i are drawn uniformly at random from the unit interval; the matrix W is rescaled so that the eigenvalue of maximum magnitude has magnitude 1/1.1, and the nonlinearity is set to f(x) = tanh x. The input to the network is generated by the Hidden Markov model shown in Fig. 3 (top). For comparison, the green line shows the upper bound on predictive capacity for linear recurrent networks given by Eq. 20, which is achieved by one-node linear networks with W = 0. These numerical results are qualitatively similar to results attained when comparing the memory capacity of linear and nonlinear recurrent networks, in that linear networks tend to outperform nonlinear networks [12, 15].

[Figure 3 appears here; only the caption is reproduced below.]

FIG. 3. At top, the Hidden Markov model generating input to the nonlinear recurrent network. Edges are labeled p(x|g)|x, where x is the emitted symbol and p(x|g) is the probability of emitting that symbol when in hidden state g, and arrows indicate which hidden state one goes to after emitting a particular symbol from the previous hidden state. This Hidden Markov model generates a zero-mean, unit-variance Even Process, which has autocorrelation function R_ss(t) = (−1/2)^{|t|}. At bottom, the predictive capacity of random nonlinear recurrent networks whose evolution is given by Eq. 21 with f(x) = tanh(x) and entries of W and v drawn randomly: W_ij, v_i ∼ U[0, 1], where W is then scaled so that its largest-magnitude eigenvalue has absolute value 1/1.1. 25 random networks are surveyed at each N, and the blue line tracks the mean. The green line shows both the predictive capacity of the optimized one-node linear network and the upper bound from Eq. 20.

III. DISCUSSION

The famous Wiener filter is a linear combination of the past input signal that minimizes the mean-squared error between said linear combination and a future input. Linear recurrent networks are, in some sense, an attempt to approximate the Wiener filter under constraints on the kernel that come from the structure of the recurrent network. Here, the linear filter is not allowed access to all of the past of the signal, but only to the echoes of the signal's past provided by the present state of the nodes. The advantage of such an approximation is that one need only store the present network state, as opposed to storing the entire past of the input signal. In other words, the present network state provides a nearly sufficient "echo" of the input signal's past for input prediction.

We have studied the resource savings that can come from optimizing the recurrent network and readout weights, as opposed to just optimizing the readout weights. Surprisingly, we find that a network designed to maximize memory capacity can have arbitrarily low predictive capacity; see Fig. 1. More encouragingly, we find that an optimized single-node linear recurrent network is essentially equivalent, in terms of both memory and predictive capacity, to a five-node random linear recurrent network, and has near-maximal predictive capacity. Finally, numerical results suggest that nonlinear recurrent networks have more difficulty achieving high predictive capacity relative to the Wiener-filter upper bound, even though these nonlinear networks might in principle surpass such an upper bound.

It is unclear whether or not the factor of five will generalize to nonlinear recurrent networks or to inputs generated by uncountable Hidden Markov models, e.g., the output of chaotic dynamical systems. Perhaps more importantly, predictive capacity is not necessarily the quantity that we would most like to maximize [16]. Hopefully, the differences between memory and predictive capacity presented here will stimulate the search for more task-appropriate objective functions and for more reservoir optimization recipes.

ACKNOWLEDGMENTS

We owe substantial intellectual debts to A Goudarzi, J P Crutchfield, S Still, and A Bell, and additionally thank I Nemenman, N Ay, C Hillar, S DeDeo, and W Bialek for very useful conversations. S.M. was funded by an MIT Physics of Living Systems Fellowship.

REFERENCES

[1] Herbert Jaeger. The "echo state" approach to analysing and training recurrent neural networks-with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.
[2] Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
[3] Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.
[4] Memory can be used to estimate the bias of the coin, but nothing else about the past provides a guide to the future input.
[5] Herbert Jaeger. Short term memory in echo state networks, volume 5. GMD-Forschungszentrum Informationstechnik, 2001.
[6] Olivia L White, Daniel D Lee, and Haim Sompolinsky. Short-term memory in orthogonal neural networks. Physical Review Letters, 92(14):148102, 2004.
[7] Joschka Boedecker, Oliver Obst, Joseph T Lizier, N Michael Mayer, and Minoru Asada. Information processing in echo state networks at the edge of chaos. Theory in Biosciences, 131(3):205–213, 2012.
[8] Igor Farkaš, Radomír Bosák, and Peter Gergel'. Computational analysis of memory capacity in echo state networks. Neural Networks, 83:109–120, 2016.
[9] Peter Barančok and Igor Farkaš. Memory capacity of input-driven echo state networks at the edge of chaos. In International Conference on Artificial Neural Networks, pages 41–48. Springer, 2014.
[10] Michiel Hermans and Benjamin Schrauwen. Memory in linear recurrent neural networks in continuous time. Neural Networks, 23(3):341–355, 2010.
[11] Alireza Goudarzi, Sarah Marzen, Peter Banda, Guy Feldman, Christof Teuscher, and Darko Stefanovic. Memory and information processing in recurrent neural networks. arXiv:1604.06929, 2016.
[12] Surya Ganguli, Dongsung Huh, and Haim Sompolinsky. Memory traces in dynamical systems. Proceedings of the National Academy of Sciences, 105(48):18970–18975, 2008.
[13] Paul M Riechers, Dowman P Varn, and James P Crutchfield. Pairwise correlations in layered close-packed structures. Acta Crystallographica Section A: Foundations and Advances, 71(4):423–443, 2015.
[14] Georg Fette and Julian Eggert. Short term memory and pattern matching with simple echo state networks. Artificial Neural Networks: Biological Inspirations – ICANN 2005, pages 13–18, 2005.
[15] Taro Toyoizumi. Nearly extensive sequential memory lifetime achieved by coupled nonlinear neurons. Neural Computation, 24(10):2678–2699, 2012.
[16] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capacity and trainability in recurrent neural networks. arXiv:1611.09913, 2016.

Appendix A: Autocorrelation function of Hidden Markov models

This is a simple version of the argument in Ref. [13] that assumes diagonalizability of the transition matrix. Let T^{(x)} be the labeled transition matrices of the Hidden Markov model, let

T = \sum_{x} T^{(x)}   (A1)

be the transition matrix, and let \vec{p}_{eq} = \mathrm{eig}_1(T) be the stationary distribution over the hidden states. Assuming zero-mean input, we have

R(t) = \langle x(t-1)\, x(0) \rangle   (A2)
     = \sum_{x, x'} x x' \Pr(X_{t-1} = x, X_0 = x')   (A3)
     = \sum_{x, x'} x x'\, \vec{1}^{\top} T^{(x)} T^{t} T^{(x')} \vec{p}_{eq}   (A4)
     = \sum_{x, x'} \vec{1}^{\top} \left( x T^{(x)} \right) T^{t} \left( x' T^{(x')} \right) \vec{p}_{eq}   (A5)
     = \vec{1}^{\top} \left( \sum_{x} x T^{(x)} \right) T^{t} \left( \sum_{x} x T^{(x)} \right) \vec{p}_{eq}.   (A6)

If T is diagonalizable (and it typically is), then T = P \,\mathrm{diag}(\vec{\lambda})\, P^{-1} leads to

R(t) = \vec{1}^{\top} \left( \sum_{x} x T^{(x)} \right) P \,\mathrm{diag}(\vec{\lambda})^{t}\, P^{-1} \left( \sum_{x} x T^{(x)} \right) \vec{p}_{eq},   (A7)

and so R(t) is a linear combination of the \lambda_i^t.
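As a worked example of Eq. A6, the sketch below computes R(t) for the two-state Hidden Markov model of Fig. 3. The labeled transition matrices are a reconstruction from the figure, not given explicitly in the text: state A emits 1/√2 with probability 1/2 and stays in A, emits −√2 with probability 1/2 and moves to B, and state B emits 1/√2 with probability 1 and moves to A. With these choices the process is zero-mean and unit-variance, and the computed R(t) should match the (−1/2)^{|t|} stated in the caption of Fig. 3. (The matrix power below is t − 1 rather than t, so that R(1) is the lag-one autocorrelation under the column-stochastic convention used here.)

```python
import numpy as np

# Labeled transition matrices, column convention: T_x[i, j] = Pr(emit x, go to i | in j).
# State order: [A, B]; emitted symbols a = 1/sqrt(2) and b = -sqrt(2).
a, b = 1 / np.sqrt(2), -np.sqrt(2)
T_a = np.array([[0.5, 1.0],
                [0.0, 0.0]])       # emit a: A -> A with prob 1/2, B -> A with prob 1
T_b = np.array([[0.0, 0.0],
                [0.5, 0.0]])       # emit b: A -> B with prob 1/2
T = T_a + T_b                      # internal Markov transition matrix, Eq. (A1)

# Stationary distribution: the eigenvector of T with eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(T)
p_eq = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
p_eq /= p_eq.sum()

one = np.ones(2)
M = a * T_a + b * T_b              # sum_x x T^{(x)}

def R(t):
    if t == 0:
        return a**2 * (one @ T_a @ p_eq) + b**2 * (one @ T_b @ p_eq)
    return one @ M @ np.linalg.matrix_power(T, t - 1) @ M @ p_eq

for t in range(5):
    print(t, round(R(t), 6), (-0.5) ** t)   # the two columns should agree
```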



Appendix B: Derivation of closed-form expressions for PC and MC

From Eq. 1, we have

x(n) = \left( \sum_{k=1}^{\infty} W^{k-1} s(n-k) \right) v,   (B1)

assuming the echo state property.
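Equation B1 can be checked numerically: iterating Eq. 1 from x(0) = 0 and evaluating the (finite, since x(0) = 0) series should agree to machine precision. A minimal sketch with arbitrary sizes and seed:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 500

W = rng.uniform(-1, 1, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # enforce the echo state property
v = rng.uniform(0, 1, N)
s = rng.standard_normal(T)

# Iterate Eq. (1) up to n = T - 1, starting from x(0) = 0.
x = np.zeros(N)
for n in range(T - 1):
    x = W @ x + s[n] * v

# Series of Eq. (B1) at n = T - 1: x(n) = sum_{k >= 1} W^{k-1} s(n - k) v.
x_series = sum(np.linalg.matrix_power(W, k - 1) @ v * s[T - 1 - k]
               for k in range(1, T))
print(np.max(np.abs(x - x_series)))                # should be ~1e-15
```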

Thus,

p_k = \langle s(n-k)\, x(n) \rangle_n   (B2)
    = \sum_{m=1}^{\infty} W^{m-1} R_{ss}(k-m)\, v   (B3)

and

C = \langle x(n)\, x(n)^{\top} \rangle_n   (B4)
  = \sum_{m, m'=1}^{\infty} W^{m-1} v v^{\top} (W^{\top})^{m'-1} R_{ss}(m-m').   (B5)

Substituting Eq. 6 into the above equation gives

C = \sum_{m, m'=1}^{\infty} W^{m-1} v v^{\top} (W^{\top})^{m'-1} \frac{1}{2\pi} \int_{-\pi}^{\pi} S(f)\, e^{if(m-m')}\, df   (B6)
  = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(f) \left( \sum_{m=1}^{\infty} e^{ifm} W^{m-1} \right) v v^{\top} \left( \sum_{m'=1}^{\infty} (W^{\top})^{m'-1} e^{-ifm'} \right) df   (B7)
  = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(f) \left( \sum_{m=0}^{\infty} e^{ifm} W^{m} \right) v v^{\top} \left( \sum_{m'=0}^{\infty} (W^{\top})^{m'} e^{-ifm'} \right) df   (B8)
  = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(f) \left( I - e^{if} W \right)^{-1} v v^{\top} \left( I - e^{-if} W^{\top} \right)^{-1} df,   (B9)

and using Eq. 2,

C = \frac{1}{2\pi} P \left( \int_{-\pi}^{\pi} S(f) \left( \frac{\omega}{1 - e^{if} \vec{d}} \right) \left( \frac{\omega}{1 - e^{-if} \vec{d}} \right)^{\top} df \right) P^{-1}.   (B10)

Returning to Eq. B3 and using Eq. 4, we have

p_k = \sum_{m=1}^{\infty} W^{m-1} \left( \sum_{\lambda \in \Lambda} A(\lambda) \lambda^{|k-m|} \right) v   (B11)
    = \sum_{\lambda \in \Lambda} A(\lambda) \sum_{m=1}^{\infty} W^{m-1} \lambda^{|k-m|}\, v   (B12)
    = \begin{cases} \sum_{\lambda \in \Lambda} A(\lambda) \left( \sum_{m=1}^{\infty} W^{m-1} \lambda^{m-k} \right) v & k < 1 \\ \sum_{\lambda \in \Lambda} A(\lambda) \left( \sum_{m=1}^{k} W^{m-1} \lambda^{k-m} + \sum_{m=k+1}^{\infty} W^{m-1} \lambda^{m-k} \right) v & k \ge 1 \end{cases}   (B13)
    = \begin{cases} \sum_{\lambda \in \Lambda} A(\lambda) \lambda^{-k} W^{-1} \left( \sum_{m=1}^{\infty} W^{m} \lambda^{m} \right) v & k < 1 \\ \sum_{\lambda \in \Lambda} A(\lambda) \left( \lambda^{k} W^{-1} \sum_{m=1}^{k} W^{m} \lambda^{-m} + W^{-1} \lambda^{-k} \sum_{m=k+1}^{\infty} W^{m} \lambda^{m} \right) v & k \ge 1 \end{cases}   (B14)
    = \begin{cases} \sum_{\lambda \in \Lambda} A(\lambda) \lambda^{-k} (\lambda^{-1} - W)^{-1} v & k < 1 \\ \sum_{\lambda \in \Lambda} A(\lambda) \left( (W^{k} - \lambda^{k})(W - \lambda)^{-1} + W^{k} (\lambda^{-1} - W)^{-1} \right) v & k \ge 1 \end{cases}   (B15)

Using Eq. 2,

p_k = \begin{cases} \sum_{\lambda \in \Lambda} A(\lambda) \lambda^{-k} \left( \dfrac{\omega}{\lambda^{-1} - \vec{d}} \right) & k < 1 \\ \sum_{\lambda \in \Lambda} A(\lambda)\, \mathrm{diag}\!\left( \dfrac{\vec{d}^{\,k} - \lambda^{k}}{\vec{d} - \lambda} + \dfrac{\vec{d}^{\,k}}{\lambda^{-1} - \vec{d}} \right) \omega & k \ge 1 \end{cases}   (B16)

Thus we have

PC = \sum_{k=0}^{\infty} p_{-k}^{\top} C^{-1} p_{-k}   (B17)
   = 2\pi \sum_{k=0}^{\infty} \left( \sum_{\lambda \in \Lambda} A(\lambda) \lambda^{k} \left( \frac{\omega}{\lambda^{-1} - \vec{d}} \right) \right)^{\top} B^{-1} \left( \sum_{\lambda \in \Lambda} A(\lambda) \lambda^{k} \left( \frac{\omega}{\lambda^{-1} - \vec{d}} \right) \right)   (B18)
   = 2\pi \sum_{\lambda, \lambda' \in \Lambda} \frac{A(\lambda) A(\lambda')}{1 - \lambda \lambda'} \left( \frac{\omega}{\lambda^{-1} - \vec{d}} \right)^{\top} B^{-1} \left( \frac{\omega}{(\lambda')^{-1} - \vec{d}} \right)   (B19)
   = 2\pi \sum_{i,j} \sum_{\lambda, \lambda' \in \Lambda} \frac{A(\lambda) A(\lambda')}{1 - \lambda \lambda'} \left( \frac{\omega_i}{\lambda^{-1} - d_i} \right) (B^{-1})_{ij} \left( \frac{\omega_j}{(\lambda')^{-1} - d_j} \right)   (B20)
   = 2\pi \sum_{i,j} \omega_i \left( \sum_{\lambda, \lambda' \in \Lambda} \frac{A(\lambda) A(\lambda')}{1 - \lambda \lambda'} \frac{1}{\lambda^{-1} - d_i} \frac{1}{(\lambda')^{-1} - d_j} \right) (B^{-1})_{ij}\, \omega_j,   (B21)

which gives the formula in the main text. Similar manipulations, with the help of Mathematica, give the more involved formula for MC.
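As a cross-check of Eqs. B17–B21 (and hence of Eq. 15 in the main text), PC can also be estimated directly from a simulation with the sample-covariance estimator described in the main text. In the sketch below (not from the paper), the input is generated as a weighted sum of two independent AR(1) processes chosen so that R_ss(t) = ½e^{−0.1|t|} + ½e^{−|t|}; the reservoir, run length, and seed are arbitrary choices, and the printed estimate can be compared against the closed-form value from Eq. 15.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 3, 200_000, 100

def ar1(lam, T):
    """Unit-variance AR(1) process with autocorrelation lam^|t|."""
    out = np.empty(T)
    out[0] = rng.standard_normal()
    noise = rng.standard_normal(T) * np.sqrt(1 - lam**2)
    for n in range(1, T):
        out[n] = lam * out[n - 1] + noise[n]
    return out

# R_ss(t) = 0.5 e^{-0.1|t|} + 0.5 e^{-|t|}.
s = np.sqrt(0.5) * ar1(np.exp(-0.1), T) + np.sqrt(0.5) * ar1(np.exp(-1.0), T)

# Random linear reservoir obeying Eq. (1).
W = rng.uniform(-1, 1, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
v = rng.uniform(0, 1, N)
x = np.zeros((T, N))
for n in range(T - 1):
    x[n + 1] = W @ x[n] + s[n] * v

# PC estimate: sum_{k=0}^{K} p_hat_{-k}^T C_hat^{-1} p_hat_{-k}, where p_hat_{-k}
# is the sample covariance of s(n + k) with x(n) and C_hat that of x(n).
burn = 1_000
xs, ss = x[burn:], s[burn:]
C_hat = xs.T @ xs / len(xs)
PC_hat = 0.0
for k in range(K + 1):
    p_hat = xs[: len(xs) - k].T @ ss[k:] / (len(xs) - k)
    PC_hat += p_hat @ np.linalg.solve(C_hat, p_hat)

print("simulation estimate of PC:", PC_hat)
```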

Appendix C: Derivation of the upper bound for PC

Recall that

PC_{\tau} = \frac{\langle s(t+\tau)\, \hat{s}(t+\tau) \rangle_t^2}{\langle \hat{s}(t)^2 \rangle_t}   (C1)

and

PC = \sum_{\tau=0}^{\infty} PC_{\tau}.   (C2)

As our problem setup naturally restricts us to causal linear filters, PC_τ is maximized with \hat{s}(t+\tau) = \sum_{n=1}^{\infty} s(t-n)\, k_{\tau}(n), with k_τ(n) a Wiener filter. In particular, suppose that k_τ(n) satisfies the Wiener–Hopf equation:

R_{ss}(\tau + t) = \sum_{m=1}^{\infty} R_{ss}(t - m)\, k_{\tau}(m).   (C3)

In matrix form, this reads

\begin{pmatrix} R_{ss}(\tau+1) \\ R_{ss}(\tau+2) \\ \vdots \end{pmatrix} = \begin{pmatrix} R_{ss}(0) & R_{ss}(-1) & R_{ss}(-2) & \dots \\ R_{ss}(1) & R_{ss}(0) & R_{ss}(-1) & \dots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \begin{pmatrix} k_{\tau}(1) \\ k_{\tau}(2) \\ \vdots \end{pmatrix}   (C4)

and so

\begin{pmatrix} k_{\tau}(1) \\ k_{\tau}(2) \\ \vdots \end{pmatrix} = \begin{pmatrix} R_{ss}(0) & R_{ss}(-1) & R_{ss}(-2) & \dots \\ R_{ss}(1) & R_{ss}(0) & R_{ss}(-1) & \dots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}^{-1} \begin{pmatrix} R_{ss}(\tau+1) \\ R_{ss}(\tau+2) \\ \vdots \end{pmatrix}.   (C5)

For ease of notation, we define R as

R := \begin{pmatrix} R_{ss}(0) & R_{ss}(-1) & R_{ss}(-2) & \dots \\ R_{ss}(1) & R_{ss}(0) & R_{ss}(-1) & \dots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}   (C6)

and

\vec{r}_{\tau} := \begin{pmatrix} R_{ss}(\tau+1) \\ R_{ss}(\tau+2) \\ \vdots \end{pmatrix},   (C7)

so in short, \vec{k}_{\tau} = R^{-1} \vec{r}_{\tau}. Then \langle s(t+\tau)\, \hat{s}(t+\tau) \rangle_t = \langle \hat{s}(t)^2 \rangle_t, and so

PC_{\tau} = \langle s(t+\tau)\, \hat{s}(t+\tau) \rangle_t = \sum_{n=1}^{\infty} R_{ss}(\tau+n)\, k_{\tau}(n) = \vec{r}_{\tau}^{\top} \vec{k}_{\tau} = \vec{r}_{\tau}^{\top} R^{-1} \vec{r}_{\tau}.   (C8)

As these \vec{k}_{\tau} are the causal linear filters that maximize the correlation coefficient between s(t+\tau) and \hat{s}(t+\tau), we have

PC \le \sum_{\tau=0}^{\infty} \vec{r}_{\tau}^{\top} R^{-1} \vec{r}_{\tau}   (C9)

for any linear recurrent network.
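The derivation above translates directly into a numerical recipe: truncate R and r⃗_τ, solve for k⃗_τ = R^{−1}r⃗_τ, and the squared correlation achieved by the resulting filter on simulated data should match r⃗_τ^⊤R^{−1}r⃗_τ. A brief sketch (not from the paper; the AR(1) input, truncation depth, forecast horizon, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
lam, M, tau, T = np.exp(-0.1), 50, 2, 100_000

Rss = lambda t: lam ** np.abs(t)               # R_ss(t) = e^{-0.1|t|}

# Truncated Wiener-Hopf solution, Eq. (C5): k_tau = R^{-1} r_tau.
i = np.arange(1, M + 1)
R = Rss(i[:, None] - i[None, :])
r_tau = Rss(tau + i)
k_tau = np.linalg.solve(R, r_tau)
print("predicted PC_tau:", r_tau @ k_tau)      # Eq. (C8)

# Empirical check on a simulated AR(1) signal with this autocorrelation.
s = np.empty(T)
s[0] = rng.standard_normal()
for n in range(1, T):
    s[n] = lam * s[n - 1] + np.sqrt(1 - lam**2) * rng.standard_normal()

# Forecast s(t + tau) from s(t - 1), ..., s(t - M) with the filter k_tau.
t = np.arange(M, T - tau)
past = np.stack([s[t - n] for n in i], axis=1)
s_hat = past @ k_tau
corr = np.corrcoef(s[t + tau], s_hat)[0, 1]
print("empirical squared correlation:", corr**2)
```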
