You are on page 1of 25

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
1

From Parameter Estimation to Dispersion


of Nonstationary Gauss-Markov Processes
Peida Tian, Student Member, IEEE, Victoria Kostina, Member, IEEE

Abstract—This paper provides a precise error analysis for mean squared error (MSE). Two commonly used criteria to
the maximum likelihood estimate âML (un 1 ) of the parameter a quantify the distortion of a lossy compression scheme are the
0
given samples un 1 = (u1 , . . . , un ) drawn from a nonstationary average distortion criterion and the excess-distortion probability
Gauss-Markov process Ui = aUi−1 + Zi , i ≥ 1, where U0 = 0,
a > 1, and Zi ’s are independent Gaussian random variables criterion. The rate-distortion theory, initiated by Shannon [11]
with zero mean and variance σ 2 . We show a tight nonasymptotic and extensively investigated by researchers [12–17], studies
exponentially decaying bound on the tail probability of the the optimal tradeoff between the rate R and the distortion.
estimation error. Unlike previous works, our bound is tight In the limit of large blocklength n, the minimum rate R
already for a sample size of the order of hundreds. We apply required to achieve average distortion d is given by the rate-
the new estimation bound to find the dispersion for lossy
compression of nonstationary Gauss-Markov sources. We show distortion function. The nonasymptotic version of the rate-
that the dispersion is given by the same integral formula that distortion problem [18–22] studies the rate-distortion tradeoff
we derived previously for the asymptotically stationary Gauss- for finite blocklength n. Our main contribution is a coding
Markov sources, i.e., |a| < 1. New ideas in the nonstationary case theorem that characterize the gap between the rate-distortion
include a deeper understanding of the scaling of the maximum function and the minimum rate R at blocklength n for the
eigenvalue of the covariance matrix of the source sequence, and
new techniques in the derivation of our estimation error bound. nonstationary Gauss-Markov source (a > 1), under the excess-
distortion probability criterion. We leverage our result on the
ML estimator to understand lossy compression as follows. We
Index Terms—Parameter estimation, maximum likelihood esti-
mator, unstable processes, finite blocklength analysis, lossy com- apply our bound on the estimation error of the ML estimator
pression, sources with memory, rate-distortion theory, covering to construct a typical set of the sequences whose estimated
in stochastic processes, adaptive control. parameter a is close to the true a. We then use the typical set in
our achievability proof of the nonasymptotic coding theorem.
Without loss of of generality, we assume that a ≥ 0
I. I NTRODUCTION
in this paper, since, otherwise, we can consider another
A. Overview random process {Ui0 }∞ i=1 defined by the invertible mapping
0
We consider two related problems that concern a scalar U i , (−1)i
Ui that satisfies Ui0 = (−a)Ui−1
0
+ (−1)i Zi , where
i
Gauss-Markov process {Ui }∞ i=1 , defined by U0 = 0 and
(−1) Zi ’s are also independent zero-mean Gaussian random
variables with variance σ 2 . We distinguish the following three
Ui = aUi−1 + Zi , ∀i ≥ 1, (1) cases:
• 0 < a < 1: the asymptotically stationary case;
where Zi ’s are independent Gaussian random variables with
2 • a = 1: the unit-root case;
zero mean and variance σ .
• a > 1: the nonstationary case.
The first problem is parameter estimation: given sample
n In this paper, we mostly focus on the nonstationary case.
u1 drawn from the Gauss-Markov source, we seek to design
and analyse estimators for the unknown system parameter a.
The consistency and asymptotic distribution of the maximum B. Motivations
likelihood (ML) estimator have been studied in the literature [2–
Estimation of parameters of stochastic processes from their
7]. Our main contribution is a large deviation bound on
realizations has many applications. In the statistical analysis of
the estimation error of the ML estimator. Our numerical
economic time series [2, 23, 24], the Gauss-Markov process
experiments indicate that our new bound is tighter than
{Ui }∞i=1 is used to model the varying prices of a certain
previously known results [8–10].
commodity at time i, and the ML estimate of the unknown
The second problem is the nonasymptotic performance of
coefficient a is then used to predict future prices. [25] and [26,
the optimal lossy compressors of the Gauss-Markov process.
Sec. 5] used the Gauss-Markov process with a = 1 to model
An encoder outputs nR bits for each realization un1 . Once the
the stochastic structure of the velocity of money. The Gauss-
decoder receives the nR bits, it produces ûn1 as a reproduction
Markov process, also known as the autoregressive process of
of un1 . The distortion between un1 and ûn1 is measured by the
order 1 (AR(1)), is a special case of the general autoregressive-
P. Tian and V. Kostina are with the Department of Electrical Engineering, moving-average (ARMA) model [27, 28], for which various
California Institute of Technology. (e-mail: {ptian, vkostina}@caltech.edu). estimation and prediction procedures have been proposed, e.g.
This research was supported in part by the National Science Foundation the Box-Jenkins method [28]. The Gauss-Markov process is
(NSF) under Grant CCF-1751356. A preliminary version [1] of this paper was
accepted for publication in the IEEE International Symposium on Information also a special case of the linear state-space model (e.g. [29,
Theory, Paris, France, July 2019. Chap. 5]) that is popular in control theory. One of the problems

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
2

in control is system identification [30], which is the problem to denote the complement of a set S. All logarithms and
of building mathematical models using measured data from exponentials are base e.
unknown dynamical systems. Parameter estimation is one of
the common methods used in system identification where the II. P REVIOUS W ORKS
dynamical system is modeled by a state-space model [30, Chap. A. Parameter Estimation
7] with unknown parameters. In modern data-driven control
systems, where the goal is to control an unknown nonstationary The maximum likelihood (ML) estimate âML (un1 ) of the
system given measured data, parameter estimation methods parameter a given samples un1 = (u1 , . . . , un )0 drawn from
are used as a first step in designing controllers [10] [31, Sec. the Gauss-Markov source is given by
Pn−1
1.2]. In speech signal processing, the linear predictive coding n i=1 ui ui+1
algorithm [32] relies on parameter estimation (the ordinary âML (u1 ) = P n−1 2 . (2)
i=1 ui
least squares estimate, or, equivalently, the maximum likelihood
estimate assuming Gaussian noise) to fit a higher-order Gauss- The derivation of (2) is straightforward, e.g. [47, App. F-
Markov process, see [32, App. C]. A fine-grained analysis of A]. The problem is to provide performance guarantees of
the ML estimate is instrumental in optimizing the design of all âML (un1 ). This simply formulated problem has been widely
these systems. Our nonasymptotic analysis leading up to a large studied in the literature. Our main contribution in this paper is
deviation bound for the ML estimate in our simple setting can a nonasymptotic fine-grained large deviations analysis of the
provide insights for analyzing more complex random processes, estimation error.
e.g., higher-order autoregressive processes and vector systems. The estimate âML (un1 ) in (2) has been extensively studied in
Understanding finite-blocklength lossy compression of the the statistics [4, 6] and economics [2, 3] communities. Mann
Gauss-Markov process fits into a continuing effort by many and Wald [2] and Rubin [3] showed that the estimation error
researchers to advance the rate-distortion theory of information âML (U1n ) − a converges to 0 in probability for any a ∈ R. Ris-
sources with memory, see [13–17, 33–43], as well as into sanen and Caines [6] later proved that âML (U1n ) − a converges
a newer push [18–22, 44–49] to understand the fundamental to 0 almost surely for 0 < a < 1. To better understand the finer
limits of low latency communication. There is a tight connection scaling of the error âML (U1n ) − a, researchers turned to study
between lossy compression of the nonstationary Gauss-Markov the limiting distribution of the normalized estimation error
process and control of an unstable linear system under h(n)(âML (U1n ) − a) for a careful choice of the standardizing
communication constraints [50, 51]. Namely, the minimum function h(n):
channel capacity needed to achieve a given LQG (linear
q
n
 1−a2 , |a| < 1,

quadratic Gaussian) cost for the plant [50, Eq. (1)] is lower-

bounded by the causal rate-distortion function of the Gauss- h(n) , √n , |a| = 1, (3)
2
 |a|n ,

Markov process [50, Eq. (9)]. See [51, Th. 1] for more details.

|a| > 1.
a2 −1
Being more restrictive on the coding schemes, the causal rate-
distortion function is further lower-bounded by the traditional With the above choices of h(n), Mann and Wald [2] and
rate-distortion function. The result in this paper on the rate- White [4] showed that the distribution of the normalized
distortion tradeoff in the finite blocklength regime provides estimation error h(n)(âML (U1n ) − a) converges to N (0, 1) for
a lower bound on the minimum communication rate required |a| < 1; to the standard Cauchy distribution for |a| > 1; and
to ensure that the LQG cost stays below a desired threshold for |a| = 1, to the distribution of
with desired probability at the end of a finite horizon. Finally, B 2 (1) − 1
the aforementioned linear predictive coding algorithm [32] is R1 , (4)
2 0 B 2 (t) dt
connected to lossy compression of autoregressive processes,
see a recent historical note by Gray [52, p.2]. where {B(t) : t ∈ [0, 1]} is a Brownian motion.
Generalizations of the above results in several directions have
also been investigated. In [2, Sec. 4], the maximum likelihood
C. Notations estimator for the p-th order stationary autoregressive processes
For n ∈ N, we use [n] to denote the set {1, 2, ..., n}. We use with Zi ’s being i.i.d. zero-mean and bounded moments random
the standard notations for the asymptotic behaviors O(·), o(·), variables (not necessarily Gaussian) was shown √ to be weakly
Θ(·), Ω(·) and ω(·). Namely, let f (n) and g(n) be two consistent, and the scaled estimation errors n(âj −aj ) for j =
functions of n, then f (n) = O(g(n)) means that there exists 1, . . . , p were shown to converge in distribution to the Gaussian
a constant c > 0 and n0 ∈ N such that |f (n)| ≤ M |g(n)| for random variables as n tends to infinity. Anderson [5, Sec. 3]
any n ≥ n0 ; f (n) = o(g(n)) means limn→∞ f (n)/g(n) = 0; studied the limiting distribution of the maximum likelihood
f (n) = Θ(g(n)) means there exist positive constants c1 , c2 estimator for a nonstationary vector version of the process (1).
and n0 ∈ N such that c1 g(n) ≤ f (n) ≤ c2 g(n) for any Chan and Wei [7] studied the performance of the estimation
n ≥ n0 ; f (n) = Ω(g(n)) if and only if g(n) = O(f (n)); and error when a is not a constant but approaches to 1 from below
f (n) = ω(g(n)) if and only if limn→∞ f (n)/g(n) = +∞. For in the order of 1/n. The problem of estimating the parameter
a matrix M, we denote by M0 its transpose, by kMk its operator a from a block of outcomes of the Gauss-Markov source (1)
norm (the largest singular value) and by µ1 (M) ≤ . . . ≤ µn (M) is one of the simplest versions in recent studies of machine
its eigenvalues listed in nondecreasing order. We use S c learning for dynamical systems [10, 53–56]. One objective of

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
3

those studies is to obtain tight performance bounds on the least- main result on parameter estimation is a tight nonasymptotic
squares estimates of the system parameters A, B, C, D from a lower bound on P + (n, a, η) and P − (n, a, η). For larger a,
single input / output trajectory {Wi , Yi }ni=1 in the following the lower bound becomes larger, which suggests that unstable
state-space model, e.g. [54, Eq. (1)–(2)]: systems are easier to estimate than stable ones, an observation
consistent with [53]. The proof is inspired by Rantzer [10, Lem.
Xi+1 = AXi + BWi + Zi , (5)
5], but our result improves Rantzer’s result (10) and Bercu and
Yi = CXi + DWi + Vi , (6) Touati’s result (11), see Fig. 1 for a comparison. Most of our
where Xi , Wi , Zi , Vi ’s are random vectors of certain dimen- results generalize to the case where Zi ’s are i.i.d. sub-Gaussian
sions and the system parameters A, B, C, D are matrices of random variables, see Theorem 4 in Section III-D below.
appropriate dimensions. The Gauss-Markov process in (1) can
be written as the state-space model by choosing A = a being
a scalar, B = D = 0, C = 1 and Vi = 0. For stable vector B. Nonasymptotic Rate-distortion Theory
systems, that is, kAk < 1, Oymak and Ozay [54, Thm. √ 3.1]
showed that the estimation error in spectral norm is O(1/ n) The rate-distortion theory studies the problem of compressing
with high probability, where n is the number of samples. a generic random process {Xi }∞ i=1 with minimum distortion.

For the subclass of the regular unstable systems [56, Def. 3], Given a distortion threshold d > 0, an excess-distortion
Faradonbeh et al. [56, Thm. 1] proved that the probability of probability  ∈ (0, 1) and the number of codewords M ∈ N,
estimation error exceeding a positive threshold in spectral norm an (n, M, d, ) lossy compression code for a random vector
n n
decays exponentially in n. For the Gauss-Markov processes X 1 consists of an encoder fn : R → [M ], and a decoder
n n n
considered in the present paper, Simchowitz et al. [53, Thm. g n : [M ] → R , such that P [d (X 1 , gn n (X1 ))) > d] ≤ ,
(f
B.1] and Sarkar and Rakhlin [55, Prop. 4.1] presented tail where d(·, ·) is the distortion measure. This paper n n
considers
n
bounds on the estimation error of the ML estimate. the mean squared error (MSE) distortion: ∀ x 1 , y1 ∈R ,

Another line of work closely related to this paper is the large n


deviation principle (LDP) [57, Ch. 1.2] on âML (U1n )−a. Given 1X
d(xn1 , y1n ) , (xi − yi )2 . (12)
+ −
an error threshold η > 0, define P (n, a, η) and P (n, a, η) n i=1
as follows:
1 The minimum achievable code size and source coding rate are
P + (n, a, η) , − log P [âML (U1n ) − a > η] , (7) defined respectively by
n
1
P − (n, a, η) , − log P [âML (U1n ) − a < −η] . (8)
n M ? (n, d, ) , min {M ∈ N : ∃ (n, M, d, ) code} , (13)
We also define P (n, a, η) as 1
R(n, d, ) , log M ? (n, d, ). (14)
n
1 n
P (n, a, η) , − log P [|âML (U1 ) − a| > η] . (9)
n In this paper, we approximate the nonasymptotic coding rate
The large deviation theory studies the rate functions, defined R(n, d, ) for the nonstationary Gauss-Markov source.
as the limits of P + (n, a, η), P − (n, a, η) and P (n, a, η), as n Another related and widely studied setting is compression
goes to infinity. Bercu et al. [8, Prop. 8] found the rate function under the average distortion criterion. Given a distortion
for the case of 0 < a < 1. For a ≥ 1, Worms [9, Thm. 1] threshold d > 0 and the number of codewords M ∈ N,
proved that the rate functions can be bounded from below an (n, M, d) lossy compression code for a random vector
implicitly by the optimal value of an optimization problem. X1n consists of an encoder fn : Rn → [M ], and a decoder
These studies of the limiting distribution and the LDP of gn : [M ] → Rn , such that E [d (X1n , gn (fn (X1n )))] ≤ d. Simi-
the estimation error are both asymptotic. In this paper, we larly, one can define M ? (n, d) and R(n, d) as the minimum
consider the nonasymptotic analysis of the estimation error. Two achievable code size and source coding rate, respectively, under
nonasymptotic lower bounds on P + (n, a, η) and P − (n, a, η) the average distortion criterion. The traditional rate-distortion
are available in the literature. For any a ∈ R, Rantzer [10, Th. theory [11, 12, 14, 15, 34, 59] showed that the limit of the
4] showed that operational source coding rate R(n, d) as n tends to infinity
1 equals the informational rate-distortion function for a wide
P + (n, a, η) (and P − (n, a, η)) ≥ log(1 + η 2 ). (10) class of sources. For discrete memoryless sources, Zhang,
2
Yang and Wei in [19] showed that R(n, d) approaches the
Bercu and Touati [58, Cor. 5.2] proved that
rate-distortion function in the order log n/2n + o(log n/n).
+ − η2 For abstract alphabet memoryless sources, Yang and Zhang
P (n, a, η) (and P (n, a, η)) ≥ , (11)
2(1 + yη ) in [20, Th. 2] showed a similar convergence rate.
where yη is the unique positive solution to (1 + x) log(1 + x) − Under the excess-distortion probability criterion, one can also
2
x − η = 0 in x. Both bounds (10) and (11) do not capture study the nonasymptotic behavior of the minimum achievable
the dependence on a and n, and are the same for P (n, a, η)+ excess-distortion probability ? (n, d, M ):
and P − (n, a, η). All the bounds in [10, 53–56] are either
optimal only order-wise or involve implicit constants. Our ? (n, d, M ) , inf { > 0 : ∃ (n, M, d, ) code} . (15)

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
4

Marton’s excess distortion exponent [18, Th. 1, Eq. (2)-(3), and A-C below. For η > 0 and n ∈ N, we define the following
(20)] showed that for discrete memoryless sources PX , it holds sets
that 
1

+
Sn , s ∈ R : s > 0, α` < , ∀` ∈ [n] , (22)
2σ 2
 
1 ? log n
− log  (n, d, M ) = min D(PX̂ ||PX ) + O , 
1

n PX̂ n Sn− , s ∈ R : s > 0, β` < , ∀` ∈ [n] . (23)
(16) 2σ 2

where the minimization is over all probability distributions Theorem 1. For any constant η > 0, the estimator (2) satisfies
PX̂ such that RX̂ (d) ≥ lognM , where M is such that lognM for any n ≥ 2,
is a constant, RX̂ (d) denotes the rate-distortion function of n−1
1 X
P + (n, a, η) ≥ sup log 1 − 2σ 2 α` ,

a discrete memoryless source with single-letter distribution (24)
+
s∈Sn 2n
PX̂ , and D(·||·) denotes the Kullback-Leibler divergence. As `=1
n−1
pointed out by [21, p.2], for fixed d > 0 and  ∈ (0, 1), even the 1 X
P − (n, a, η) ≥ sup log 1 − 2σ 2 β` ,

asymptotic behavior of R(n, d, ) is unanswered by Marton’s (25)

s∈Sn 2n
`=1
bound in (16). Ingber and Kochman [21] (for finite-alphabet
and Gaussian sources) and Kostina and Verdú [22] (for abstract where α` and β` are defined in (19) and (21), respectively,
sources) showed that the minimum achievable source coding and Sn+ and Sn− are defined in (22) and (23), respectively.
rate R(n, d, ) satisfies a Gaussian approximation:
Theorem 1 is a useful result for numerically computing
r lower bounds on P + (n, a, η) and P − (n, a, η). In Fig. 1, we
−1 V(d)
R(n, d, ) ≈ RX (d) + Q () , (17) plot our lower bounds in Theorem 1, previous results in (10)
n by Rantzer and (11) by Bercu and Touati, and a simulation
where V(d) is the dispersion of the source (defined as the result. As one can see, our bound in Theorem 1 is much tighter
variance of the tilted information random variable, details later) than previous results.
and Q−1 denotes the inverse q-function. In this paper, by The proof of Theorem 1, presented in Appendix A-A below,
extending our previous analysis [47, Th. 1] of the stationary is a detailed analysis of the Chernoff bound using the tower
Gauss-Markov source to the nonstationary one, we establish a property of conditional expectations. The proof is motivated
Gaussian approximation in the form of (17) for the nonstation- by [10, Lem. 5], but our analysis is more accurate and the result
ary Gauss-Markov sources. One of the key ideas behind this is significantly tighter, see Fig. 1 and Fig. 3 for comparisons.
2
extension is to construct a typical set using the ML estimate One recovers Rantzer’s lower bound (10) by setting s = η/σ
of a, and to use our estimation error bound to probabilistically and bounding α` as α` ≤ α1 (due to the monotonicity of α`
characterize that set. shown in Appendix A-B below) in Theorem 1. We explicitly
state where we diverge from [10, Lem. 5] in the proof in
Appendix A-A below.
III. PARAMETER E STIMATION Remark 1. In view of the Gärtner-Ellis theorem [57, Th. 2.3.6],
A. Nonasymptotic Lower Bounds we conjecture that the bounds (24) and (25) can be reversed
in the limit of large n:
We first present our nonasymptotic bounds on P + (n, a, η) n−1
and P − (n, a, η), defined in (7) and (8) above, respectively. 1 X
lim sup P + (n, a, η) ≤ lim sup sup log 1 − 2σ 2 α` ,

We define two sequences {α` }`∈N and {β` }`∈N as follows. n→∞ n→∞ s∈Sn
+ 2n
`=1
Let σ 2 > 0 and a > 1 be fixed constants. For η > 0 and a (26)
parameter s > 0, let α` be the following sequence
and similarly for (25).
σ 2 s2 − 2ηs
α1 , , (18)
2 B. Asymptotic Lower Bounds
2 2
[a + 2σ s(a + η)]α`−1 + α1
α` = , ∀` ≥ 2. (19) We next present our bounds on the error exponents, that
1 − 2σ 2 α`−1
is, the limits of P + (n, a, η), P − (n, a, η) and P (n, a, η) as n
Similarly, let β` be the following sequence tends to infinity. To take limits using (24) and (25), we need
to understand the two sequences of sets Sn+ and Sn− . Define
σ 2 s2 − 2ηs the limits of the sets as
β1 , , (20)
2 +
\
[a2 + 2σ 2 s(−a + η)]β`−1 + β1 S∞ , Sn+ , (27)
β` = , ∀` ≥ 2. (21) n≥1
1 − 2σ 2 β`−1 \

S∞ , Sn− . (28)
Note the subtle difference between (19) and (21): there is a n≥1
negative sign in the numerator in (21). Both sequences depend
We have the following properties.
on η and s. We derive closed-form expressions and analyze
the convergence properties of α` and β` in Appendices A-B Lemma 1. Fix any constant η > 0.

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
5

Fig. 1: Numerical simulations and lower bounds on P + (n, a, η): Fig. 2: Numerical computation of the sets Sn+ for a = 1.2 and
We choose a = 1.2 and η = 10−3 . For each n, we η = 0.1. Each horizontal line corresponds to n = 1, ..., 5 in
generate N = 106 independent samples un1 from the the bottom-up order. Within each horizontal line, the red thick
Gauss-Markov process (1). We approximate P + (n, parts denote the ranges of s for which αn < 2σ1 2 , and the blue
 a, η) by thin region is where αn ≥ 2σ1 2 . The plot for Sn− is similar.
− n1 log N1 # {samples un1 with âML (un1 ) − a > η} . We plot
lower bounds on P + (n, a, η) by Rantzer (10), Bercu and Touati
in (11), our nonasymptotic bound in (24) and the asymptotic
where
bound in Theorem 2 in Section III-B below. 
log a,
 0 < η ≤ η1 ,
− 1 2aη−(a2 −1)
• (Monotone decreasing sets) For any n ≥ 1, we have I (a, η) , 2 log 1−(η−a)2 , η1 < η < η2 , (36)

log(2η − a), η ≥ η2 ,


+
Sn+1 ⊆ Sn+ , Sn+1 ⊆ Sn− . (29)
•(Limits of the sets) It holds that with the thresholds η1 and η2 given by
a2 − 1
 
+ 2η η1 , , (37)
S∞ = 0, 2 , (30) a √
σ
3a + a2 + 8


 η 2 , . (38)
S∞−
% 0, 2 . (31) 4
σ Remark 2. The results in (30)-(31) and (33)-(34) indicate the
The proof of Lemma 1 is presented in Appendix A-D below. asymmetry between P + (n, a, η) and P − (n, a, η): the set S∞ −
+ − + + −
The exact characterization of Sn and Sn for each n using η is has a larger range than S∞ , and I (a, η) > I (a, η), which
involved. One can see from the definitions (22) and (23) that suggests that the maximum likelihood estimator âML (U1n ) is
( p ) more likely to underestimate a than to overestimate it.
+ − η + 1 + η2
S1 = S1 = s ∈ R : 0 < s < . (32) Fig. 3 presents a comparison of (35), Rantzer’s bound (10)
σ2
and Bercu and Touati (11). Our bound (35) is tighter than both
+
To obtain the set Sn+1 from Sn+ , we need to solve αn+1 < 2σ1 2 , of them for any η > 0.
which is equivalent to solving an additional inequality involving
a polynomial of degree n + 2 in s (using the closed-form C. Decreasing Error Thresholds
expression for αn+1 in (128) in Appendix A-B below). Fig. 2
presents a plot of Sn+ for n = 1, ..., 5. Despite the complexity When the number of samples n increases, it is natural to also
of the sets Sn+ and Sn− , Lemma 1 shows their monotonicity have smaller error thresholds η. In this section, we consider
property and limits. the regime where the error threshold η = ηn > 0 is a sequence
Combining Theorem 1 and Lemma 1, we obtain the decreasing to 0. In this setting, Theorem 1 still holds and the
following lower bounds on the error exponents. The proof proof stays the same, except that we replace α` and β` , by the
is given in Appendix A-E below. length-n sequences αn,` and βn,` for ` = 1, . . . , n, respectively,
where αn,` and βn,` now depend on ηn instead of a constant
Theorem 2. Fix any constant η > 0. For the ML estimator (2), η:
the following three inequalities hold:
σ 2 s2 − 2ηn s
lim inf P + (n, a, η) ≥ I + (a, η) , log(a + 2η), (33) αn,1 , 2
, (39)
n→∞
2 2
lim inf P − (n, a, η) ≥ I − (a, η), (34) α = [a + 2σ s(a + ηn )]αn,`−1 + αn,1 , ∀` = 2, . . . , n.
n→∞ n,`
1 − 2σ 2 αn,`−1
lim inf P (n, a, η) ≥ I − (a, η), (35) (40)
n→∞

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
6

variables. This general result is of independent interest and


will not be used in the rest of the paper.
Definition 1 (sub-Gaussian random variable, e.g. [60, Def.
2.7]). Fix σ > 0. A random variable Z ∈ R with mean µ
is said to be σ-sub-Gaussian with variance proxy σ 2 if its
moment-generating function (MGF) satisfies
σ 2 s2
E[es(Z−µ) ] ≤ e 2 , (46)
for all s ∈ R.
One important property of σ-sub-Gaussian random variables
is the following well-known bound on the MGF of quadratic
functions of σ-sub-Gaussian random variables.
Lemma 2 ([10, Prop. 2]). Let Z be a σ-sub-Gaussian random
Fig. 3: Comparisons lower bounds on lim inf n→∞ P (n, a, η): variable with mean µ. Then
For a = 1.2, we plot the three lower bounds in Rantzer (10), sµ2
 
1
E exp(sZ 2 ) ≤ √
 
Bercu and Touati (11) and our (35) in Theorem 2. exp (47)
1 − 2σ 2 s 1 − 2σ 2 s
1
for any s < 2σ 2 .
The sequence βn,` is defined in a similar way. For √ Theorem 2 Equality holds in (46) and (47) when Z is Gaussian. In
to remain valid, we require ηn no smaller than 1/ n to ensure
particular, the right side of (47) is the MGF of the noncentral
that the right sides of (24)-(25) still converge to the right sides
χ2 -distributed random variable Z 2 .
of (33)-(34), respectively. Let ηn be a positive sequence such
that Theorem 4 (Generalization to sub-Gaussian case). Theorems 1–

1
 3 and Lemma 1 remain valid for the estimator (2) when Zi ’s
ηn = ω √ . (41) in (1) are i.i.d. zero-mean σ-sub-Gaussian random variables.
n
The generalizations of Theorems 1–3 and Lemma 1 from
Theorem 3. For any σ 2 > 0 and a > 1, let ηn > 0 be a
Gaussian to sub-Gaussian Zi ’s only require minor changes in
positive sequence satisfying (41). Then, Theorem 1 holds with
the corresponding proofs. See Appendix A-G for the details.
α` replaced by αn,` , and β` by βn,` , and Theorem 2 holds
with (33) and (34) replaced, respectively, by
IV. T HE D ISPERSION OF
lim inf P + (n, a, ηn ) ≥ log a, (42) A N ONSTATIONARY G AUSS -M ARKOV S OURCE
n→∞
lim inf P − (n, a, ηn ) ≥ log a. (43) A. Rate-distortion functions
n→∞
For a generic random process {Xi }∞ i=1 , the n-th order rate-
The proof of Theorem 3 is presented in Appendix A-F below.
distortion function RX1n (d) is defined as
Theorem 3 is a quite strong result as it states that even if the
error threshold is a sequence decreasing to zero, as long as (41) 1
RX1n (d) , inf I(X1n ; Y1n ), (48)
is satisfied, the probability of estimation error exceeding such PY n |X n :
1 1
n
decreasing thresholds is still exponentially small, with exponent E[d(X1n ,Y1n )]≤d
being at least log a.
where X1n , (X1 , . . . , Xn )0 is the n-dimensional random
2
Corollary 1. For any σ > 0 and any a > 1, there exists a vector determined by the random process, I(X1n ; Y1n ) is the
constant c ≥ 12 log(a) such that for all n large enough, mutual information between X1n and Y1n , d is a given distortion
" r # threshold, and d (·, ·) is the distortion measure defined in (12) in
log log n Sec. II-B above. The rate-distortion function RX (d) is defined
n
P |âML (U1 ) − a| ≥ ≤ 2e−cn . (44)
n as
Corollary 1 is used in Section IV-E below to derive the RX (d) , lim sup RX1n (d). (49)
dispersion of nonstationary Gauss-Markov sources. The proof n→∞

of Corollary 1 is by applying Theorem 3 with ηn chosen as For a wide class of sources, the rate-distortion function RX (d)
r has been shown to be equal to the minimum achievable
log log n
ηn = . (45) source coding rate under the average distortion criterion, in
n the limit of n → ∞, see [11] for discrete memoryless sources
and [12] for general ergodic sources. In particular, Gray’s
D. Generalization to sub-Gaussian Zi ’s coding theorem [15, Th. 2] for the Gaussian autoregressive
In this section, we generalize the above results to the processes directly implies that for the Gauss-Markov source
case where Zi ’s in (1) are zero-mean σ-sub-Gaussian random {Ui }∞i=1 in (1) for any a ∈ R, its rate-distortion function RU (d)

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
7

equals the minimum achievable source coding rate under the


average distortion criterion as n tends to infinity. The n-th
order rate-distortion function RU1n (d) of the Gauss-Markov
source is given by the n-th order reverse waterfilling, e.g. [15,
Eq. (22)]:
n
σ2
 
1X1
RU1n (d) = log max µn,i , , (50)
n i=1 2 θn
n
σ2
 
1X
d= min θn , , (51)
n i=1 µn,i

where θn > 0 is the n-th order water level, and µn,i ’s for
i ∈ [n] (sorted in nondecreasing order) are the eigenvalues of
the n × n matrix F0 F with F being an n × n lower triangular
matrix defined as

1,
 i = j, Fig. 4: Rate-distortion functions: RU (d) in (53) with a = 1.2,
(F)ij , −a, i = j + 1, (52) and RZ (d) in (56).

0, otherwise.

One can check that σ 2 (F0 F)−1 is the covariance matrix of U1n . C. Informational Dispersion
The way that we use (50)-(51) is to first solve the n-th order The d-tilted information [22, Def. 6] is the key random
water level θn using (51) for a given distortion threshold d, variable in our nonasymptotic analysis of R(n, d, ). Under
then plugging that water level into (50) to obtain RU1n (d). The different names, the d-tilted information has also been studied
rate-distortion function RU (d) of the Gauss-Markov source is by Blahut [61, Th. 4] and Kontoyiannis [36, Sec. III-A]. Using
given by the limiting reverse waterfilling: the definition in [22, Def. 6], the d-tilted information U1n (un1 , d)
n
Z π in u 1 is
σ2
 
1 1
RU (d) = log max g(w), dw, (53) U1n (un1 , d) , −λ?n d − log E exp (−λ?n d(un1 , V1?n )) , (58)
2π −π 2 θ
Z π
σ2
 
1 ?
(54) where λn?nis the negative slope of RU1 (d) at the distortion level
n
d= min θ, dw,
2π −π g(w) d and V1 is the random variable that achieves the infimum
in (48) for U n . In [47, Lem. 7, Eq. (228)], by a decorrelation
where θ > 0 is the limiting water level and g(w) is a function argument, we1obtained the following expression for the d-tilted
from [−π, π] to R given by information for the Gauss-Markov source: for any a ∈ R and
2
g(w) , 1 + a − 2a cos(w). (55) any n ∈ N,
n
!
2 2
X min(θ n , σn,i ) x i
Moreover, it is well-known [11] that the rate-distortion function U1n (un1 , d) = 2 −1 +
of the Gaussian memoryless source {Zi }∞ (the special case i=1
2θn σn,i
i=1
when a is set to 0 in the Gauss-Markov model) is n 2
1X max(θn , σn,i )
log , (59)
σ2 2 i=1 θn
 
1
RZ (d) = max 0, log . (56)
2 d
where θn > 0 is given by (51), xn1 , S0 un1 with S being an
See Fig. 4 for a plot of RU (d) and RZ (d). n × n orthonormal matrix that diagonalizes (F0 F)−1 , and

2 σ2
σn,i , (60)
B. Operational Dispersion µn,i
To characterize the convergence rate of the minimum with µn,i ’s being the eigenvalues of the n × n matrix F0 F. We
achievable source coding rate R(n, d, ) (defined in (14) in refer to the random variable X1n , defined by
Section II-B above) to the rate-distortion function, we define
X1n , S0 U1n , (61)
the operational dispersion VU (d) for the Gauss-Markov source
as as the decorrelation of U1n . Note that the decorrelation X1n has

R(n, d, ) − RU (d)
2 independent coordinates and
VU (d) , lim lim sup n , (57) 2
→0 n→∞ Q−1 () Xi ∼ N (0, σn,i ). (62)
where Q−1 denotes the inverse Q-function. The main result in Using (50)-(51) and (62), one can show [47, Eq. (55) and
the second part of this paper gives VU (d) for the nonstationary (228)] that the d-tilted information U1n (un1 , d) in un1 for the
Gauss-Markov source. Gauss-Markov source satisfies U1n (un1 , d) = X1n (xn1 , d). The

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
8

minimum achievable source coding rates (defined in (14)) for D. A Few Remarks
lossy compression of U1n and X1n are equal, as are their rate- In view of (54), there are two special water levels θmin and
distortion functions: RU1n (d) = RX1n (d), see [47, Sec. III.A] θmax , defined as follows:
for the detail. It is known [22, Property 1] that the d-tilted
information U1n (un1 , d) satisfies (by the Karush-Kuhn-Tucker σ2 σ2
θmin , min = (67)
conditions for the optimization problem (48), roughly speaking) w∈[−π,π] g(w) (a + 1)2

E U1n (U1n , d) = RU1n (d).


 
(63) and
σ2 σ2
The informational dispersion VU (d) is then defined as the limit θmax , max = . (68)
w∈[−π,π] g(w) (a − 1)2
of the variance of the d-tilted information normalized by n:
The critical distortion dc is defined as the distortion correspond-
1 ing to the water level θmin , and by (54) we have
VU (d) , lim sup Var U1n (U1n , d) .
 
(64)
n→∞ n
σ2
dc = θmin = . (69)
By decorrelating the Gauss-Markov source and analyzing U1n (a + 1)2
the limiting behavior of the eigenvalues of the covariance
The maximum distortion dmax is defined as the distortion
matrix of U1n , we obtain the following reverse waterfilling
corresponding to the water level θmax . By (54), we have
representation for the informational dispersion. The proof is
Z π
given in Appendix B-A below. 1 σ2
dmax = dw. (70)
Lemma 3. The informational dispersion of the nonstationary 2π −π g(w)
Gauss-Markov source is given by Using similar techniques as in [47, Eq. (169)–(172)], one can
Z π "  2 # compute the integral in (70) as
1 σ2
VU (d) = min 1, dw, (65) σ2
4π −π θg(w) dmax = . (71)
a2 − 1
where θ > 0 is given in (54), and g is in (55). In this paper, we always consider a fixed distortion threshold
d such that 0 < d < dmax .
Notice that the informational dispersion in the nonstationary
case is given by the same expression as in the stationary Remark 3. Gray [15, Eq. (24)] showed the following relation
case [47, Eq. (57)]. It is known, e.g. [22, Eq. (94)] and [21, between the rate-distortion function RU (d) of the Gauss-
Sec. IV], that the informational dispersion for the Gaussian Markov source and RZ (d) of the Gaussian memoryless source:
memoryless source {Zi }∞ i=1 is
(
RU (d) = RZ (d), d ∈ (0, dc ],
(72)
1 RU (d) > RZ (d), d ∈ (dc , dmax ).
VZ (d) = , ∀d ∈ (0, σ 2 ). (66)
2
Using Lemma 3 above, one can easily show (in the same way
See Fig. 5 for a plot of VU (d) and VZ (d). as [47, Cor. 1]) that their dispersions are also comparable:
(
VU (d) = VZ (d), d ∈ (0, dc ],
(73)
VU (d) < VZ (d), d ∈ (dc , σ 2 ).

The results in (72)-(73) imply that for low distortions d ∈


(0, dc ), the minimum achievable source coding rate in compress-
ing the Gauss-Markov source and the Gaussian memoryless
source are the same up to second-order terms, a phenomenon we
observed in the stationary case as well [47, Cor. 1]. See Fig. 4
and Fig. 5 for a visualization of (72) and (73), respectively.
Remark 4. For the function RU (d), we show that

RU (dmax ) = log a. (74)

This result has an interesting connection to the problem of


control under communication constraints [62] [63, Th. 1] [64,
Prop. 3.1], where it was shown that the minimum rate to
asymptotically stabilize a linear, discrete-time, scalar system
Fig. 5: Dispersions :VU (d) in (65) with a = 1.2, and VZ (d) is also log a, suggesting that stability is unattained with any
in (66). rate lower than log a even if an infinite lookahead is allowed.
The derivation of (74) is presented in Appendix B-C below.

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
9

Remark 5. Let P1 and P2 be the two special points on the via the following second-order refinement of the “lossy AEP”
curve VU (d) at distortions dc and dmax , respectively. Then, (asymptotic equipartition property) for the nonstationary Gauss-
the coordinates of P1 and P2 are given by Markov sources.
(1 + a2 )(a − 1)
 
Lemma 4 (Second-order lossy AEP for the nonstationary
P1 = (dc , 1/2), P2 = dmax , . (75)
2(a + 1)3 Gauss-Markov sources). For the Gauss-Markov source with
a > 1, let PV1?n be the random variable that attains the
The derivation for P2 is the same as that in the stationary
minimum in (48) with X1n there replaced by U1n . It holds that
case [47, Eq. (61)] except that we need to compute the residue  
at 1/a instead of at a since we now have a > 1, see [47, App. 1 n 1
P log ≥  U n (U , d) + p(n)
1 ≤ ,
B-A] for details. n
PV1?n (B (U1 , d)) 1
q(n)
(81)
E. Second-order Coding Theorem where
Our main result establishes the equality between the opera-
p(n) , c1 (log n)c2 + c3 log n + c4 , (82)
tional dispersion and the informational dispersion.
q(n) , Θ(log n), (83)
Theorem 5 (Gaussian approximation). For the Gauss-Markov
source (1) with a > 1. For any fixed excess-distortion with positive constants ci ’s, i = 1, ..., 4.
probability  ∈ (0, 1) and distortion threshold d ∈ (0, dmax ),
The proof of Lemma 4 is presented in Appendix F-E below.
we have
The proof of Theorem 7 (using the random coding bound (79)
VU (d) = VU (d). (76) and Lemma 4) is presented in Appendix E below.

Specifically, we have the following converse and achievabil-


ity. F. The Connection between Lossy AEP and Parameter Estima-
tion
Theorem 6 (Converse). For the Gauss-Markov source with
The proof of lossy AEP in the form of Lemma 4 is technical
a > 1, for any fixed excess-distortion probability  ∈ (0, 1)
even for stationary memoryless sources. A lossy AEP for
and distortion threshold d, the minimum achievable source
stationary α-mixing processes was derived in [37, Cor. 17]. For
coding rate R(n, d, ) satisfies
stationary memoryless sources with single-letter distribution
PX , the idea in [22, Lem. 2] is to form a typical set Fn
r  
VU (d) −1 log n 1
R(n, d, ) ≥ RU (d) + Q () − +O , of source outcomes [22, Lem. 4] using the product of the
n 2n n
(77) empirical distributions [22, Eq. (270)]: PX̂ × . . . × PX̂ , where
PX̂ (x) , n1 i=1 1{xi = x} is the empirical distribution of a
Pn
where Q−1 denotes the inverse Q-function, RU (d) is the given source sequence xn1 , and then to show that the inequality
rate-distortion function given in (53), and VU (d) is the inside the bracket in (81) holds for xn1 ∈ Fnc and that the
informational dispersion given by Lemma 3 above. probability of the complement set Fnc is√at most 1/q(n), where
The converse proof is similar to that in the asymptotically p(n) = C log n + c and q(n) = K/ n [22, Lem. 2]. The
stationary case in [47, Th. 7]. See Appendix D for the details. Gauss-Markov source is not memoryless, and it is nonstationary
for a > 1. To form a typical set of source outcomes, we
Theorem 7 (Achievability). In the setting the Theorem 6, the define the following proxy random variables using the estimator
minimum achievable source coding rate R(n, d, ) satisfies âML (un1 ) in (2).
r  
VU (d) −1 1 Definition 2 (Proxy random variables). For each sequence
R(n, d, ) ≤ RU (d) + Q () + O √ .
n n log n un1 of length n generated by the Gauss-Markov source, define
(78) the proxy random variable X̂1n as an n-dimensional Gaussian
random vector with independent coordinates, each of which
It is straightforward that (76) follows from Theorems 6 and 7. 2
follows the distribution N (0, σ̂n,i ) with
Central to the achievability proof of Theorem 7 is the random
coding bound. Specifically, direct application of [22, Cor. 11] 2
σ̂n,1 , σ 2 âML (un1 )2n , (84)
implies that there exists an (n, M, d, ) code such that 2
2 σ
σ̂n,i , , 2 ≤ i ≤ n,
 ≤ inf E exp −M · PV1n (B(U1n , d)) ,
 

(79) 1+ âML (un1 )2 − 2âML (un1 ) cos n+1
PV n
1
(85)
where the infimization is over all random variables defined on
where âML (un1 ) is in (2) above.
Rn and B(un1 , d)) denotes the distortion d-ball around un1 :
Remark 6. The proxy random variable in Definition 2 differs
B(un1 , d)) , {z1n ∈ Rn : d(un1 , z1n ) ≤ d} . (80)
from that in [47, Eq. (119)] for the stationary case in the
2
To obtain the achievability in (78) from (79), we need to behavior of the largest variance σ̂n,1 . For each realization un1 ,
bound from below the probability of the distortion d-ball in we construct the Gaussian random vector X̂1n according to (84)-
terms of the informational dispersion. This connection is made (85), which is a proxy to the decorrelation X1n in (61) above.

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
10

The variances of X̂i and Xi are very close due to the closeness Corollary 1 is essential to the proof of Theorem 8. See the
of âML (un1 ) to a (Corollary 1). details in Appendix F-C.
Remark 7. Since the proxy random variable X̂1n depends on Let E denote the event inside the square bracket in (81).
the realization of U1n , Definition 2 defines the joint distribution Then, we prove Lemma 4 by intersecting E with the typical set
of (X1n , X̂1n ), where X1n is the decorrelation of U1n in (61) T (n, p) and the complement T (n, p)c , respectively, and then
above. bounding the probability of the two intersections separately.
See Appendix F-E for the details.
The following convex optimization problem will be instru-
mental: for two generic random vectors An1 and B1n with V. D ISCUSSION
distributions PAn1 and PB1n , respectively, define
A. Stationary and Nonstationary Gauss-Markov Processes
1
R(An1 , B1n , d) , inf D(PF1n |An1 ||PB1n |PAn1 ), It took several decades [13–17] to completely understand
PF n |An :
1 1
n the difference in rate-distortion functions between stationary
E[d(An n
1 ,F1 )]≤d
and nonstationary Gaussian autoregressive sources. We briefly
(86)
summarize this subtle difference here to make the point that
where D(PF1n |An1 ||PB1n |PAn1 ) is the conditional relative entropy. generalizing results from the stationary case to the nonstationary
See Appendix F-B for detailed discussions on this optimization one is natural but nontrivial.
problem. Since det(F) = 1, the eigenvalues µn,i ’s of F0 F satisfy
For each realization un1 (equivalently, each xn1 = S0 un1 with Yn
the n × n matrix S defined in the text above (60)), we define µn,i = 1. (93)
n random variables mi (un1 ) , i = 1, . . . , n as follows. i=1
n n ?n
• Let X1 the decorrelation of U1 in (61) above. Let Y1 be Using (93), we can equivalently rewrite (50) as
the random variable that attains the infimum in RX1n (d). n 2
!
n n
• For each u1 , choose A1 in (86) to be the proxy random
1X 1 σn,i
RU1n (d) = max 0, log , (94)
variable X̂1n , and let B1n to be Y1?n . Let F̂1?n be the random n i=1 2 θn
variable that attains the infimum in R(X̂1n , Y1?n , d). 2
where θn > 0 is in (51) and σn,i ’s are in (60). Both (50)
Then, for each i = 1, . . . , n, define and (94) are valid expressions for the n-th order rate-distortion
(87) function RU1 (d), regardless of whether the source is stationary
h i
n
mi (un1 ) , E (F̂i? − xi )2 |X̂i = xi .
or nonstationary. The classical Kolmogorov reverse waterfilling
Denote result [13, Eq. (18)], obtained by taking the limit in (94),
r implies that the rate-distortion function of the stationary Gauss-
log log n
ηn , . (88) Markov source (0 < a < 1) is given by (the subscript K stands
n for Kolmogorov)
The typical set for the Gauss-Markov source is then defined Z π
σ2
 
1 1
as follows. RK (d) = max 0, log dw, (95)
2π −π 2 θg(w)
Definition 3 (Typical set). For any d ∈ (0, dmax ), n ≥ 2
where θ > 0 is given in (54) and g(w) is given in (55).
and a constant p > 0, define T (n, p) to be the set of vectors
While (53) and (54) are valid for both stationary and non-
un1 ∈ Rn that satisfy the following conditions:
stationary cases, Hashimoto and Arimoto [16] noticed in
|âML (un1 ) − a| ≤ ηn , (89) 1980 that (95) is incorrect for the nonstationary Gaussian

n ! k
autoregressive source. The reason is the different asymptotic
1 X x2i

0
(90) behaviors of the eigenvalues µn,i ’s of F F (52) in the stationary


n 2 − (2k − 1)!! ≤ 2, k = 1, 2, 3,
i=1 σn,i and nonstationary cases: while in the stationary case, the

n
spectrum is bounded away from zero, in the nonstationary
1 X
n
mi (u1 ) − d ≤ pηn , (91) case, the smallest eigenvalue µn,1 approaches 0, causing a

n

i=1
discontinuity. By treating that smallest eigenvalue in a special
way, Hashimoto and Arimoto [16, Th. 2] showed that
where xn1 = S0 un1 is the decorrelation (61) and σn,i 2
’s are
defined in (60) above. RHA (d) = RK (d) + log(max(a, 1)) (96)

The typical set in Definition 3 is in the same form as that is the correct rate-distortion function for both stationary and
in the stationary case [47, Def. 2], but the definitions of proxy nonstationary Gauss-Markov sources, where the subscript HA
random variables and the analyses are different. stands for the authors of [16]. For the general higher-order
Gaussian autoregressive source, the correction term needed
Theorem 8. For any d ∈ (0, dmax ), there exists a constant in (96) depends on the unstable roots of the characteristic
p > 0 such that for the probability that the Gauss-Markov polynomial of the source, see [16, Th. 2] for the details. In 2008,
source produces a typical sequence satisfies Gray and Hashimoto [17] showed the equivalence between
RHA (d) in (96), obtained by taking a limit in (94), and Gray’s
 
1
P [U1n ∈ T (n, p)] ≥ 1 − Θ . (92) result RU (d) in (53), obtained by taking a limit in (50).
log n

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
11

The tool that allows one to take limits in (94) and (50) is the using the function (102) fails to yield the correct rate-
following theorem on the asymptotic eigenvalue distribution of distortion function for nonstationary sources due to the
the almost Toeplitz matrix F0 F, which is the (rescaled) inverse discontinuity of FK (t) at 0.
of the covariance matrix of U1n . Denote Gray [15, Eq. (22)] and Hashimoto and Arimoto [16]
circumvent the above difficulty in two different ways,
α, min g(w) = (a − 1)2 , (97) which lead to (53) and (96), respectively. Gray [15] applied
w∈[−π,π]
Theorem 9 on (50) using the function
and
σ2
 
1
β, max g(w) = (a + 1)2 . (98) FG (t) = log max t, , (103)
w∈[−π,π] 2 θ
Gray [65, Th. 2.4] generalized the result of Grenander and which is indeed continuous at 0, while Hashimoto and
Szegö [66, Th. in Sec. 5.2] on the asymptotic eigenvalue Arimoto [16, Th. 2] still use the function FK (t) but
distribution of Toeplitz forms to that of matrices that are consider µn,1 and µn,i , i ≥ 2 separately:
asymptotically equivalent to Toeplitz forms, see [65, Chap. 1X
n
1
2.3] for the details. Define FK (µn,i ) + FK (µn,1 ), (104)
n i=2 n
α0 , inf µn,i . (99)
n∈N, i∈[n] which in the limit yields (96) by plugging µn,1 = Θ(a−2n )
Theorem 9 (Gray [15, Eq. (19)], Hashimoto and Arimoto [16, into (102).
Th. 1]). For any continuous function F (t) over the interval
B. New Results on the Spectrum of the Covariance Matrix
t ∈ [α0 , β] , (100)
The following result on the scaling of the eigenvalues
the eigenvalues µn,i ’s of F0 F with F in (52) satisfy µn,i ’s refines [16, Lemma], and its proof is presented in
n Z π Appendix B-D.
1X 1
lim F (µn,i ) = F (g(w)) dw, (101) Lemma 5. Fix a > 1. For any i = 2, . . . , n, the eigenvalues
n→∞ n 2π −π
i=1 of F0 F are bounded as
where g(w) is defined in (55).
ξn−1,i−1 ≤ µn,i ≤ ξn,i , (105)
The eigenvalues µn,i ’s behave quite differently in the
where
following three cases, leading the subtle difference in rate-  
distortion functions. iπ
ξn,i , 1 + a2 − 2a cos . (106)
1) For the stationary case a ∈ (0, 1), it can be easily n+1
shown [47, Eq. (71)] that α0 = α > 0 and all eigenvalues The smallest eigenvalue is bounded as
µn,i ’s lie in between α and β. Kolmogorov’s formula (95) c2 1 c1
is obtained by applying Theorem 9 to (94) using the 2 log a + ≥ − log µn,1 ≥ 2 log a − , (107)
n n n
function
where c1 > 0 and c2 are constants given by
σ2
 
1 aπ
FK (t) , max 0, log , (102) c1 = 2 log(a + 1) + 2 , (108)
2 θt a −1
where θ > 0 is given by (54). a 2aπ
c2 = 2 log 2 + . (109)
2) For unit-root processes / Wiener processes a = 1, closed- a − 1 a2 − 1
form expressions of µn,i ’s are given by Berger [14, Eq. Remark 8. The constant c1 in (108) is positive, while c2 in (109)
(2)]. Those results imply that the smallest eigenvalue µn,1 can be positive, zero or negative, depending on the value of
is of order Θ n12 and α0 = α = 0. Using the same a > 1. Lemma 5 indicates that a−2n is a good approximation
function as in (102), Berger obtained the rate-distortion to µn,1 . Using (105)–(106), we deduce that for i = 2, . . . , n,
functions for the Wiener processes a = 1 [14, Eq. 4] 1 .
3) For the nonstationary case a > 1, we have α0 = 0 < α, µn,i ∈ [α, β]. (110)
−2n
the smallest eigenvalue µn,1 is of order Θ(a ) and Based on Lemma 5, we obtain a nonasymptotic version of
the other n − 1 eigenvalues lie in between α and β. Theorem 9, which is useful in the analysis of the dispersion, in
This behavior of eigenvalues was shown by Hashimoto particular, in deriving Proposition 1 in Appendix C-A below.
and Arimoto [16, Lemma] for higher-order Gaussian
autoregressive sources, and we will show a refined version Theorem 10. Fix any a > 1. For any bounded, L-Lipschitz
for the Gauss-Markov source in Lemma 5 below. As and nondecreasing function (or nonincreasing function) F (t)
pointed out by [16, Th. 1], an application of Theorem 9 over the interval (100) and any n ≥ 1, the eigenvalues µn,i ’s
of F0 F satisfy
1 To be precise, although the rate-distortion function for the Wiener process
1 X n Z π
is correct in [14, Eq. 4], the proof there is not rigorous since in this case 1 C
L
F (µ ) − F (g(w)) dw ≤ , (111)

0
α = α = 0 but FK (t) is not continuous at t = 0 as pointed out in [17, Eq. n,i
n 2π n

i=1 −π
(23)]. Therefore, the limit leading to [14, Eq. 4] needs extra justifications.

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
12

where g(w) is defined in (55) and CL > 0 is a constant that with 1 degree of freedom, we obtain
depends on L and the maximum absolute value of F .  2

E Wn−1 · exp(α1 Un−1 )
The proof of Theorem 10 is in Appendix B-E. 1  2

=√ 2
E Wn−2 · exp(α2 Un−2 ) . (118)
1 − 2σ α1
VI. C ONCLUSION This is where our method diverges from Rantzer [10, Lem. 5],
In this paper, we obtain nonasymptotic (Theorem 1) and who chooses s = ση2 and bounds α2 ≤ α1 (due to Property A4
asymptotic (Theorem 2) bounds on the estimation error of in Appendix A-B below) in (118). Instead, by conditioning on
the maximum likelihood estimator of the parameter a of Fn−3 in (118) and repeating the above recursion for another
the nonstationary scalar Gauss-Markov process. Numerical n − 2 times, we compute E [Wn ] exactly using the sequence
simulations in Fig. 1 confirm the tightness of our estimation {α` }:
error bounds compared to previous works. As an application of ( n−1
)
1X 2
the estimation error bound (Corollary 1), we find the dispersion E [Wn ] = exp − log(1 − 2σ α` ) . (119)
for lossy compression of the nonstationary Gauss-Markov 2
`=1
sources (Theorems 6 and 7). Future research directions include If s 6∈ Sn+ ,
then by the definition of the set Sn+ we have
generalizing the error exponent bounds in this paper, applicable E [Wn ] = +∞. Therefore,
to identification of scalar dynamical systems, to vector systems,
and finding the dispersion of the Wiener process. inf E [Wn ] = inf+ E [Wn ] . (120)
s>0 s∈Sn

A PPENDIX A
A. Proof of Theorem 1 B. Properties of the Sequence α`
Proof. We present the proof of (24). The proof of (25) is We derive several important elementary properties about the
similar and is omitted. For any n ≥ 2, denote by Fn the σ- sequences α` and β` . First, we consider α` . We find the two
algebra generated by Z1 , . . . , Zn . For any s > 0, η > 0, and fixed points r1 < r2 of the recursive relation (19) by solving
n ≥ 2, we denote Wn the following random variable the following quadratic equation in x:
( n−1 )
X 2σ 2 x2 + [a2 + 2σ 2 s(a + η) − 1]x + α1 = 0. (121)
Wn , exp s (Ui Zi+1 − ηUi2 ) . (112)
i=1 Property A1: For any s > 0 and η > 0, (121) has two roots
By the Chernoff bound, we have r1 < r2 , and r1 < 0. The two roots r1 and r2 are given by

−[a2 + 2σ 2 (a + η)s − 1] − ∆
P [âML (U1n ) − a ≥ η] ≤ inf E [Wn ] . (113) r1 = , (122)
s>0 4σ 2 √
2 2
To compute E [Wn ], we first condition on Fn−1 . Since Zn is −[a + 2σ (a + η)s − 1] + ∆
r2 = , (123)
the only term in Wn that does not belong to Fn−1 , we have 4σ 2
where ∆ denotes the discriminant of (121):
E [Wn ]
 2
=E Wn−1 · E[exp(s(Un−1 Zn − ηUn−1 ))|Fn−1 ]

(114) ∆ = 4σ 4 [(a + η)2 − 1]s2 +
 2
=E Wn−1 · exp(α1 Un−1

) , (115) 4σ 2 [(a + η)(a2 − 1) + 2η]s + (a2 − 1)2 . (124)
Proof. Note that the discriminant ∆ satisfies
where α1 is the deterministic function of s and η defined in (18),
and (115) follows from the moment generating function of Zn . ∆ > (a2 − 1)2 > 0, (125)
2
To obtain a recursion, we condition on Fn−2 . Since Un−1 and
2 where we used a > 1. Then, (122) implies r1 < 0.
Un−2 Zn−1 are the only two terms in Wn−1 · exp(α1 Un−1 )
that do not belong to Fn−2 , we use the relation Un−1 = Property A2: For σ2η2 6= s > 0 and η > 0, the sequence
α` −r1
aUn−2 + Zn−1 and we complete squares in Zn−1 to obtain α` −r2 is a geometric sequence with common ratio
2
Wn−1 · exp(α1 Un−1 ) [a2 + 2σ 2 s(a + η)] + 2σ 2 r1
q, . (126)
n 
s
2 [a2 + 2σ 2 s(a + η)] + 2σ 2 r2
=Wn−2 · exp α1 Zn−1 + (a + )Un−2 +
2α1 Furthermore,
 2
s o
q ∈ (0, 1), (127)
(a2 α1 − sη)Un−2
2
− α1 a + 2
Un−2 . (116)
2α1
and it follows immediately that
Furthermore, using the formula for the moment generating α1 −r1 `−1
(r1 − r2 ) α q
function of the noncentral χ2 -distributed random variable α` = r1 + 1 −r2
, (128)
1 −r1 `−1
   2 1− α α1 −r2 q
s r2 − r1
Zn−1 + a + Un−2 (117) = r2 + α1 −r1 `−1 . (129)
2α1 −1
α1 −r2 q

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
13

Proof. Using the recursion (19) and the fact that r1 and r2 Proof. We verify that Γ > 0 for any η > 0 and s ∈ Iη .
` −r1
are the fixed points of (19), one can verify that αα` −r2 is a The reason that Γ > 0 is not as obvious as (125) is due to
geometric sequence with common ratio q given by (126). The the subtle difference between (124) and (135) in the negative
relation (127) is verified by direct computations using (122) sign of a. Note that Γ in (135) is a quadratic equation in s
and (123). and the discriminant of Γ is given by (with some elementary
2η manipulations)
Property A3: For any σ2 6= s > 0 and η > 0, we have
γ = 16σ 4 (2aη − a2 + 1)2 ≥ 0. (136)
lim α` = r1 . (130)
`→∞ 2
−1
2η Hence, in general, (135) has two roots (distinct when η 6= a 2a )
For s = σ2 , we have α` = 0 = r2 > r1 , ∀` ≥ 1. and Γ could be positive or negative. However, an analysis of
Proof. The limit (130) follows from (127) and (128). Plugging two cases (−a + η)2 − 1 ≥ 0 and (−a + η)2 − 1 < 0 reveals
s = σ2η2 into (18) yields α1 = 0, which implies by (19) that that Γ > 0 for any η > 0 and s ∈ Iη . Therefore, (132) has
α` = 0 for ` ≥ 1. two distinct roots t1 < t2 given in (133) and (134) above.
β1
From (132), we have t1 t2 = 2σ 2 , which is negative for s ∈ Iη .
Property A4: For any s ∈ Iη , we have α` < 0 and α` Therefore, we have t1 < 0 < t2 .
decreases to r1 geometrically. For s > σ2η2 , (130) still holds,
β` −t1
but the convergence is not monotone: there exists an `? ≥ 1 Property B2: For any η > 0 and s ∈ Iη , the sequence β` −t2
such that α` > 0 and increases to α`? for 1 ≤ ` ≤ `? ; and is a geometric sequence with common ratio
α` < 0 and increases to r1 for ` > `? .
[a2 + 2σ 2 s(−a + η)] + 2σ 2 t1
p, . (137)
Proof. Due to (129), the monotonicity of α` depends on the [a2 + 2σ 2 s(−a + η)] + 2σ 2 t2
1 −r1
signs of r2 − r1 and α
α1 −r2 . Note that r2 − r1 > 0 by Property In addition, for any η > 0 and s ∈ Iη , we also have
A1. Plugging x = α1 into (121), we have
p ∈ (0, 1). (138)
(α1 − r1 )(α1 − r2 ) = (a + σ 2 s)2 α1 . (131)
It follows immediately that
Since for s ∈ Iη we have α1 < 0 by (18), (131) implies
1 −r1
that α
α1 −r2 < 0 for any s ∈ Iη . This immediately implies
−t1 `−1
(t1 − t2 ) ββ11 −t p
2
that α` decreases to r1 due to (128) and (129). Therefore, β` = t1 + −t1 `−1
, (139)
1 − ββ11 −t p
α` ≤ α1 < 0, ∀` ≥ 1. For any s > σ2η2 , we have α1 > 0 2

and α 1 −r1 t2 − t1
α1 −r2 > 0. In fact, since r1 < 0, we have α1 > r2 , = t2 + β1 −t1 . (140)
1 −r1 `−1 − 1
which implies α
α1 −r2 > 1. Therefore, the conclusion follows β1 −t2 p
from (129).
Proof. Similar to that of Property A2 above for α` .
Property A5: For any η > 0, the root r1 in (122) is a
Property B3: For any η > 0 and s ∈ Iη , we have β` ≤
decreasing function in s > 0.
β1 < 0 and β` decreases to t1 geometrically:
Proof. Direct computations using (122), (124) and the assump-
tion that a > 1. lim β` = t1 . (141)
`→∞

Proof. This can be verified using (139) and (140) by noticing


C. Properties of the Sequence β` that t2 − t1 > 0 and for s ∈ Iη ,
The sequence β` is analyzed similarly, although it is slightly
(β1 − t1 )(β2 − t2 ) = (a − σ 2 s)2 β1 < 0. (142)
more involved than α` . We only consider 0 < s ≤ σ2η2 in the
rest of this section. We find the two fixed points t1 < t2 of
the recursive relation (21) by solving the following quadratic
equation in x: Property B4: For any constant a > 1, recall the two
thresholds η1 and η2 , defined in (37) and (38) in Section III-A
2σ 2 x2 + [a2 + 2σ 2 s(−a + η) − 1]x + β1 = 0. (132) above, respectively. Then,
Property B1: For s = σ2η2 , we have β` = 0, ∀` ≥ 1. For any 1) When 0 < η ≤ η1 , the root t1 in (133) is an increasing
η > 0 and s ∈ Iη , (132) has two distinct roots t1 < 0 < t2 , function in s ∈ Iη .
given by 2) When η ≥ η2 , t1 is a decreasing function in s ∈ Iη .
√ 3) When η1 < η < η2 , t1 is a decreasing function in s ∈
−[a2 + 2σ 2 s(−a + η) − 1] − Γ (0, s? ); and an increasing function in s ∈ s? , σ2η2 , where
t1 = , (133)
4σ 2 √ s? is the unique solution in the interval Iη to
−[a2 + 2σ 2 s(−a + η) − 1] + Γ
t2 = , (134) dt1
4σ 2 = 0, (143)
ds s=s?

where the discriminant Γ of (132) is
and s? is given by
Γ = 4σ 4 [(−a + η)2 − 1]s2 +
aη(η − η1 )
4σ 2 [(−a + η)(a2 − 1) + 2η]s + (a2 − 1)2 . (135) s? , . (144)
σ 2 (1 − (η − a)2 )

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
14

Proof. Using (133) and (135), we compute the derivatives of To show the other direction, it suffices to show that for any
t1 as follows: s > σ2η2 , there exists n ∈ N such that αn ≥ 2σ1 2 . Let `? be the
( integer defined in Property A4 above. Then, `? satisfies the
dt1 η−a 1
=− −√ σ 2 [(−a + η)2 − 1]s following two conditions
ds 2 Γ α1 − r1 `? −1
) q ≥ 1, (151)
1 2
α1 − r2
+ [(−a + η)(a − 1) + 2η] , (145) α1 − r1 `?
2 q < 1. (152)
α1 − r2
d2 t 1 σ 2 (2aη − a2 + 1)2
= 3 ≥ 0. (146) We show that α`? ≥ 2σ1 2 , which would complete the proof.
ds2 Γ2
Due to r2 − r1 > 0, using (129) and (152), we have
To simplify notations, denote by L(s) the first derivative: r2 − r1
α`? ≥ r2 + 1 (153)
dt1
L(s) , (s). (147) q −1
ds r2 − r1 q
From (145), we have = (154)
1−q
−a2 (η − η1 ) 1
L(0) = , (148) = , (155)
a2 − 1 2σ 2
and where (155) 2 is by plugging (122), (123) and (126) into (154).
  Finally, to show (31), for any 0 < s ≤ 2η/σ 2 , we have
2η −
L = β` ≤ 0 < 2σ1 2 , ∀` ≥ 1, hence 0, 2η/σ 2 ⊆ S∞ . The other
σ2 direction cannot hold since there are many counterexamples,
( −2(2η−a)(η−η 0
2 )(η−η2 )
η ∈ 0, a−1 ∪ a+1 e.g., a = 1.2, σ 2 = 1, η = 0.15 and s = 0.35 > σ2η2 , where the
 
(a−2η)2 −1 , 2 2 , +∞
η
η ∈ a−1 a+1
 sequence β` increases monotonically to t1 ≈ 0.0411 < 2σ1 2 .
1−(a−2η)2 , 2 , 2 , − 2η

Hence, in this case, 0.35 ∈ S∞ but 0.35 6∈ 0, σ2 .
(149)
where η20 is given by E. Proof of Theorem 2

3a −
+8 a2 Proof. Theorem 1 and Lemma 1 imply that for any s ∈ Iη ,
η20 , . (150) n−1
4 1 X
+
Since L(s) is an increasing function in s due to (146), to lim inf P (n, a, η) ≥ lim log(1 − 2σ 2 α` ). (156)
n→∞ n→∞ 2n
`=1
determine the monotonicity of t1 , we only need to consider
the following three cases. Recall that α ` depends on s. By (130), the continuity of the
a) When L(0) ≥ 0, or equivalently, 0 < η ≤ η1 , we have function x →
7 log(1 − x) and the Cesàro mean convergence,
L(s) ≥ 0 for any s ∈ Iη . Hence, t1 is an increasing function we have
in s. 1

 lim inf P + (n, a, η) ≥ log(1 − 2σ 2 r1 ), (157)
b) When L σ2 ≤ 0, we have L(s) ≤ 0 for any s ∈ Iη . n→∞ 2
Hence, t1 is a decreasing function in s. We now show that  where r1 depends on s via (122). Since (157) holds for any
L σ2η2 ≤ 0 isequivalent to η ≥ η2 . When η ∈ a−1 2 , a+1
2 ,
s ∈ Iη , using Property A5 in Appendix A-B above and
we have L σ2η2 > 0 by (149) and η > 0. When η ∈ 0, a−1 2 ∪ supremizing (157) over s ∈ Iη , we obtain (33). Specifically,
a+1 2η the supremum of (157) over s ∈ Iη is achieved in the limit
2 , +∞ , it is easy to see from (149) that L σ2 ≤ 0 is
2 2
equivalent to η ∈ [η20 , a/2] ∪ [η2 , +∞). Hence, the equivalent of s going to the right end point 2η/σ . Plugging s = 2η/σ
2η into (122), we obtain the corresponding value for r1 :
condition for L σ2 ≤ 0 is η ∈ [η  2 , +∞).
c) When L(0) < 0 and L σ2η2 > 0, or equivalently, η ∈ (a + 2η)2 − 1
(η1 , η2 ), solving (143) using (145) yields (144). Since L(s) is − , (158)
2σ 2
?
monotonically increasing due to (146), we know that s given which is further substituted into (157) to yield (33).
by (144) is the unique solution to (143) in Iη , and L(s) ≤ 0 Similarly, to show (34), using Property B3 in Appendix A-C
for s ∈ (0, s? ] and L(s) > 0 for s ∈ (s? , 2η/σ 2 ). above, we have
1
D. Proof of Lemma 1 lim inf P − (n, a, η) ≥ sup log(1 − 2σ 2 t1 ). (159)
n→∞ s∈Iη 2
Proof. We first show the monotone decreasing property. The Then, by Property B4 in Appendix A-C above, the supermizer
+
set Sn+1 contains all s > 0 such that a1 , ..., an , an+1 are all s0 in (159) is given by
less than 1/2σ 2 , while the set Sn+ contains all s > 0 such 
that a1 , ..., an are all less than 1/2σ 2 , hence Sn+1 +
⊆ Sn+ . The 0,
 0 < η ≤ η1
− 0
same argument yields the conclusion for Sn . s = s? , η1 < η < η2 (160)
+
We then prove that S∞ = 0, 2η/σ 2 . Property A4 above  2η

σ 2 , η ≥ η2 ,
in Appendix A-B implies that for any 0 < s ≤ 2η/σ 2 , we
have α` ≤ 0 < 2σ1 2 . Hence 0, 2η/σ 2 ⊆ Sn+ for any n ≥ 1.

2 It is pretty amazing that (155) is in fact an equality.

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
15

where s? is given by (144). Plugging (160) into (159) where r1 , r2 and q in this regime depend on ηn with order
yields (34). dependence given in TABLE I above. Using the inequality
x
Finally, the bound (35) follows from (33) and (34), since log(1 − x) ≥ x−1 , ∀x ∈ (0, 1), we have
P [|âML (U1n ) − a| > η] P + (n, a, ηn )
=P [(âML (U1n ) − a) > η] + P [(âML (U1n ) − a) < −η] (161) n−1
≥ log(1 − 2σ 2 r1 )+
2n
and n−1
1 X −1
1−2σ 2 r2 1−2σ 2 r1
. (169)
lim inf P (n, a, η) 2n 1 
`=1 2σ 2 (r2 −r1 )
+ 2σ 2 (r2 −r1 ) ·

n→∞ α −r
− α1 −r1 q `−1
= lim inf min P + (n, a, η), P − (n, a, η)
 1 2
(162)
n→∞ 2

Since 1 − 2σ r2 > 0 due to (123), we can further bound
≥I (a, η). (163) P + (n, a, ηn ) as
P + (n, a, ηn )
n−1
≥ log(1 − 2σ 2 r1 )−
F. Proof of Theorem 3 2n !
n−1
1 X `−1 2σ 2 (r2 − r1 )
 
Proof. For any sequence ηn , the proof of Theorem 1 in α1 − r1
q · − (170)
Appendix A-A above remains valid with α` replaced by αn,` n 1 − 2σ 2 r1 α1 − r2
`=1
defined in (40) in Section III-C above. We present the proof n−1
of (42), and omit that of (43), which is similar. In this regime, ≥ log(1 − 2σ 2 r1 )−
2n
for each n ≥ 1, the proof of Lemma 1 implies that
2σ 2 (r2 − r1 )
 
1 α1 − r1

2ηn
 · − (171)
0, 2 ⊆ Sn+ . (164) n (1 − 2σ 2 r1 )(1 − q) α1 − r2
σ n−1 2 1
= log(1 − 2σ r1 ) − , (172)
Then, in (24), we choose 2n nΘ(ηn2 )
ηn where in the last step we used the results in TABLE I. Due to
∈ Sn+ .
s = sn = (165)
σ2 the assumption (41) on ηn and (167), we obtain (42).
First, using (122)-(123), (126) and the choice (165), we can
determine the asymptotic behavior of quantities involved in G. Proof of Theorem 4
determining αn,` in (128) and (129) (with η replaced by ηn Proof. We point out the proof changes in generalizing our
and s replaced by sn ), summarized in TABLE I. results to the sub-Gaussian case. There are two changes to
α1 r1 r2 r2 − r1 q −α 1 −r1 be made in the proof of Theorem 1 in Appendix A-A above,
α1 −r2
2)
−Θ(ηn −Θ(1) 2)
Θ(ηn Θ(1) Θ(1) Θ(1/ηn2) the equality from (114) to (115) is replaced by ≤ since Zn
is σ-sub-Gaussian; the equality in (118) is replaced by ≤ due
TABLE I: Order dependence in ηn of the quantities involved to Lemma 2. The rest of the proof for Theorem 1 remains
in determining αn,` in (128) and (129). the same for the sub-Gaussian case. Since Lemma 1 and
Theorem 2, 3 depend only on the properties of the sequences
We make two remarks before proceeding further. It can be α` and β` and not on the distribution of Zn ’s as long as
easily verified from (126) that the common ratio q is a constant Theorem 1 holds, their proofs remain exactly the same for the
belonging to (0, 1) and sub-Gaussian case.
1
lim q = ∈ (0, 1). (166)
ηn →0 a2 A PPENDIX B
Hence, for all large n, q is bounded by positive constants A. Proof of Lemma 3
between 0 and 1. Besides, from (122), we have Proof. In view of (62), we take the variances on both sides
2
a −1 of (59) to obtain
lim r1 = − . (167)  !2 
ηn →0 2σ 2 1 X
n 2
σn,i
Second, from (128), (24) and the choice (165), we have VU (d) = lim sup min 1, . (173)
n→∞ 2n i=1 θn
P + (n, a, ηn )
Note that limn→∞ θn = θ, where θ > 0 is the water level
n−1
≥ log(1 − 2σ 2 r1 )+ given by (54). Applying Theorem 9 in Section V-A to (173)
2n     with the function
α1 −r1 `−1
1
n−1
2σ (r − r ) 2 − α1 −r2 q
"  2 #
2 1 1 σ2
X
log 1 − ·   , F (t) , min 1, , (174)
2n 1 − 2σ 2 r1 1 + − α1 −r1
q `−1 2 θt
`=1 α1 −r2
(168) which is continuous at t = 0, we obtain (65).

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
16

B. An Integral Plugging (106) into (183) and then taking the limit, we obtain
We present the computation of an interesting integral that is lim Ln = lim Rn
useful to obtain the value of RU (dmax ). n→∞ n→∞
1 π
Z
Lemma 6. For any constant r ∈ [−1, 1], it holds that = log(1 + a2 − 2a cos(w)) dw (184)
π 0
Z π √ √
1+r+ 1−r = 2 log a, (185)
log(1 − r cos(w)) dw = 4π log .
−π 2 where the last equality is due to Lemma 6 in Appendix B-B
(175) above. In the rest of the proof, we obtain the following
Proof. Denote refinement of (185): for any n ≥ 1,
Z π c1
Rn ≥ 2 log a − , (186)
I(r) , log(1 − r cos(w)) dw. (176) n
−π c2
Ln ≤ 2 log a + , (187)
By Leibniz’s rule for differentiation under the integral sign, n
we have where c1 and c2 are the constants given by (108) and (109)
dI(r)
Z π
∂ in Lemma 5, respectively. Then, (107) will follow directly
= log(1 − r cos(w)) dw (177) from (182), (186) and (187).
dr −π ∂r
Z π The proofs of the refinements (186) and (187) are similar,
cos w
= −2 · dw. (178) and both are based on the elementary relations between
0 1 − r cos w Riemann sums and their corresponding integrals. We present
With the change of variable u = tan (w/2) and partial-fraction the proof of (186), and omit that of (187). Note that the
decomposition, we obtain the closed-form solution to the function h(w) , π1 log(1 + a2 − 2a cos(w)) is an increasing
integral in (178): function in w ∈ [0, π], and its derivative is bounded above
by M1 , π(a2a2 −1) for any fixed a > 1. Therefore, from (106)
dI(r) 2π 2π
= − √ . (179) and (183), we have
dr r r 1 − r2 Z π
M1 π 2

Rn + 1 log(a + 1)2 − 1

It can be easily verified by directly taking derivatives that the log(g(w)) dw ≤ ,
n π 0 2n
right-side of (175) is indeed the antiderivative of (179).
(188)
and (186) follows immediately.
C. Derivation of RU (dmax ) in (74)
We present two ways to obtain (74). The first one is E. Proof of Theorem 10
to directly use (96) in Section V-A. For θ = θmax , we
Proof. From Lemma 5, we know that α0 = 0 < α (recall (97)
have RK (dmax ) = 0 in (95), then (74) immediately follows
and (99)). Since g(w) is an even function, we have
from (96). The second method relies on (53). For θ = θmax , Z π
observe from (53) that 1
I, F (g(w)) dw (189)
1
Z π 2π −π
RU (dmax ) = log(g(w)) dw. (180) 1 π
Z
4π −π = F (g(w)) dw. (190)
π 0
Then, computing the integral (180) using Lemma 6 in Ap-
pendix B-B yields (74). Denote the maximum absolute value of F over the inter-
val (100) by T > 0. It is easy to check that the function
F (g(w)) is 2aL-Lipschitz since F (·) is L-Lipschitz and the
D. Proof of Lemma 5 derivative of g(w) is bounded by 2a. For the following Riemann
Proof. The bound (105) is obtained by partitioning F0 F into its sum
leading principal submatrix of order n − 1 and then applying 1X
n   

the Cauchy interlacing theorem to that partition, see [47, Lem. Sn , F g , (191)
n i=1 n
1] for details. To obtain (107), observe from (93)
n
!−1 the Lipschitz property implies that
Y
µn,1 = µn,i . (181) 2aL
|Sn − I| ≤ . (192)
i=2 πn
Combining (181) and (105) yields For i ≥ 2, rewrite (106) and (105) as
   
1 (i − 1)π iπ
Ln ≥ − log µn,1 ≥ Rn , (182) g ≤ µn,i ≤ g . (193)
n n n+1
where Denote the sum in (111) as
n n−1 n
1X 1X 1X
Ln , log ξn,i and Rn , log ξn−1,i . (183) Qn , F (µn,i ). (194)
n i=2 n i=1 n i=1

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
17

Then, separating F (µn,1 ) from Qn and applying (193), we Proof. Using the same derivation as that of (59), one can
have obtain the following representation of the d-tilted information
2T U1n (U1n , dn ) at distortion level dn :
Qn ≥ Sn − , (195)
n n 2
!
n+1 3T X min(θ, σn,i ) Xi2
Qn ≤ Sn+1 + . (196) U1n (U1n , dn ) = 2 −1 +
n n i=1
2θ σn,i
n 2
Therefore, there is a constant CL > 0 depending on L and T 1X max(θ, σn,i )
such that (111) holds. log , (203)
2 i=1 θ

A PPENDIX C where X1n is the decorrelation of U1n defined in (61). Note that
the difference between (59) and (203) is that θn is replaced by
We gather the frequently used notations in this section as θ. Using (62) and taking expectations and variances of both
follows. For any given distortion threshold d > 0, sides of (203), we arrive at
• let θ > 0 be the water level corresponding to d in the n 2
!
limiting reverse waterfilling (54); 1  n
 1 X σn,i
E U1n (U1 , dn ) = log max 1, , (204)
• for each n ≥ 1, let θn be the water level corresponding n 2n i=1 θ
to d in the n-th order reverse waterfilling (51); n 4
!
1 n 1 X σn,i
let dn be the distortion associated to the water level θ in
 
• Var U1n (U1 , dn ) = min 1, 2 . (205)
the n-th order reverse waterfilling (51). n 2n i=1 θ
For clarity, we explicitly write down the relations between d
Applying Theorem 10 in Section V-B to (204) with the func-
and θn , and dn and θ:
tion FG (t) defined in (103) yields (201). Similarly, applying
n Theorem 10 to (205) with the function (174) yields (202).
1X 2
d= min(θn , σn,i ), (197)
n i=1
n
Proposition 1 is one of the key lemmas that will be used in
1X 2 both converse and achievability proof. Proposition 1 and its
dn = min(θ, σn,i ), (198)
n i=1 proof are similar to those of [47, Eq. (95)–(96)]. The difference
is that we apply Theorem 10, which is the nonstationary version
2
where σn,i ’s are given in (60). Note that d and θ are constants of [47, Th. 4], to a different function in (204).
independent of n, while dn and θn are functions of n, and
there is no direct reverse waterfilling relation between dn and
θn . Applying Theorem 9 in Section V-A above to the function B. Approximation of the d-tilted Information
t 7→ min(θ, σ 2 /t), we have
The following proposition gives a probabilistic characteriza-
lim dn = d, (199) tion on the accuracy of approximating the d-tilted information
n→∞
U1n (U1n , d) at distortion level d using the d-tilted information
and U1n (U1n , dn ) at distortion level dn .

lim θn = θ. (200) Proposition 2. For any d ∈ (0, dmax ), there exists a constant
n→∞ τ > 0 (depending on d only) such that for all n large enough
Theorem 10 in Section V-B then implies that the convergences
1
P U1n (U1n , d) − U1n (U1n , dn ) > τ ≤ ,
 
in (199) and (200) are both in the order of 1/n. (206)
n
where dn is defined in (198).
A. Expectation and Variance of the d-tilted Information
Proposition 1. For any d ∈ (0, dmax ) and n ≥ 1, let dn be Proof. The proof in [47, App. D-B] works through for the
defined in (198) above. Then, the expectation and variance nonstationary case as well, since the proof [47, App. D-B]
of the d-tilted information U1n (U1n , dn ) at distortion level dn only relies on that the convergences in (199) and (200) are
satisfy both in the order of 1/n, which continue to hold for the
nonstationary case.
1 
E U n (U1n , dn ) − RU (d) ≤ C1 ,

n (201)
1 n Remark 9. The following high probability set is used in our

1  converse and achievability proof:
V U n (U1n , dn ) − VU (d) ≤ C2 ,

n (202)
1 n
A , U1n (U1n , d) − U1n (U1n , dn ) ≤ τ .

(207)
where RU (d) and VU (d) are the rate-distortion function
given in (53) and the informational dispersion given in (65), Proposition 2 implies that P[A] ≥ 1 − 1/n for all n large
respectively, and C1 and C2 are positive constants. enough.

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
18

A PPENDIX D where dn is define in (198) above. Note that all the randomness
C ONVERSE P ROOF in Gn is from U1n , hence we will also use the notation Gn (un1 )
Proof of Theorem 6. Using the general converse by Kostina to indicate one realization of the random variable Gn . By
and Verdú [22, Th. 7] and our established Propositions 1 and 2 bounding the deterministic part, that is, log M , of Gn using
in Appendix C, the proof is the same as the converse proof Proposition 1, we know that with probability 1,

in the asymptotically stationary case [47, Th. 7, Eq. (97)– Gn ≥ E + Q−1 (n ) V − U1n (U1n , dn ) + log(log n/2),
(109)]. For completeness, we give a proof sketch. Choosing (215)
γ = (log n)/2 and setting X to be U1n in [22, Th. 7], we know where we use E and V to denote the expectation and variance
that any (n, M, d, ) code for the Gauss-Markov source must of the informational dispersion U1n (U1n , dn ) at distortion level
satisfy dn . Define the set Gn as

 ≥ P U1n (U1n , d) ≥ log M + (log n)/2 − 1/ n. (208)
 
Gn , {un1 ∈ Rn : Gn (un1 ) < log(log n/2)} , (216)
By conditioning on the high probability set A defined in Re- Then, in view of (203), the informational dispersion
mark 9 above, we can further bound  from below by U1n (U1n , dn ) is a sum of independent random variables with
√ bounded moments, and we apply Berry-Esseen theorem to
(1 − 1/n) · P U1n (U1n , dn ) ≥ log M + (log n)/2 + τ − 1/ n.
 
obtain
(209)
C
PU1n (Gn ) ≤ n + √ . (217)
From (203), we know that U1n (U1n , dn ) is a sum of independent n
random variables with means and variances bounded by the We define one more set L as
n
rate-distortion function RU (d) and the informational dispersion  
1
VU (d), with errors in the order of 1/n due to Proposition 1. Ln , un1 ∈ Rn : log < log M − G n (un
1 ) .
Choosing M as in [47, Eq. (103)] and applying the Berry- PV1?n (B(un1 , d))
Esseen theorem to  n (U n , d ), we obtain the converse in (218)
U1 1 n
Theorem 6. Then, by the lossy AEP in Lemma 4 in Section IV-E above
and Proposition 2, we have
A PPENDIX E 1 1
PU1n (Ln ) ≥ 1 − − . (219)
ACHIEVABILITY P ROOF q(n) n
Proof of Theorem 7. With our lossy AEP for the nonstationary Finally, for any constant  ∈ (0, 1) and n large enough, we
Gauss-Markov source and Propositions 1 and 2, the proof is define n as in (212) above and set M as in (213). Then, there
similar to the one for the stationary Gauss-Markov source in [47, exists (n, M, d, 0 ) code such that
Sec. V-C]. Here, we streamline the proof. As elucidated in 0
Section IV-E above, the standard random coding argument [22,
≤E exp −M · PV1?n (B(U1n , d)) · 1{Ln } +
 
Cor. 11] implies that for any n, there exists an (n, M, d, 0 )
E exp −M · PV1?n (B(U1n , d)) · 1{Lcn }
  
code such that (220)
1 1
0 ≤ inf E exp −M · PV1n (B(U1n , d)) . ≤E exp(e−Gn ) +
   
(210) + , (221)
PV n
1
q(n) n
Choosing V1n
to be V1?n (the random variable that attains where the last inequality is due to the definition of Ln and (219).
the minimum in (48) with X1n there replaced by U1n ), the By further conditioning on Gn , we conclude that there exists
bound (210) can be relaxed to (n, M, d, 0 ) code such that
C 1 1
0 ≤ E exp −M · PV1?n (B(U1n , d)) . 0 ≤ n + √ + + (222)
 
(211) n n q(n)
To simplify notations, in the following, we denote by C a = . (223)
constant that might be different from line to line. Given any Therefore, by the choice of M in (213), the minimum
constant  ∈ (0, 1), define n as achievable source coding rate R(n, d, ) must satisfy
C 1 1 r
n ,  − √ − − , (212) VU (d) −1
n q(n) n R(n, d, ) ≤ RU (d) + Q ()+
n
where q(n) is defined in (83) above. Note that for all n large K1 log log n p(n) K2
+ +√ , (224)
enough, we have n ∈ (0, 1). We choose M as n n nq(n)
for all n large enough, where K1 > 0 is a universal constant and
p
log M , nRU (d) + nVU (d)Q−1 (n )+
K2 is a constant depending on . Here we change from Q−1 (n )
log(log n/2) + p(n) + C + τ, (213)
to Q−1 () using a Taylor expansion. Therefore, Theorem 7
where p(n) is defined in (82) and τ is from Proposition 2 follows immediately from (224) with the choices of p(n) and
above. We also define the random variable Gn as q(n) given by (82) and (83), respectively, in the lossy AEP in
Lemma 4 in Section IV-E above. We have O(·) in (78) since
Gn , log M − U1n (U1n , dn ) − p(n) − C − τ, (214) K2 could be positive or negative.

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
19

A PPENDIX F using the negative slope of R(An1 , B1n , d) w.r.t. d. This result
P ROOF OF L OSSY AEP is summarized in the following theorem.
A. Notations Theorem 11. Let A1 , . . . , An be independent random vari-
For the optimization problem R(An1 , B1n , d) in (86), the ables with
generalized tilted information defined in [22, Eq. (28)] in an1
Ai ∼ N (0, αi2 ), (233)
(a realization of An1 ) is given by
ΛB1n (an1 , δ, d) , −δnd − log E [exp(−nδd (an1 , B1n ))] , and B1 , . . . , Bn be independent random variables with
(225) Bi ∼ N (0, βi2 ), . (234)
where δ > 0 and d ∈ (0, dmax ). For properties of the For any d such that
generalized tilted information, see [22, App. D]. For clarity, n
we list the notations used throughout this section: 1X 2
0<d< (α + βi2 ), (235)
1) X1n denotes the decorrelation of U1n defined in (61); n i=1 i
2) X̂1n is the proxy random variable of X1n defined in
we have the following parametric representation for
Definition 2 in Section IV-F above;
R(An1 , B1n , d):
3) For Y1?n that achieves RX1n (d) in (48), ?n
 F̂1 is the random
vector that achieves R X̂1n , Y1?n , d ; R(An1 , B1n , d) = −λd+
n n
4) We denote λ?n the negative slope of RX1n (d) (the same 1 X 1 X λαi2
notation used in (58)): log(1 + 2λβi2 ) +
2n i=1
n i=1 1 + 2λβi2
λ?n , −R0X1n (d). (226) (236)

It is shown in [47, Lem. 5] that λ?n is related to the n-th n


1 X αi2 + βi2 (1 + 2λβi2 )
order water level θn in (51) by d= , (237)
n i=1 (1 + 2λβi2 )2
1
. λ?n = (227)
2θn where λ > 0 is the parameter.
Given any source outcome un1 , let xn1 be the decorrelation Similar results to Theorem 11 have appeared previously in
of un1 . Define λ̂n as the literature [20, 37, 42]. See [37, Example 1 and Th. 2] for
λ̂n , −R0 (X̂1n , Y1?n , d). (228) the case of n = 1. For completeness, we present a proof.

5) Comparing the definitions of d-tilted information and the Proof. Fix any d that satisfies (235), and let λ be such
generalized tilted information, one can see that [47, Eq. that (237) is satisfied. Note from (237) that d is a strictly
(18)] decreasing function in λ (unless βi = 0 for all i ∈ [n]),
hence such λ is unique. The upper bound on d in (235)
X1n (xn1 , d) = ΛY1?n (xn1 , λ?n , d). (229) guarantees that λ > 0. We first show the ≤ direction in (236).
For An1 = an1 ∈ Rn , define the conditional distribution
6) Recalling (62) and applying the reverse waterfilling
PFi |Ai (fi |ai ) as
result [67, Th. 10.3.3], we know that the coordinates
of Y1?n are independent and satisfy 2λβi2 ai βi2
 
N , . (238)
Yi? ∼ N (0, νn,i
2
), (230) 1 + 2λβi2 1 + 2λβi2
We then define the joint distribution PAn1 ,F1n as
where
n
2 2
− θn ),
Y
νn,i , max(0, σn,i (231) PAn1 ,F1n , PFi |Ai PAi . (239)
i=1
with θn > 0 given in (197).
Using (237), we can check that with such a choice of PAn1 ,F1n ,
B. Parametric Representation of the Gaussian Conditional the expected distortion between An1 and F1n equals d. The
Relative Entropy Minimization details follow.
Various aspects of the optimization problem (86) have been E[d (An1 , F1n )]
discussed in [47, Sec. II-B]. In particular, let B1?n be the =E {E[d (An1 , F1n ) |An1 ]} (240)
optimizer of RAn1 (d), then we have n
1X 
E E[(Fi − Ai )2 |Ai ]

= (241)
R(An1 , B1?n , d) = RAn1 (d), (232) n i=1
n
where RAn1 (d) is in (48). Another useful result on the optimiza- 1X βi2 αi2
tion problem (86) is the following: when the input An1 and B1n = + (242)
n i=1 1 + 2λβi2 (1 + 2λβi2 )2
are independent Gaussian random vectors, we have parametric
characterizations for the optimizer and optimal value of (86), =d, (243)

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
20

where (242) is from the relation E[(X − t)2 ] = Var[X] + satisfying (235), let λ be given by (237). Suppose that αi2 ’s
(E[X] − t)2 and (243) is due to (237). Therefore, the choice and βi2 ’s are such that both
of PF1n |An1 in (238) and (239) is feasible for the optimization
n
problem in defining R(An1 , B1n , d). Hence, 1X 1
(251)
1 n i=1 (1 + 2λβi2 )4
R(An1 , B1n , d) ≤ D(PF1n |An1 ||PB1n |PAn1 ) (244)
n
n and
1X
= E[D(PFi |Ai (·|Ai )||PBi )]. (245) n
n i=1 1 X 2βi2 (2αi2 + 1 + 2λβi2 )
(252)
n i=1 (1 + 2λβi2 )3
It is straightforward to verify that the Kullback-Leibler diver-
2
gence between two Gaussian distributions X ∼ N (µX , σX )
2
and Y ∼ N (µY , σY ) is given by are bounded by positive constants. Let Â1 , . . . , Ân be indepen-
dent random variables with
2
σX + (µX − µY )2 1 2
σX 1
D(PX ||PY ) = − log − . (246)
2
2σY 2 2
σY 2 Âi ∼ N (0, α̂i2 ). (253)
Using (246) and (238), we see that (245) equals the right-hand
Let λ̂ be such that
side of (236). To prove the other direction, we use the Donsker-
Varadhan representation of the Kullback-Leibler divergence [68, n
1 X α̂i2 + βi2 (1 + 2λ̂βi2 )
Th. 3.5]: d= . (254)
n i=1 (1 + 2λ̂βi2 )2
D(P ||Q) = sup EP [g(X)] − log EQ [exp g(X)], (247)
g Then, there is a constant C > 0 such that
where the supremum is over all functions g from the sample 2
α̂i − αi2 .

λ̂ − λ ≤ C max (255)

space to R such that both expectations in (247) are finite. Fix 1≤i≤n
any PF1n |An1 such that E[d (An1 , F1n )] ≤ d. For any An1 = an1 ,
in (247), we choose P to be PF1n |An1 (·|an1 ), Q to be PB1n and Proof. We can view (237) as an equation of the form
g to be g(f1n ) , −nλd(f1n , an1 ) for any f1n ∈ Rn , then we f (α12 , . . . , αn2 , λ) = 0. Then, by the implicit function theorem,
have we know that there exists a unique continuously differentiable
function h such that
D(PF1n |An1 (·|an1 )||PB1n ) ≥ −nλEPF n |An (·|an1 ) [d(F1n , an1 )]
1 1

− log EPBn [exp{−nλd (B1n , an1 )}]. λ = h(α12 , . . . , αn2 ), (256)


1
(248)
and
Taking expectations on both sides of (248) with respect to PAn1 ( )−1
n
and then normalizing by n, we have ∂h 1 X 2βi2 [2αi2 + βi2 (1 + 2λβi2 )] 1
= .
∂αi2 n i=1 (1 + 2λβi2 )3 n(1 + 2λβi2 )2
R(An1 , B1n , d) ≥ −λE[d (An1 , F1n )]
(257)
− EPAn log EPBn [exp{−nλd (B1n , An1 )}].
1 1
(249) Hence,
Using the formula for the moment generating function for ( n
)−1 v n
1 X 2βi2 (2αi2 + 1 + 2λβi2 )
u
noncentral χ2 distributions, we can compute u1 X 1
k∇hk2 = t .
n i=1 (1 + 2λβi2 )3 n i=1 (1 + 2λβi2 )4
2
EPBn [exp{−nλd (B1n , an1 )}]
n
1
(258)
−λa2i
 
Y 1
= exp . (250)
1 + 2λβi2 By the assumptions (251) and (252), we know that there exists
p
i=1 1 + 2λβi2
a constant C > 0 such that
Plugging (250) into (249) and using E[d (An1 , F1n )] ≤ d, we
conclude that R(An1 , B1n , d) is greater than or equal to the C
k∇hk2 ≤ √ . (259)
right-hand side of (236). n

Our next result states that for fixed βi2 ’s satisfying certain Hence, we have
mild conditions, if we change the variances from αi2 ’s to α̂i2 ’s,
λ̂ − λ ≤ k∇hk2 k(α12 , . . . , αn2 ) − (α̂12 , . . . , α̂n2 )k2 (260)

then the perturbation on the corresponding λ’s is controlled by
the perturbation on αi2 ’s. ≤ C max α̂i2 − αi2 .

(261)
1≤i≤n
Theorem 12 (Variance perturbation). Let and αi2 ’s βi2 ’s
be in (233) and (234) above, respectively. For a fixed d

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
21

C. Proof of Theorem 8 since Xi /σn,i ’s are i.i.d. standard normal random variables.
The proof is similar to [47, Th. 12], and we streamline Moreover, in view of (51), their expectations satisfy
the proof and point out the differences. We use the notations n n
1X 1X
defined in Appendix F-A above. E [m̄i (U1n )] = min(σn,i2
, θn ) = d. (267)
n i=1 n i=1
Our Corollary 1 implies that for all n large enough the
condition (89) is violated with probability at most 2e−cn for a Since Xi /σn,i ’s have bounded moments, from Berry-Esseen
constant c > log(a)/2. This is much stronger than the bound theorem, we know that there exists a constant ω > 0 such that
Θ (1/poly log n) in the stationary case [47, Th. 6]. for all n large enough
In view of (62), the random variables Xi /σn,i for i = " n
1 X
#
1, . . . , n, are distributed according to i.i.d. standard normal n
C1 C2
m̄i (U1 ) − d > ωηn ≤ + √ , (268)

P
distributions and their 2k-th moments equal to (2k − 1)!!. The n i=1 log n n
Berry-Esseen theorem implies that √ the condition (90) is violated where ηn is in (88) above and C1 , C2 are positive constants. In
with probability at most Θ (1/ n). This is the same as the the last step of the program, we control the difference between
stationary case [47, Eq. (279)–(280)]. mi (U1n ) and m̄i (U1n ). From (264)–(265), we have
We use the following procedure to show that the condi- n n
tion (91) is violated with probability at most Θ (1/ log n): 1X n 1X
n
m̄ (u
i 1 ) − mi (un1 )
• We approximate mi (u1 ) by another random variable n i=1
n i=1
m̄i (un1 ) that is easier to analyze. n
n n 1 X 2νn,i (λ̂n − λ?n )
4
• We show that (91) with mi (u1 ) replaced by m̄i (u1 ) holds = +
with probability at least 1 − Θ(1/ log n). n i=1 (1 + 2λ̂n νn,i 2 )(1 + 2λ? ν 2 )
n n,i
n n
• We then control the difference between mi (u1 ) and
1 X 2x2i νn,i 2 2
(2 + 2λ̂n νn,i 2
+ 2λ?n νn,i )(λ̂n − λ?n )
m̄i (un1 ). .
n i=1 2 )2 (1 + 2λ? ν 2 )2
(1 + 2λ̂n νn,i n n,i
To carry out the above program, we first give an expression
(269)
for mi (un1 ) by applying [47, Lem. 4] (see also the proof of
Theorem 11) on R(X̂1n , Y1?n , d). Note that X̂1n and Y1?n are For i = 1, we have νn,1 2 2

= σn,1 − θn = Θ a2n , λ̂n = Θ(1)
?
Gaussian random vectors with independent coordinates with and λn = Θ(1). This implies that the summands in (269) for
variances given by (85) and (230), respectively. Then, [47, i = 1 are both of the order O(1/n) for any x21 = O(a4n ). For
Lem. 4] implies that the optimizer PF̂ ?n |X̂ n for R(X̂1n , Y1?n , d) 2 ≤ i ≤ n, the condition (89) and the variance perturbation
1 1
satisfies result in Theorem 12 imply that every summand in (269) for
Y n i ≥ 2 is in the order of ηn . Hence, (269) is in the order of ηn .
PF̂ ?n |X̂ n =x̂n = PF̂ ? |X̂i =x̂i , (262) Finally, combining (268) and (269) implies that conditioning
1 1 1 i
i=1 on the conditions (89) and (90), we have (91) is violated with
where the conditional distributions F̂i |X̂i = x̂i are Gaussian: probability at most Θ(1/ log n).
?
!
2 2
2λ̂n νn,i x̂i νn,i
N , , (263)
1 + 2λ̂n νn,i 2 1 + 2λ̂n νn,i 2
D. Auxiliary Lemmas
2
where νn,i ’s are defined in (231) above and λ̂n is defined Lemma 7 (Lower bound on the probability of distortion balls).
in (228) above. Then, using the definition of mi (un1 ) in (87) Fix d ∈ (0, dmax ). For any n large enough and any un1 ∈
and (263) above, we obtain T (n, p) defined in Definition 3 in Section IV-F above, and γ
defined in (298) below, it holds that
2
νn,i x2i
mi (un1 ) = + , (264) K1
h   i
1+ 2
2λ̂n νn,i 2 )2
(1 + 2λ̂n νn,i P d − γ ≤ d xn1 , F̂1?n ≤ d |Xˆ1n = xn1 ≥ √ , (270)
n
where xn1 = S> un1 . The random variable mi (un1 ) in the form where K1 > 0 is a constant and F̂1?n is in Appendix F-A
of (264) is hard to analyze since we do not have a simple above.
expression for λ̂n . By replacing λ̂n with λ?n , we define another
random variable m̄i (un1 ) that turns out to be easier to analyze: The proof is in Appendix F-F.
2
νn,i x2i Lemma 8. Fix d ∈ (0, dmax ) and  ∈ (0, 1). There exists
m̄i (un1 ) , ? 2 + 2 )2 . (265) constants C and K2 > 0 such that for all n large enough,
1 + 2λn νn,i (1 + 2λ?n νn,i h   i
Plugging (227) and (231) into (265), we obtain P ΛY1?n X1n , λ̂n , d ≤ ΛY1?n (X1n , λ?n , d) + C log n
K2
!
2 2 2
min(σn,i , θn ) x i ≥1 − √ , (271)
m̄i (un1 ) = 2
2
2 − 1 + min(σn,i , θn ), n
σn,i σn,i
(266) where λ?n and λ̂n are defined in (226) and (228), respectively.
where θn is the n-th order water level in (51) and xn1 = Proof. The proof of Lemma 8 is the same as [47, Eq. (314)–
S> un1 . The random variable m̄i (U1n ) is much easier to analyze (333)] except that we strengthen the right side of [47, Eq. (322)]

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
22

to be Θ(e−cn ) for a constant c > log(a)/2 due to Corollary 1. F. Proof of Lemma 7


Proof. The proof is similar to the stationary case [47, Lem.
10]. We streamline the proof and point out the differences.
Conditioned on X̂1n = xn1 , the random variable
E. Proof of Lemma 4 n
  1 X ? 2
Using Lemmas 7 and 8 in Appendix F-D above, the proof of d xn1 , F̂1?n = F̂i − xi (282)
n i=1
Lemma 4 is almost the same as that in the stationary case [47,
Eq. (270)-(278)]. For completeness, we sketch the proof here. follows a noncentral χ2 -distribution with (at most) n degrees
We weaken the bound [22, Lem. 1] by setting PX̂ as PX̂ n and of freedom, since it is shown in [47, Eq. (282) and Lem. 4]
1
PY as PY1?n to obtain that for any xn1 ∈ Rn , that conditioned on X̂1n = xn1 , the distribution of the random
variable F̂i? − xi is given by
1
log !
PY1?n (Bd (xn1 )) −xi νi2
N , , (283)
≤ inf ΛY1?n (xn1 , λ̂n , d) + λ̂n nγ− 1 + 2λ̂n νi2 1 + 2λ̂n νi2
γ>0
where νi2 ’s are given in (231). Then, the conditional expectation
h   i
log P d − γ ≤ d xn1 , F̂1?n ≤ d|X̂1n = xn1 , (272)
is given by
n
where λ̂n in (228) depends on X1n . Let E denote the event
h   i 1X
E d xn1 , F̂1?n |X̂1n = xn1 = mi (un1 ), (284)
inside the square brackets in (81). Then, n i=1

P[E] where mi (un1 ) is defined in (87) in Section IV-Eabove. In


=P[E ∩ T (n, p)] + P[E ∩ T (n, p)c ] (273) view of (282), (284) and (91), we expect that d xn1 , F̂1?n
h concentrates around d conditioned on X̂1n = xn1 for un1 ∈
≤P ΛY1?n (X1n , λ̂n , d) ≥ ΛY1?n (X1n , λ?n , d) + p(n) − λ̂n nγ−
T (n, p). Note that the proof of Theorem 8 related to (91) is
1 i
different from the one in the stationary case, see Appendix F-C
log n + log K1 , T (n, p) + P[T (n, p)c ] (274)
h 2 i above for the details. To simplify notations, we denote the
≤P ΛY1?n (X1n , λ̂n , d) ≥ ΛY1?n (X1n , λ?n , d) + C log n + variances as
 2 
P[T (n, p)c ] (275) n ? n
Vi (x1 ) , Var F̂i − xi |X̂1 = x1 , n
(285)
1
≤ , (276)
v
u n
q(n) n
u1 X
V (x1 ) , t Vi (xn1 ). (286)
n i=1
where
• (274) is due to (272) and Lemma 7; Due to (283) and (91), we see (F̂i? − xi )2 ’s have finite second-
• From (274) to (275), we used the fact that for un1 ∈ and third- order absolute moments. That is, we have
T (n, p), λ̂n can be bounded by
V (xn1 ) = Θ(1), (287)
λ̂n − 1 ≤ B1 ,

(277) for un1
∈ T (n, p). Therefore, we can apply the Berry-Esseen

theorem. Hence,
where B1 > 0 is a constant and θ > 0 is given by (54).
h   i
P d − γ ≤ d xn1 , F̂1?n ≤ d |X̂1n = xn1
The bound (277) is obtained by the same argument as " Pn
that in the stationary case [47, Eq. (273)]; γ is chosen n(d − γ) − i=1 mi (un1 )
in (298) below; the constants ci ’s, i = 1, ...4 in (82) are =P √
nV (xn1 )
chosen as n  
1 X 2
? n
1 ≤√ F̂i − x i − m i (u1 )
c1 = B1 + , (278) nV (xn1 ) i=1
2θ Pn #
c2 = B4 , (279) nd − i=1 mi (un1 )
≤ √ |X̂1n = xn1 (288)
1 nV (xn1 )
c3 = C + , (280)
2  Pn
nd − i=1 mi (un1 )

c4 = − log K1 , (281) ≥Φ √
nV (xn1 )
Pn
n(d − γ) − i=1 mi (un1 )
 
where B4 > 0 is given in (297) below and K1 and C are 2B1
−Φ √ n − √ (289)
the same constants in Lemmas 7 and 8, respectively. nV (x1 ) n

• (276) is due to Lemma 8 and Theorem 8. nγ 0 2B1
= Φ (ξ) − √ , (290)
V (xn1 ) n

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
23

where [7] N. H. Chan and C.-Z. Wei, “Asymptotic inference for nearly
• (289) follows from the Berry-Esseen theorem; B1 > 0 is
nonstationary AR(1) processes,” The Annals of Statistics, pp.
1050–1063, Sep. 1987.
a constant, and [8] B. Bercu, F. Gamboa, and A. Rouault, “Large deviations for
Z t quadratic forms of stationary Gaussian processes,” Stochastic
1 τ2
Φ(t) , √ e− 2 dτ (291) Processes and their Applications, vol. 71, no. 1, pp. 75–90, Oct.
2π −∞ 1997.
is the cumulative distribution function of the standard [9] J. Worms, “Large and moderate deviations upper bounds for the
Gaussian autoregressive process,” Statistics & probability letters,
Gaussian distribution;
vol. 51, no. 3, pp. 235–243, Feb. 2001.
• (290) is due to the mean value theorem and [10] A. Rantzer, “Concentration bounds for single parameter adaptive
1 t2 control,” in Proceedings of 2018 IEEE Annual American Control
Φ0 (t) = √ e− 2 ; (292) Conference, Milwaukee, WI, USA, Jun. 2018, pp. 1862–1866.
2π [11] C. E. Shannon, “Coding theorems for a discrete source with a
• In (290), ξ satisfies fidelity criterion,” IRE Nat. Conv. Rec, vol. 4, no. 1, pp. 142–163,
Pn Pn Mar. 1959.
n(d − γ) − i=1 mi (un1 ) nd − i=1 mi (un1 ) [12] T. Goblick, “A coding theorem for time-discrete analog data
√ ≤ξ≤ √ .
nV (xn1 ) nV (xn1 ) sources,” IEEE Transactions on Information Theory, vol. 15,
(293) no. 3, pp. 401–407, May 1969.
[13] A. Kolmogorov, “On the Shannon theory of information trans-
By (91) and (287), we see that there is a constant B2 > 0 mission in the case of continuous signals,” IRE Transactions on
such that Information Theory, vol. 2, no. 4, pp. 102–108, Dec. 1956.
[14] T. Berger, “Information rates of Wiener processes,” IEEE
nd − ni=1 mi (un1 )
P
Transactions on Information Theory, vol. 16, no. 2, pp. 134–139,
p
√ n
≤ B2 log log n. (294)
nV (x1 ) Mar. 1970.
[15] R. M. Gray, “Information rates of autoregressive processes,”
Hence, as long as γ in (293) satisfies IEEE Transactions on Information Theory, vol. 16, no. 4, pp.
412–421, Jul. 1970.
γ ≤ O(ηn ), (295) [16] T. Hashimoto and S. Arimoto, “On the rate-distortion function
where ηn is defined in (88), there exists a constant B3 > 0 for the nonstationary Gaussian autoregressive process,” IEEE
Transactions on Information Theory, vol. 26, no. 4, pp. 478–480,
such that Jul. 1980.
p [17] R. M. Gray and T. Hashimoto, “A note on rate-distortion
|ξ| ≤ B3 log log n. (296)
functions for nonstationary Gaussian autoregressive processes,”
Let B4 > 0 be a constant such that IEEE Transactions on Information Theory, vol. 54, no. 3, pp.
1319–1322, Feb. 2008.
B32 [18] K. Marton, “Error exponent for source coding with a fidelity
B4 ≥ + 1, (297) criterion,” IEEE Transactions on Information Theory, vol. 20,
2
no. 2, pp. 197–199, Mar. 1974.
and choose γ as [19] Z. Zhang, E.-H. Yang, and V. K. Wei, “The redundancy of
(log n)B4 source coding with a fidelity criterion. I. known statistics,” IEEE
γ,, (298) Transactions on Information Theory, vol. 43, no. 1, pp. 71–91,
n Jan. 1997.
which satisfies (295). Then, plugging the [20] E.-H. Yang and Z. Zhang, “On the redundancy of lossy
bounds (287), (296), (297) and (298) into (290), we source coding with abstract alphabets,” IEEE Transactions on
Information Theory, vol. 45, no. 4, pp. 1092–1110, May 1999.
conclude that there exists a constant K1 > 0 such that (290) [21] A. Ingber and Y. Kochman, “The dispersion of lossy source
K1
is further bounded from below by √ n
. coding,” in Proceedings of 2011 IEEE Data Compression
Conference, Snowbird, UT, USA, Mar. 2011, pp. 53–62.
R EFERENCES [22] V. Kostina and S. Verdú, “Fixed-length lossy compression in the
[1] P. Tian and V. Kostina, “From parameter estimation to dispersion finite blocklength regime,” IEEE Transactions on Information
of nonstationary Gauss-Markov processes,” in Proceedings of Theory, vol. 58, no. 6, pp. 3309–3338, Jun. 2012.
2019 IEEE International Symposium on Information Theory, [23] T. Haavelmo, “The statistical implications of a system of simul-
Paris, France, Jul. 2019, pp. 2044–2048. taneous equations,” Econometrica, Journal of the Econometric
[2] H. B. Mann and A. Wald, “On the statistical treatment of linear Society, pp. 1–12, Jan. 1943.
stochastic difference equations,” Econometrica, Journal of the [24] T. Koopmans, “Serial correlation and quadratic forms in normal
Econometric Society, pp. 173–220, Jul. 1943. variables,” The Annals of Mathematical Statistics, vol. 13, no. 1,
[3] H. Rubin, “Consistency of maximum likelihood estimates in pp. 14–33, Mar. 1942.
the explosive case,” Statistical Inference in Dynamic Economic [25] J. P. Gould and C. R. Nelson, “The stochastic structure of the
Models, pp. 356–364, 1950. velocity of money,” The American Economic Review, vol. 64,
[4] J. S. White, “The limiting distribution of the serial correlation no. 3, pp. 405–418, Jun. 1974.
coefficient in the explosive case,” The Annals of Mathematical [26] D. A. Dickey and W. A. Fuller, “Distribution of the estimators
Statistics, pp. 1188–1197, Dec. 1958. for autoregressive time series with a unit root,” Journal of the
[5] T. W. Anderson, “On asymptotic distributions of estimates of American Statistical Association, vol. 74, no. 366a, pp. 427–431,
parameters of stochastic difference equations,” The Annals of Jun. 1979.
Mathematical Statistics, pp. 676–687, Sep. 1959. [27] P. Whittle, Hypothesis testing in time series analysis. Almqvist
[6] J. Rissanen and P. Caines, “The strong consistency of maximum & Wiksells boktr., 1951, vol. 4.
likelihood estimators for ARMA processes,” The Annals of [28] G. E. P. Box and G. M. Jenkins, Time series analysis: forecasting
Statistics, pp. 297–315, Mar. 1979. and control. San Francisco: Holden-Day, 1970.

0018-9448 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on February 11,2021 at 22:34:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2021.3050342, IEEE
Transactions on Information Theory
24

[29] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. New Jersey: Prentice Hall, 2000.
[30] L. Ljung, System Identification: Theory for the User. Englewood Cliffs, New Jersey: PTR Prentice Hall, 1987.
[31] S. L. Tu, "Sample complexity bounds for the linear quadratic regulator," Ph.D. dissertation, UC Berkeley, 2019.
[32] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," The Journal of the Acoustical Society of America, vol. 50, no. 2B, pp. 637–655, Apr. 1971.
[33] T. Berger, "Rate distortion theory for sources with abstract alphabets and memory," Information and Control, vol. 13, no. 3, pp. 254–273, Sep. 1968.
[34] R. M. Gray, "Rate distortion functions for finite-state finite-alphabet Markov sources," IEEE Transactions on Information Theory, vol. 17, no. 2, pp. 127–134, Mar. 1971.
[35] A. Wyner and J. Ziv, "Bounds on the rate-distortion function for stationary sources with memory," IEEE Transactions on Information Theory, vol. 17, no. 5, pp. 508–513, Sep. 1971.
[36] I. Kontoyiannis, "Pointwise redundancy in lossy data compression and universal lossy data compression," IEEE Transactions on Information Theory, vol. 46, no. 1, pp. 136–152, Jan. 2000.
[37] A. Dembo and I. Kontoyiannis, "Source coding, large deviations, and approximate pattern matching," IEEE Transactions on Information Theory, vol. 48, no. 6, pp. 1590–1615, Jun. 2002.
[38] I. Kontoyiannis, "Pattern matching and lossy data compression on random fields," IEEE Transactions on Information Theory, vol. 49, no. 4, pp. 1047–1051, Apr. 2003.
[39] I. Kontoyiannis and R. Zamir, "Mismatched codebooks and the role of entropy coding in lossy data compression," IEEE Transactions on Information Theory, vol. 52, no. 5, pp. 1922–1938, May 2006.
[40] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: rate-distortion theorems and error exponents for a general source," IEEE Transactions on Information Theory, vol. 53, no. 6, pp. 2154–2179, Jun. 2007.
[41] I. Kontoyiannis and J. Zhang, "Arbitrary source models and Bayesian codebooks in rate-distortion theory," IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2276–2290, Aug. 2002.
[42] A. Dembo and I. Kontoyiannis, "The asymptotics of waiting times between stationary processes, allowing distortion," Annals of Applied Probability, pp. 413–429, May 1999.
[43] M. Madiman, M. Harrison, and I. Kontoyiannis, "Minimum description length vs. maximum likelihood in lossy data compression," in Proceedings of 2004 IEEE International Symposium on Information Theory, Chicago, IL, USA, Jun. 2004, p. 461.
[44] V. Kostina and S. Verdú, "Lossy joint source-channel coding in the finite blocklength regime," IEEE Transactions on Information Theory, vol. 59, no. 5, pp. 2545–2575, May 2013.
[45] V. Y. Tan and O. Kosut, "On the dispersions of three network information theory problems," IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 881–903, Feb. 2014.
[46] S. Watanabe, "Second-order region for Gray–Wyner network," IEEE Transactions on Information Theory, vol. 63, no. 2, pp. 1006–1018, Feb. 2017.
[47] P. Tian and V. Kostina, "The dispersion of the Gauss-Markov source," IEEE Transactions on Information Theory, vol. 65, no. 10, pp. 6355–6384, Oct. 2019.
[48] L. Zhou, V. Y. Tan, and M. Motani, "Discrete lossy Gray-Wyner revisited: Second-order asymptotics, large and moderate deviations," IEEE Transactions on Information Theory, vol. 63, no. 3, pp. 1766–1791, Mar. 2017.
[49] ——, "Second-order and moderate deviations asymptotics for successive refinement," IEEE Transactions on Information Theory, vol. 63, no. 5, pp. 2896–2921, May 2017.
[50] S. Tatikonda, A. Sahai, and S. Mitter, "Stochastic linear control over a communication channel," IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1549–1561, Sep. 2004.
[51] V. Kostina and B. Hassibi, "Rate-cost tradeoffs in control," IEEE Transactions on Automatic Control, vol. 64, no. 11, pp. 4525–4540, Nov. 2019.
[52] R. M. Gray, "In memory of A. H. 'Steen' Gray Jr.," IEEE Signal Processing Magazine, vol. 37, no. 2, pp. 96–100, Mar. 2020.
[53] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, "Learning without mixing: Towards a sharp analysis of linear system identification," in Proceedings of the 31st Conference on Learning Theory, ser. Proceedings of Machine Learning Research, S. Bubeck, V. Perchet, and P. Rigollet, Eds., vol. 75. PMLR, 06–09 Jul. 2018, pp. 439–473.
[54] S. Oymak and N. Ozay, "Non-asymptotic identification of LTI systems from a single trajectory," in 2019 American Control Conference (ACC), Philadelphia, USA, Jul. 2019, pp. 5655–5661.
[55] T. Sarkar and A. Rakhlin, "Near optimal finite time identification of arbitrary linear dynamical systems," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun. 2019, pp. 5610–5618.
[56] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, "Finite time identification in unstable linear systems," Automatica, vol. 96, pp. 342–353, Oct. 2018.
[57] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications. Berlin: Springer-Verlag, 2010.
[58] B. Bercu and A. Touati, "Exponential inequalities for self-normalized martingales with applications," The Annals of Applied Probability, vol. 18, no. 5, pp. 1848–1869, 2008.
[59] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, New Jersey: Prentice Hall, 1971.
[60] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press, 2019, vol. 48.
[61] R. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Transactions on Information Theory, vol. 18, no. 4, pp. 460–473, Jul. 1972.
[62] W. S. Wong and R. W. Brockett, "Systems with finite communication bandwidth constraints–II: Stabilization with limited information feedback," IEEE Transactions on Automatic Control, vol. 44, no. 5, pp. 1049–1053, May 1999.
[63] J. Baillieul, "Feedback designs for controlling device arrays with communication channel bandwidth constraints," in Proceedings of 1999 ARO Workshop on Smart Structures, Pennsylvania State University, State College, PA, USA, Aug. 1999, pp. 48–55.
[64] S. Tatikonda and S. Mitter, "Control under communication constraints," IEEE Transactions on Automatic Control, vol. 49, no. 7, pp. 1056–1068, Jul. 2004.
[65] R. M. Gray, "Toeplitz and Circulant Matrices: A Review," Foundations and Trends® in Communications and Information Theory, vol. 2, no. 3, pp. 155–239, 2006.
[66] U. Grenander and G. Szegö, Toeplitz Forms and their Applications. New York: Chelsea Publishing Company, 1984.
[67] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, New Jersey: John Wiley & Sons, Nov. 2012.
[68] Y. Polyanskiy and Y. Wu, Lecture Notes on Information Theory. [Online]. Available: http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf

Peida Tian is a Ph.D. candidate in the Department of Electrical Engineering at the California Institute of Technology. He received a B.Engg. in Information Engineering and a B.Sc. in Mathematics from the Chinese University of Hong Kong (2016), and an M.S. in Electrical Engineering from Caltech (2017). He is interested in optimization and information theory.
Victoria Kostina (S’12–M’14) received the bachelor’s degree from the Moscow
Institute of Physics and Technology in 2004, the master’s degree from the
University of Ottawa in 2006, and the Ph.D. degree from Princeton University
in 2013. She was affiliated with the Institute for Information Transmission
Problems, Russian Academy of Sciences. In 2014, she joined Caltech, where
she is currently a Professor of electrical engineering. Her research spans
information theory, coding, control, learning, and communications. She received
the Natural Sciences and Engineering Research Council of Canada master’s
scholarship in 2009, the Princeton Electrical Engineering Best Dissertation
Award in 2013, the Simons-Berkeley Research Fellowship in 2015, and the
NSF CAREER Award in 2017.
