

Capacity, Mutual Information, and Coding for Finite-State Markov Channels

Andrea J. Goldsmith, Member, IEEE, and Pravin P. Varaiya, Fellow, IEEE

Abstract—The Finite-State Markov Channel (FSMC) is a discrete time-varying channel whose variation is determined by a finite-state Markov process. These channels have memory due to the Markov channel variation. We obtain the FSMC capacity as a function of the conditional channel state probability. We also show that for i.i.d. channel inputs, this conditional probability converges weakly, and the channel's mutual information is then a closed-form continuous function of the input distribution. We next consider coding for FSMC's. In general, the complexity of maximum-likelihood decoding grows exponentially with the channel memory length. Therefore, in practice, interleaving and memoryless channel codes are used. This technique results in some performance loss relative to the inherent capacity of channels with memory. We propose a maximum-likelihood decision-feedback decoder with complexity that is independent of the channel memory. We calculate the capacity and cutoff rate of our technique, and show that it preserves the capacity of certain FSMC's. We also compare the performance of the decision-feedback decoder with that of interleaving and memoryless channel coding on a fading channel with 4PSK modulation.

Index Terms—Finite-state Markov channels, capacity, mutual information, decision-feedback maximum-likelihood decoding.

Manuscript received February 18, 1994; revised September 15, 1995. This work was supported in part by an IBM graduate fellowship, and in part by the PATH program, Institute of Transportation Studies, University of California, Berkeley. The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Trondheim, Norway, June 1994.
A. J. Goldsmith is with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 USA.
P. P. Varaiya is with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720 USA.
Publisher Item Identifier S 0018-9448(96)02935-5.

I. INTRODUCTION

THIS PAPER extends the capacity and coding results of Mushkin and Bar-David [1] for the Gilbert-Elliot channel to a more general time-varying channel model. The Gilbert-Elliot channel is a stationary two-state Markov chain, where each state is a binary-symmetric channel (BSC), as in Fig. 1. The transition probabilities between states are g and b, respectively, and the crossover probabilities for the "good" and "bad" BSC's are p_G and p_B, respectively, where p_G < p_B. Let x_n ∈ {0, 1}, y_n ∈ {0, 1}, and z_n = x_n ⊕ y_n denote, respectively, the channel input, channel output, and channel error on the nth transmission. In [1], the capacity of the Gilbert-Elliot channel is derived as

C = 1 − lim_{n→∞} E[h(q_n)]   (1)

where h is the entropy function, q_n = p(z_n = 1 | z^{n-1}), q_n converges to q_∞ in distribution, and q_∞ is independent of the initial channel state.

Fig. 1. Gilbert-Elliot channel.

In this paper we derive the capacity of a more general finite-state Markov channel, where the channel states are not necessarily BSC's. We model the channel as a Markov chain S_n which takes values in a finite state space C of memoryless channels with finite input and output alphabets. The conditional input/output probability is thus p(y_n | x_n, S_n), where x_n and y_n denote the channel input and output, respectively. The channel transition probabilities are independent of the input, so our model does not include ISI channels. We refer to the channel model as a finite-state Markov channel (FSMC). If the transmitter and receiver have perfect state information, then the capacity of the FSMC is just the statistical average over all states of the corresponding channel capacities [2]. On the other hand, with no information about the channel state or its transition structure, capacity is reduced to that of the Arbitrarily Varying Channel [3]. We consider the intermediate case, where the channel transition structure of the FSMC is known.

The memory of the FSMC comes from the dependence of the current channel state on past inputs and outputs. As a result, the entropy in the channel output is a function of the channel state conditioned on all past outputs. Similarly, the conditional output entropy given the input is determined by the channel state probability conditioned on all past inputs and outputs. We use this fact to obtain a formula for channel capacity in terms of these conditional probabilities. Our formula can be computed recursively, which significantly reduces its computational complexity. We also show that when the channel inputs are i.i.d., these conditional state probabilities converge in distribution, and their limit distributions are continuous functions of the input distribution. Thus for any i.i.d. input distribution θ, the mutual information of the FSMC is a closed-form continuous function of θ. This continuity allows us to find I_{i.i.d.}, the maximum mutual information relative to all i.i.d. input distributions, using straightforward maximization techniques. Since I_{i.i.d.} ≤ C, our result provides a simple lower bound for the capacity of general FSMC's.

The Gilbert-Elliot channel has two features which facilitate its capacity analysis: its conditional entropy H(Y^n | X^n) is independent of the input distribution, and it is a symmetric channel, so a uniform input distribution induces a uniform output distribution. We extend these properties to a general class of FSMC's and show that for this class, I_{i.i.d.} equals the channel capacity. This class includes channels varying between
any finite number of BSC's, as well as quantized additive white noise (AWN) channels with symmetric PSK inputs and time-varying noise statistics or amplitude fading.

In principle, communication over a finite-state channel is possible at any rate below the channel capacity. However, good maximum-likelihood (ML) coding strategies for channels with memory are difficult to determine, and the decoder complexity grows exponentially with memory length. Thus a common strategy for channels with memory is to disperse the memory using an interleaver: if the span of the interleaver is long, then the cascade of the interleaver, channel, and deinterleaver can be considered memoryless, and coding techniques for memoryless channels may be used [4]. However, this cascaded channel has a lower inherent Shannon capacity than the original channel, since coding is restricted to memoryless channel codes.

The complexity of ML decoding can be reduced significantly without this capacity degradation by implementing a decision-feedback decoder, which consists of a recursive estimator for the channel state distribution conditioned on past inputs and outputs, followed by an ML decoder. We will see that the estimate π_n = p(S_n | x^{n-1}, y^{n-1}) is a sufficient statistic for the ML decoder input, given all past inputs and outputs. Thus the ML decoder operates on a memoryless system. The only additional complexity of this approach over the conventional method of interleaving and memoryless channel encoding is the recursive calculation of π_n. We will calculate the capacity penalty of the decision-feedback decoder for general FSMC's (ignoring error propagation), and show that this penalty vanishes for a certain class of FSMC's.

The most common example of an FSMC is a correlated fading channel. In [5], an FSMC model for Rayleigh fading is proposed, where the channel state varies over binary-symmetric channels with different crossover probabilities. Our recursive capacity formula is a generalization of the capacity found in [5], and we also prove the convergence of their recursive algorithm. Since capacity is generally unachievable for any practical coding scheme, the channel cutoff rate indicates the practical achievable information rate of a channel with coding. The cutoff rate for correlated fading channels with MPSK inputs, assuming channel state information at the receiver, was obtained in [6]; we obtain the same cutoff rate on this channel using decision-feedback decoding.

Most coding techniques for fading channels rely on built-in time diversity in the code to mitigate the fading effect. Code designs of this type can be found in [7]-[9] and the references therein. These codes use the same time-diversity idea as interleaving and memoryless channel encoding, except that the diversity is implemented with the code metric instead of the interleaver. Thus, as with interleaving and memoryless channel encoding, channel correlation information is ignored with these coding schemes. Maximum-likelihood sequence estimation for fading channels without coding has been examined in [10], [11]. However, it is difficult to implement coding with these schemes due to the code delays. In our scheme, coding delays do not result in state decision delays, since the decisions are based on estimates of the coded bits. We can introduce coding in our decision-feedback scheme with a consequent increase in delay and complexity, as we will discuss in Section VI.

The remainder of the paper is organized as follows. In Section II we define the FSMC, and obtain some properties of the channel based on this definition. In Section III we derive a recursive relationship for the distribution of the channel state conditioned on past inputs and outputs, or on past outputs alone. We also show that these conditional state distributions converge to limit distributions for i.i.d. channel inputs. In Section IV we obtain the capacity of the FSMC in terms of the conditional state distributions, and obtain a simple formula for I_{i.i.d.}. Uniformly symmetric variable-noise FSMC's are defined in Section V. For this channel class (which includes the Gilbert-Elliot channel), capacity is achieved with uniform i.i.d. channel inputs. In Section VI we present the decision-feedback decoder, and obtain the capacity and cutoff rate penalties of the decision-feedback decoding scheme. These penalties vanish for uniformly symmetric variable-noise channels. Numerical results for the capacity and cutoff rate of a two-state variable-noise channel with 4PSK modulation and decision-feedback decoding are presented in Section VII.

II. CHANNEL MODEL

Let S_n be the state at time n of an irreducible, aperiodic, stationary Markov chain with state space C = {c_1, ..., c_K}; S_n is positive recurrent and ergodic. The state space C corresponds to K different discrete memoryless channels (DMC's), with common finite input and output alphabets denoted by X and Y, respectively. Let P be the matrix of transition probabilities for S, so

P_{k,m} = p(S_{n+1} = c_m | S_n = c_k)   (2)

independent of n by stationarity. We denote the input and output of the FSMC at time n by x_n and y_n, respectively, and we assume that the channel inputs are independent of its states. We will use the notation

T^n = (T_1, ..., T_n)

for T = x, y, or S.
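The channel model above is straightforward to capture in code. The following Python sketch (our own illustration, not from the paper) packages the transition matrix P and the per-state DMC's p_k(y | x) into a small simulator; the class and field names are assumptions made for this example.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FSMC:
    """Finite-state Markov channel of Section II (names are illustrative).

    P   : K x K state transition matrix, P[k, m] = p(S_{n+1}=c_m | S_n=c_k)
    dmc : K x |X| x |Y| array of per-state DMC's, dmc[k, x, y] = p_k(y | x)
    """
    P: np.ndarray
    dmc: np.ndarray

    def step(self, s, x, rng):
        """Emit one output from state s with input x, then draw the next state."""
        y = rng.choice(self.dmc.shape[2], p=self.dmc[s, x])
        s_next = rng.choice(self.P.shape[0], p=self.P[s])
        return y, s_next
```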


The FSMC is defined by its conditional input/output probability at time n, which is determined by the channel state at time n:

p(y_n | x_n, S_n) = Σ_{k=1}^K p_k(y_n | x_n) I[S_n = c_k]   (3)

where p_k(y | x) = p(y | x, S = c_k), and I[·] denotes the indicator function (I[S_n = c_k] = 1 if S_n = c_k and 0 otherwise). The memory of the FSMC is due to the Markov structure of the state transitions, which leads to a dependence of S_n on previous values. The FSMC is memoryless if and only if P_{km} = P_{jm} for all k, j, and m. The finite-state Markov channel is illustrated in Fig. 2.

Fig. 2. Finite-state Markov channel.

By assumption, the state at time n+1 is independent of previous input/output pairs when conditioned on S_n:

p(S_{n+1} | S_n, x^n, y^n) = p(S_{n+1} | S_n).   (4)

Since the channels in C are memoryless,

p(y_{n+1} | S_{n+1}, x_{n+1}, S^n, x^n, y^n) = p(y_{n+1} | S_{n+1}, x_{n+1}).   (5)

If we also assume that the x_n's are independent, then the corresponding relations (6)-(8) hold with the x terms removed.

III. CONDITIONAL STATE DISTRIBUTION

The conditional channel state distribution is the key to determining the capacity of the FSMC through a recursive algorithm. It is also a sufficient statistic for the input given all past inputs and outputs, thus allowing for the reduced complexity of the decision-feedback decoder. In this section we show that the state distribution conditioned on past input/output pairs can be calculated using a recursive formula. A similar formula is derived for the state distribution conditioned on past outputs alone, under the assumption of independent channel inputs. We also show that these state distributions converge weakly under i.i.d. inputs, and the resulting limit distributions are continuous functions of the input distribution.

We denote these conditional state distributions by the K-dimensional random vectors π_n = (π_n(1), ..., π_n(K)) and ρ_n = (ρ_n(1), ..., ρ_n(K)), respectively, where

ρ_n(k) = p(S_n = c_k | y^{n-1})   (9)

and

π_n(k) = p(S_n = c_k | x^{n-1}, y^{n-1}).   (10)

The following recursive formula for π_n is derived in Appendix I:

π_{n+1} = (π_n D(x_n, y_n) P) / (π_n D(x_n, y_n) 1)   (11)

where D(x_n, y_n) is a diagonal K × K matrix with kth diagonal term p_k(y_n | x_n), and 1 = (1, ..., 1)^T is a K-dimensional vector. Equation (11) defines a recursive relation for π_n, which takes values on the state space

Δ = {α ∈ R^K : α(k) ≥ 0, Σ_{k=1}^K α(k) = 1}.

The initial value for π_n is

π_0 = (p(S_0 = c_1), ..., p(S_0 = c_K))

and its transition probabilities are

p(π_{n+1} = α | π_n = β) = Σ_{x_n ∈ X} Σ_{y_n ∈ Y} I[(x_n, y_n) : f(x_n, y_n, β) = α] p(y_n | π_n = β, x_n) p(x_n)   (12)

where f(x_n, y_n, β) denotes the update map defined by the right-hand side of (11). Note that (12) is independent of n for stationary inputs.

For independent inputs, there is a similar recursive formula for ρ_n:

ρ_{n+1} = (ρ_n B(y_n) P) / (ρ_n B(y_n) 1)   (13)

where B(y_n) is a diagonal K × K matrix with kth diagonal term p(y_n | S_n = c_k).¹ The derivation of (13) is similar to that of (11) in Appendix I, using (8) instead of (5) and removing all x terms. The variable ρ_n also takes values on the state space Δ, with initial value ρ_0 = π_0 and transition probabilities

p(ρ_{n+1} = α | ρ_n = β) = Σ_{y_n ∈ Y} I[f̂(y_n, β) = α] p(y_n | ρ_n = β)   (14)

where f̂(y_n, β) denotes the update map defined by the right-hand side of (13).

¹Note that B(y_n) has an implicit dependence on the distribution of x_n.
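As a concrete illustration of the recursions (11) and (13), the following Python sketch (ours, not the authors') updates π_n and ρ_n one step at a time, using the same array conventions as the sketch in Section II; the two-state example at the bottom is an illustrative assumption.

```python
import numpy as np

def pi_update(pi, P, dmc, x, y):
    """One step of recursion (11): pi_{n+1} = pi_n D(x_n,y_n) P / (pi_n D(x_n,y_n) 1)."""
    d = dmc[:, x, y]                 # diagonal of D(x_n, y_n): p_k(y_n | x_n)
    return (pi * d) @ P / (pi @ d)

def rho_update(rho, P, dmc, theta, y):
    """One step of recursion (13) for i.i.d. inputs with distribution theta."""
    b = dmc[:, :, y] @ theta         # diagonal of B(y_n): p(y_n | S_n = c_k)
    return (rho * b) @ P / (rho @ b)

if __name__ == "__main__":
    # Illustrative two-state channel whose states are BSC's (Gilbert-Elliot-like).
    g, b = 0.1, 0.1                  # g = p(B->G), b = p(G->B), as in Section VII
    P = np.array([[1 - b, b], [g, 1 - g]])      # state order (G, B)
    pG, pB = 0.01, 0.3               # crossover probabilities (illustrative)
    dmc = np.array([[[1 - pG, pG], [pG, 1 - pG]],
                    [[1 - pB, pB], [pB, 1 - pB]]])
    theta = np.array([0.5, 0.5])     # uniform i.i.d. inputs
    pi = np.array([g, b]) / (g + b)  # stationary initial distribution
    print(pi_update(pi, P, dmc, x=0, y=1))
```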


As for π_n, the transition probabilities in (14) are independent of n when the inputs are stationary.

We show in Appendix II that for i.i.d. inputs, π_n and ρ_n are Markov chains that converge in distribution to limits which are independent of the initial channel state, under some mild constraints on C. These convergence results imply that for any bounded continuous function f, the following limits exist and are equal for all i:

lim_{n→∞} E[f(π_n)] = lim_{n→∞} E[f(π_n^i)]   (15)

and

lim_{n→∞} E[f(ρ_n)] = lim_{n→∞} E[f(ρ_n^i)]   (16)

where

π_n^i = p(S_n | x^{n-1}, y^{n-1}, S_0 = c_i)

and

ρ_n^i = p(S_n | y^{n-1}, S_0 = c_i).

This convergence allows us to obtain a closed-form solution for the mutual information under i.i.d. inputs. We also show in Lemmas A2.3 and A2.5 of Appendix II that the limit distributions for π and ρ are continuous functions of the input distribution.

Lemmas A2.6 and A2.7 of Appendix II show the surprising result that π_n and ρ_n are not necessarily Markov chains when the input distribution is Markov. Since the weak convergence of π_n and ρ_n requires this Markov property, (15) and (16) are not valid for general Markov inputs.

IV. ENTROPY, MUTUAL INFORMATION, AND CAPACITY

We now derive the capacity of the FSMC based on the distributions of π_n and ρ_n. We also obtain some additional properties of the entropy and mutual information when the channel inputs are i.i.d.

By definition, the Markov chain S_n is aperiodic and irreducible over a finite state space, so the effect of its initial state dies away exponentially with time [12]. Thus the FSMC is an indecomposable channel. The capacity of an indecomposable channel is independent of its initial state, and is given by [13, Theorem 4.6.4]

C = lim_{n→∞} max_{P(X^n)} (1/n) I(X^n; Y^n)   (17)

where I(·;·) denotes mutual information and P(X^n) denotes the set of all input distributions on X^n. The mutual information can be written as

I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n)   (18)

where H(Y) = E[−log p(y)] and H(Y | X) = E[−log p(y | x)]. It is easily shown [14] that

H(Y^n) = Σ_{i=1}^n H(Y_i | Y^{i-1})   (19)

and

H(Y^n | X^n) = Σ_{i=1}^n H(Y_i | X_i, Y^{i-1}, X^{i-1}).   (20)

The following lemma, proved in Appendix III, allows the mutual information to be written in terms of π_n and ρ_n.

Lemma 4.1:

H(Y_n | X_n, Y^{n-1}, X^{n-1}) = H(Y_n | X_n, π_n)   (21)

and

H(Y_n | Y^{n-1}) = H(Y_n | ρ_n)   (22)

where

H(Y_n | X_n, π_n) = E[−log Σ_{k=1}^K p_k(y_n | x_n) π_n(k)]

and

H(Y_n | ρ_n) = E[−log Σ_{k=1}^K p(y_n | S_n = c_k) ρ_n(k)].

Using this lemma in (19) and (20) and substituting into (18) yields the following theorem.

Theorem 4.1: The capacity of the FSMC is given by

C = lim_{n→∞} max_{P(X^n)} (1/n) Σ_{i=1}^n { E[−log Σ_{k=1}^K p(y_i | S_i = c_k) ρ_i(k)] − E[−log Σ_{k=1}^K p_k(y_i | x_i) π_i(k)] }   (23)

where the dependence on θ ∈ P(X^n) of the distributions for π_i, ρ_i, and y_i is implicit. This capacity expression is easier to calculate than Gallager's formula (17), since the π_n terms can be computed recursively. The recursive calculation for ρ_i requires independent inputs. However, for many channels of interest H(Y_i | ρ_i) will be a constant independent of the input distribution (such channels are discussed in Section V). For these channels, the capacity calculation reduces to minimizing the second term in (23) relative to the input distribution, and the complexity of this minimization is greatly reduced when π_i can be calculated easily.

Using Lemma 4.1, we can also express the capacity as

C = lim_{n→∞} max_{P(X^n)} (1/n) Σ_{i=1}^n [H(Y_i | ρ_i) − H(Y_i | X_i, π_i)].   (24)

Although [13, Theorem 4.6.4] guarantees the convergence of (24), the random vectors π_n and ρ_n do not necessarily converge in distribution for general input distributions. We proved this convergence in Section III for i.i.d. inputs. We now derive some additional properties of the entropy and mutual information under this input restriction. These properties are summarized in Lemmas 4.2-4.7 below, which are proved in Appendix IV.
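The entropy terms that enter (23) can be estimated numerically for a fixed i.i.d. input distribution by running the recursions of Section III along a simulated sample path and averaging −log p(y_n | x_n, π_n) and −log p(y_n | ρ_n). The sketch below is a minimal Monte Carlo version of that idea under our own naming conventions; it is not the authors' computation, and the ergodic averaging implicitly relies on the convergence results of Section III. Entropies are in bits.

```python
import numpy as np

def simulate_iid_mi(P, dmc, theta, n=200_000, seed=0):
    """Estimate (H(Y|rho), H(Y|X,pi)) for i.i.d. inputs theta by simulation.

    Their difference estimates I_theta, which lower-bounds capacity (I_iid <= C).
    P: KxK transition matrix; dmc[k, x, y] = p_k(y | x); theta: input distribution.
    """
    rng = np.random.default_rng(seed)
    K, nX, nY = dmc.shape
    evals, evecs = np.linalg.eig(P.T)                 # stationary state distribution
    stat = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
    stat /= stat.sum()
    pi, rho = stat.copy(), stat.copy()
    s = rng.choice(K, p=stat)
    h_y_rho = h_y_x_pi = 0.0
    for _ in range(n):
        x = rng.choice(nX, p=theta)
        y = rng.choice(nY, p=dmc[s, x])
        h_y_x_pi += -np.log2(float(pi @ dmc[:, x, y]))          # sum_k p_k(y|x) pi(k)
        h_y_rho += -np.log2(float(rho @ (dmc[:, :, y] @ theta)))  # sum_k p(y|c_k) rho(k)
        d = dmc[:, x, y]                                        # recursion (11)
        pi = (pi * d) @ P / (pi @ d)
        b = dmc[:, :, y] @ theta                                # recursion (13)
        rho = (rho * b) @ P / (rho @ b)
        s = rng.choice(K, p=P[s])
    return h_y_rho / n, h_y_x_pi / n

# Usage: h_rho, h_pi = simulate_iid_mi(P, dmc, np.ones(nX)/nX)
#        print("I_theta estimate:", h_rho - h_pi)
```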


Lemma 4.2: When the channel inputs are stationary,

H(Y_{n+1} | X_{n+1}, X^n, Y^n) ≤ H(Y_n | X_n, X^{n-1}, Y^{n-1}),
H(Y_n | X_n, X^{n-1}, Y^{n-1}, S_0) ≤ H(Y_n | X_n, X^{n-1}, Y^{n-1}),
H(Y_n | X_n, X^{n-1}, Y^{n-1}, S_0) ≤ H(Y_{n+1} | X_{n+1}, X^n, Y^n, S_0).   (25)

Lemma 4.3: For i.i.d. input distributions, the following limits exist and are equal:

lim_{n→∞} H(Y_n | X_n, X^{n-1}, Y^{n-1}) = lim_{n→∞} H(Y_n | X_n, X^{n-1}, Y^{n-1}, S_0).   (26)

We now consider the entropy in the output alone.

Lemma 4.4: For stationary inputs,

H(Y_{n+1} | Y^n) ≤ H(Y_n | Y^{n-1}),
H(Y_n | Y^{n-1}, S_0) ≤ H(Y_n | Y^{n-1}),
H(Y_n | Y^{n-1}, S_0) ≤ H(Y_{n+1} | Y^n, S_0).   (27)

Lemma 4.5: For i.i.d. input distributions, the following limits exist and are equal:

lim_{n→∞} H(Y_n | Y^{n-1}) = lim_{n→∞} H(Y_n | Y^{n-1}, S_0).   (28)

The next lemma is proved using the convergence results for π_n and ρ_n and a change of variables in the entropy expressions (26) and (28).

Lemma 4.6: For any i.i.d. input distribution θ ∈ P(X),

lim_{n→∞} H(Y_n | ρ_n^θ) = ∫_Δ Σ_{y∈Y} −log(p^θ(y | ρ)) p^θ(y | ρ) ν^θ(dρ)

and

lim_{n→∞} H(Y_n | X_n, π_n^θ) = ∫_Δ Σ_{y∈Y} Σ_{x∈X} −log(p(y | x, π)) p(y | x, π) θ(x) μ^θ(dπ)   (29)

where the θ superscript on ρ_n, π_n, and p(y | ρ) shows their dependence on the input distribution, ν^θ denotes the limiting distribution of ρ_n^θ, and μ^θ denotes the limiting distribution of π_n^θ. Here p(y | x, π) = Σ_{k=1}^K p_k(y | x) π(k) and p^θ(y | ρ) = Σ_{k=1}^K Σ_{x∈X} p_k(y | x) θ(x) ρ(k).

We now combine the above lemmas to get a closed-form expression for the mutual information under i.i.d. inputs.

Theorem 4.2: For any i.i.d. input distribution θ ∈ P(X), the average mutual information per channel use is given by

I_θ ≜ lim_{n→∞} (1/n) I_θ(Y^n; X^n)
   = ∫_Δ Σ_{y∈Y} −log(p^θ(y | ρ)) p^θ(y | ρ) ν^θ(dρ) − ∫_Δ Σ_{y∈Y} Σ_{x∈X} −log(p(y | x, π)) p(y | x, π) θ(x) μ^θ(dπ).   (30)

Proof: From (18),

I(Y^n; X^n) = H(Y^n) − H(Y^n | X^n).

If we fix θ ∈ P(X),

H(Y^n | X^n) = Σ_{i=1}^n H(Y_i | X_i, Y^{i-1}, X^{i-1})   (31)

by (20), and the terms of the summation are nonnegative and monotonically decreasing in i by Lemma 4.2. Thus

lim_{n→∞} (1/n) H(Y^n | X^n) = lim_{n→∞} H(Y_n | X_n, X^{n-1}, Y^{n-1}).   (32)

Similarly, from (19),

H(Y^n) = Σ_{i=1}^n H(Y_i | Y^{i-1})   (33)

and by Lemma 4.4, the terms of this summation are nonnegative and monotonically decreasing in i. Hence

lim_{n→∞} (1/n) H(Y^n) = lim_{n→∞} H(Y_n | Y^{n-1}).   (34)

Applying Lemmas 4.1 and 4.6 completes the proof. □

It is easily shown that since ν^θ and μ^θ are continuous functions of θ, I_θ is also. Moreover, the calculation of I_θ is relatively simple, since asymptotic values of μ and ν are obtained using the recursive formulas (12) and (14), respectively. For the channel described in Section VII, these recursive formulas closely approach their final values after only 40 iterations. Unfortunately, this simplified formula for mutual information under i.i.d. inputs cannot be extended to Markov inputs, since π_n and ρ_n are no longer Markov chains under these conditions.

We now consider the average mutual information maximized over all i.i.d. input distributions. Define

I_{i.i.d.} = max_{θ ∈ P(X)} I_θ.   (35)

Since P(X) is compact and I_θ continuous in θ, I_{i.i.d.} achieves its supremum on P(X), and the maximization can be done using standard techniques for continuous functions. Moreover, it is easily shown that I_{i.i.d.} ≤ C. Thus (35) provides a relatively simple formula to lower-bound the capacity of general FSMC's.

The next section will describe a class of channels for which uniform i.i.d. channel inputs achieve channel capacity. Thus I_{i.i.d.} = C, and the capacity can be found using the formula of Theorem 4.2. This channel class includes fading or variable-noise channels with symmetric PSK inputs, as well as channels which vary over a finite set of BSC's.


V. UNIFORMLY SYMMETRIC VARIABLE-NOISE CHANNELS

In this section we define two classes of FSMC's: uniformly symmetric channels and variable-noise channels. The mutual information and capacity of these channel classes have additional properties which we outline in the lemmas below. Moreover, we will show in the next section that the decision-feedback decoder achieves capacity for uniformly symmetric variable-noise FSMC's.

Definition: For a DMC, let M denote the matrix of input/output probabilities

M_{ij} = p(y = j | x = i),  j ∈ Y, i ∈ X.

A discrete memoryless channel is output-symmetric if the rows of M are permutations of each other, and the columns of M are permutations of each other.²

²Symmetric channels, defined in [13, p. 94], are a more general class of memoryless channels; an output-symmetric channel is a symmetric channel with a single output partition.

Definition: An FSMC is uniformly symmetric if every channel c_k ∈ C is output-symmetric.

The next lemma, proved in Appendix V, shows that for uniformly symmetric FSMC's, the conditional output entropy is maximized with uniform i.i.d. inputs.

Lemma 5.1: For uniformly symmetric FSMC's and any initial state S_0 = c_i, H(Y_n | ρ_n), H(Y_n | ρ_n^i), H(Y_n | π_n), and H(Y_n | π_n^i) are all maximized for a uniform and i.i.d. input distribution, and these maximum values equal log |Y|.

Definition: Let X_n and Y_n denote the input and output, respectively, of an FSMC. We say that an FSMC is a variable-noise channel if there exists a function φ such that for Z_n = φ(X_n, Y_n), p(Z^n | X^n) = p(Z^n), and Z^n is a sufficient statistic for S^n (so S^n is independent of X^n and Y^n given Z^n). Typically, φ is associated with an additive noise channel, as we discuss in more detail below.

If Z^n is a sufficient statistic for S^n, then

π_n = p(S_n | x^{n-1}, y^{n-1}) = p(S_n | x^{n-1}, y^{n-1}, z^{n-1}) = p(S_n | z^{n-1}).   (36)

Using (36) and replacing the pairs (X_n, Y_n) with Z_n in the derivation of Appendix I, we can simplify the recursive calculation of π_n:

π_{n+1} = (π_n D(z_n) P) / (π_n D(z_n) 1)   (37)

where D(z_n) is a diagonal K × K matrix with kth diagonal term p(z_n | S_n = c_k). The transition probabilities are also simplified:

p(π_{n+1} = α | π_n = β) = Σ_{z_n ∈ Z} I[f(z_n, β) = α] p(z_n | π_n = β).   (38)

The next lemma, proved in Appendix V, shows that for a uniformly symmetric variable-noise channel, the output entropy conditioned on the input is independent of the input distribution.

Lemma 5.2: For uniformly symmetric variable-noise FSMC's and all i, H(Y_n | X_n, π_n) and H(Y_n | X_n, π_n^i) do not depend on the input distribution.

Consider an FSMC where each c_k ∈ C is an AWN channel with noise density n_k. If we let Z = Y − X, then it is easily shown that this is a variable-noise channel. However, such channels have an infinite output alphabet. In general, the output of an AWN channel is quantized to the nearest symbol in a finite output alphabet: we call this the quantized AWN (Q-AWN) channel.

If the Q-AWN channel has a symmetric multiphase input alphabet of constant amplitude and output phase quantization [4, p. 80], then it is easily checked that p_k(y | x) depends only on p_k(|y − x|), which in turn depends only on the noise density n_k. Thus it is a variable-noise channel.³ We show in Appendix VI that variable-noise Q-AWN channels with the same input and output alphabets are also uniformly symmetric. Uniformly symmetric variable-noise channels have the property that I_{i.i.d.} equals the channel capacity, as we show in the following theorem.

³If the input alphabet of a Q-AWN channel is not symmetric or the input symbols have different amplitudes, then the distribution of Z = |Y − X| will depend on the input. To see this, consider a Q-AWN channel with a 16-QAM input/output alphabet (so the output is quantized to the nearest input symbol). There are four different sets of Z = |Y − X| values, depending on the amplitude of the input symbol. Thus the distribution of Z over all its possible values (the union of all four sets) will change, depending on the amplitude of the input symbol.

Theorem 5.1: Capacity of uniformly symmetric variable-noise channels is achieved with an input distribution that is uniform and i.i.d. The capacity is given by

C = log |Y| − ∫_Δ Σ_{y∈Y} Σ_{x∈X} −log(p(y | x, π)) p(y | x, π) (1/|X|) μ(dπ)

where μ is the limiting distribution for π_n under uniform i.i.d. inputs. Moreover, C = lim_{n→∞} C_n = lim_{n→∞} C_n^i for all i, where

C_n = max_{P(X^n)} [H(Y_n | ρ_n) − H(Y_n | X_n, π_n)]

increases with n, and

C_n^i = max_{P(X^n)} [H(Y_n | ρ_n^i) − H(Y_n | X_n, π_n^i)]

decreases with n.

Proof: From Lemmas 5.1 and 5.2, C_n, C_n^i, and C are all maximized with uniform i.i.d. inputs. With this input distribution

C_n = log |Y| − H(Y_n | X_n, π_n)

and

C_n^i = log |Y| − H(Y_n | X_n, π_n^i).

Applying Lemmas 4.2 and 4.3, we get that H(Y_n | X_n, π_n) decreases with n, H(Y_n | X_n, π_n^i) increases with n, and both converge to the same limit. Finally, under uniform i.i.d. inputs,

lim_{n→∞} (1/n) H(Y^n | X^n) = lim_{n→∞} H(Y_n | X_n, π_n)

by Lemma 4.1 and (32). Applying Lemma 4.6 to lim_{n→∞} H(Y_n | X_n, π_n) completes the proof. □
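A small numerical companion to this section (our own sketch, not code from the paper): the first function checks the output-symmetry definition above, and the second estimates C = log |Y| − lim_n H(Y_n | X_n, π_n) from Theorem 5.1 by simulation under uniform i.i.d. inputs, assuming every per-state DMC is output-symmetric and the channel is variable-noise. Capacity is in bits per channel use.

```python
import numpy as np

def is_output_symmetric(M, tol=1e-12):
    """Rows of M are permutations of each other and columns of M are
    permutations of each other (the output-symmetry definition)."""
    rows = np.sort(M, axis=1)
    cols = np.sort(M, axis=0)
    return (np.allclose(rows, rows[0], atol=tol)
            and np.allclose(cols, cols[:, [0]], atol=tol))

def capacity_usvn(P, dmc, n=200_000, seed=1):
    """Simulation estimate of C = log|Y| - lim H(Y_n|X_n,pi_n) (Theorem 5.1)."""
    rng = np.random.default_rng(seed)
    K, nX, nY = dmc.shape
    assert all(is_output_symmetric(dmc[k]) for k in range(K))
    evals, evecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
    pi /= pi.sum()
    s = rng.choice(K, p=pi)
    h = 0.0
    for _ in range(n):
        x = rng.integers(nX)                       # uniform i.i.d. input
        y = rng.choice(nY, p=dmc[s, x])
        h += -np.log2(float(pi @ dmc[:, x, y]))    # -log p(y | x, pi)
        d = dmc[:, x, y]
        pi = (pi * d) @ P / (pi @ d)               # recursion (11)
        s = rng.choice(K, p=P[s])
    return np.log2(nY) - h / n
```

Applied to a set of BSC states, this reproduces the Gilbert-Elliot-type capacity 1 − lim E[h(·)] noted below as a corollary of Theorem 5.1.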


The BSC is equivalent to a binary-input Q-AWN channel with binary quantization [4]. Thus an FSMC where c_k indexes a set of BSC's with different crossover probabilities is a uniformly symmetric variable-noise channel. Therefore, both [1, Proposition 4] and the capacity formula obtained in [5] are corollaries of Theorem 5.1.

VI. DECISION-FEEDBACK DECODER

Fig. 3. System model.

Fig. 4. Decision-feedback decoder.

A block diagram for a system with decision-feedback decoding is depicted in Fig. 3. The system is composed of a conventional (block or convolutional) encoder for memoryless channels, block interleaver, FSMC, decision-feedback decoder, and deinterleaver. Fig. 4 outlines the decision-feedback decoder design, which consists of a channel state estimator followed by an ML decoder. We will show in this section that if we ignore error propagation, a system employing this decision-feedback decoding scheme on uniformly symmetric variable-noise channels is information-lossless: it has the same capacity as the original FSMC, given by (30) for i.i.d. uniform inputs. Moreover, we will see that the output of the state estimator is a sufficient statistic for the current output given all past inputs and outputs, which reduces the system of Fig. 3 to a discrete memoryless channel. Thus the ML input sequence is determined on a symbol-by-symbol basis, eliminating the complexity and delay of sequence decoders.

The interleaver works as follows. The output of the encoder is stored row by row in a J × L interleaver, and transmitted over the channel column by column. The deinterleaver performs the reverse operation. Because the effect of the initial channel state dies away, the received symbols within any row of the deinterleaver become independent as J becomes infinite. However, the symbols within any column of the deinterleaver are received from consecutive channel uses, and are thus dependent. This dependence is called the latent channel memory, and the state estimator enables the ML decoder to make use of this memory.

Specifically, the state estimator uses the recursive relationship of (11) to estimate π_n. It will be shown below that the ML decoder operates on a memoryless system, and can therefore determine the ML input sequence on a per-symbol basis. The input to the ML decoder is the channel output y_n and the state estimate π̂_n, and its output is the x_n which maximizes log p(y_n, π̂_n | x_n), assuming equally likely input symbols.⁴ The soft-decision decoder uses conventional techniques (e.g., Viterbi decoding) with branch metrics

log p(y_n, π̂_n | x_n).   (43)

⁴If the x_n are not equally likely, then log p(x_n) must be added to the decoder metric.

We now evaluate the information, capacity, and cutoff rates of a system using the decision-feedback decoder, assuming π̂_n = π_n (i.e., ignoring error propagation). We will use the notation y_{jl} ≜ y_n to explicitly denote that y_n is in the jth row and lth column of the deinterleaver. Similarly, π_{jl} ≜ π_n and x_{jl} ≜ x_n denote, respectively, the state estimate and interleaver input corresponding to y_{jl}. Assume now that the state estimator is reset every J iterations so, for each l, the state estimator goes through j recursions of (11) to calculate π_{jl}. By (12), this recursion induces a distribution p(π_{jl}) on π_{jl} that depends only on p(X^{j-1}). Thus the system up to the output of the state estimator is equivalent to a set of parallel π-output channels, where the π-output channel is defined, for a given j, by the input x_{jl}, the output pair (y_{jl}, π_{jl}), and the input/output probability

p(y_{jl}, π_{jl} | x_{jl}) = Σ_k p_k(y_{jl} | x_{jl}) π_{jl}(k) p(π_{jl}).   (44)

For each j, the π-output channel is the same for l = 1, 2, ..., L, and therefore there are J different π-output channels, each used L times. We thus drop the l subscript of x_{jl}, y_{jl}, and π_{jl} in the decoder block diagram of Fig. 4. The first π-output channel (j = 1) is equivalent to the FSMC with interleaving and memoryless channel encoding, since the estimator is reset and therefore π_{1l} = π_0, 1 ≤ l ≤ L.

The jth π-output channel is discrete, since x_{jl} and y_{jl} are taken from finite alphabets, and since π_{jl} can have at most |X|^j |Y|^j different values. It is also asymptotically memoryless with deep interleaving (large J), which we prove in Appendix VII. Finally, we show in Appendix VIII that for a fixed input distribution, the J π-output channels are independent, and the average mutual information of the parallel channels is

(1/J) Σ_{j=1}^J I(X_j; Y_j, π_j)   (45)

where

C_j = H(Y_j | π_j) − H(Y_j | X_j, π_j)   (47)

denotes the jth term evaluated at the maximizing distribution p(X^J). The capacity of the decision-feedback decoding system is then

C_df = lim_{J→∞} max_{P(X^J)} (1/J) Σ_{j=1}^J C_j.   (48)

Comparing (48) to (24) gives the capacity penalty (49) of the decision-feedback decoder. For uniformly symmetric variable-noise channels, uniform i.i.d. inputs achieve both C and C_df, and with this input C − C_df = 0. Thus the decision-feedback decoder preserves the inherent capacity of such channels.

Although capacity gives the maximum data rate for any ML encoding scheme, established coding techniques generally operate at or below the channel cutoff rate [4]. Since the π-output channels are independent for a fixed input distribution p(X^J), the random coding exponent for the parallel set is

E_0(1, p(X^J)) = Σ_{j=1}^J R_j   (50)

where R_j is the cutoff rate of the jth π-output channel (51). The cutoff rate of the decision-feedback decoding system is

R_df = lim_{J→∞} max_{P(X^J)} (1/J) Σ_{j=1}^J R_j.   (52)

We show in Appendix IX that for uniformly symmetric variable-noise channels, the maximizing input distribution in (52) is uniform and i.i.d., the resulting value of R_j is increasing in j, and the cutoff rate R_df is given by (53), where μ is the invariant distribution for π under i.i.d. uniform inputs.

Our calculations throughout this section have ignored the impact of error propagation. Referring to Fig. 4, error propagation occurs when the decision-feedback decoder output for the maximum-likelihood input symbol x̂_n is in error, which will then cause the estimate of π_{n+1} to be in error. Since x̂_n is the value of the coded symbol, the error probability for x̂_n does not benefit from any coding gain. Unfortunately, since block or convolutional decoding introduces delay, the post-decoding decisions cannot be fed back to the decision-feedback decoder to update the π_n value. This is exactly the difficulty faced by an adaptive decision-feedback equalizer (DFE), where decoding decisions are used to update the DFE tap coefficients [16]. New methods to combine DFE's and coding have recently been proposed, and several of these methods can be used to obtain some coding gain in the estimate of x_n fed back through our decision-feedback decoder. In particular, the structure of our decision-feedback decoder already includes the interleaver/deinterleaver pair proposed by Eyuboglu for DFE's with coding [17]. In his method, this pair introduces a periodic delay in the received bits such that delayed reliable decisions can be used for feedback. Applying this idea to our system effectively combines the decision-feedback decoder, deinterleaver, and decoder. Specifically, the symbols transmitted over each π-output channel are decoded together, and the symbol decisions output from the decoder are then used by the decision-feedback decoder to update the π values of the subsequent π-output channel. The complexity and delay of this design increases linearly with the block length of the π-output channel code, but it is independent of the channel memory, since this memory is captured in the sufficient statistic π_n.

Another approach to implement coding gain uses soft decisions on the received symbols to update π_n, then later corrects this initial π_n estimate if the decoded symbols differ from their initial estimates [18]. This method truncates the number of symbols affected by an incorrect decision, at a cost of increased complexity to recalculate and update the π_n values. Finally, decision-feedback decoding can be done in parallel, where each parallel path corresponds to a different estimate of the received symbol. The number of parallel paths will grow exponentially in this case; however, we may be able to apply some of the methods outlined in [19] and [20] to reduce the number of paths sustained through the trellis.

VII. TWO-STATE VARIABLE-NOISE CHANNEL

Fig. 5. Two-state fading channel.

We now compute the capacity and cutoff rates of a two-state Q-AWN channel with variable SNR, Gaussian noise, and 4PSK modulation. The variable SNR can represent different fading levels in a multipath channel, or different noise and/or interference levels. The model is shown in Fig. 5. The input to the channel is a 4PSK symbol, to which noise of variance n_G or n_B is added, depending on whether the channel is in state G (good) or B (bad). We assume that the SNR is 10 dB for channel G, and −5 dB for channel B. The channel output is quantized to the nearest input symbol and, since this is a uniformly symmetric variable-noise channel, the capacity and cutoff rates are achieved with uniform i.i.d. inputs. The state transition probabilities are depicted in Fig. 5. We assume a stationary initial distribution of the state process, so p(S_0 = G) = g/(g+b) and p(S_0 = B) = b/(g+b).

Fig. 6 shows the iterative calculation of (12) for p(a ≤ π_n(G) < a + da), with g = b = 0.1, da = 0.01, and n = 0, 5, 10, 15.

Fig. 6. Recursive distribution of π_n.

In this example, the difference of subsequent distributions after 40 recursions is below the quantization level (da = 0.01) of the graph. Fig. 7 shows the capacity (C_j) and cutoff rate (R_j) of the jth π-output channel, given by (47) and (52), respectively. Note that C_{j=1} and R_{j=1} in this figure are the capacity and cutoff rate of the FSMC with interleaving and memoryless channel encoding. Thus the difference between the initial and final values of C_j and R_j indicates the performance improvement of the decision-feedback decoder over conventional techniques.

Fig. 7. Capacity and cutoff rate for jth π-output channel.

For this two-state model, the channel memory can be quantified by the parameter μ = 1 − g − b, since for a ∈ {G, B}

p(S_n = a | S_0 = a) − p(S_n = a | S_0 ≠ a) = μ^n.   (54)


Fig. 8. Decoder performance versus channel memory.

Fig. 9. Decoder performance versus g.

In Fig. 8 we show the decision-feedback decoder's capacity and cutoff rates (C_df and R_df, respectively) as functions of μ. We expect these performance measures to increase as μ increases, since more latency in the channel should improve the accuracy of the state estimator; Fig. 8 confirms this hypothesis. Finally, in Fig. 9 we show the decision-feedback decoder's capacity and cutoff rates as functions of g. The parameter g is inversely proportional to the average number of consecutive B channel states (which corresponds to a 15 dB fade); thus Fig. 9 can be interpreted as the relationship between the maximum transmission rate and the average fade duration.

VIII. SUMMARY

We have derived the Shannon capacity of an FSMC as a function of the conditional probabilities

π_n(k) = p(S_n = c_k | x^{n-1}, y^{n-1})

and

ρ_n(k) = p(S_n = c_k | y^{n-1}).

We also showed that with i.i.d. inputs, these conditional probabilities converge weakly, and the channel's mutual information under this input constraint is then a closed-form continuous function of the input distribution. This continuity allows I_{i.i.d.}, the maximum mutual information of the FSMC over all i.i.d. inputs, to be found using standard maximization techniques. Additional properties of the entropy and capacity for uniformly symmetric variable-noise channels were also derived.

We then proposed an ML decision-feedback decoder, which calculates recursive estimates of π_n from the channel output and the decision-feedback decoder output. We showed that for asymptotically deep interleaving, a system employing the decision-feedback decoder is equivalent to a discrete memoryless channel with input x_n and output (y_n, π_n). Thus the ML sequence decoding can be done on a symbol-by-symbol basis. Moreover, the decision-feedback decoder preserves the inherent capacity of uniformly symmetric variable-noise channels, assuming the effect of error propagation is negligible. This class of FSMC's includes fading or variable-noise channels with symmetric PSK inputs as well as channels which vary over a finite set of BSC's. For general FSMC's, we obtained the capacity and cutoff rate penalties of the decision-feedback decoding scheme.

We also presented numerical results for the performance of the decision-feedback decoder on a two-state variable-noise channel with 4PSK modulation. These results demonstrate significant improvement over conventional schemes which use interleaving and memoryless channel encoding, and the improvement is most pronounced on quasistatic channels. This result is intuitive, since the longer the FSMC stays in a given state, the more accurately the state estimator will predict that state. Finally, we present results for the decoder performance relative to the average fade duration; as expected, the performance improves as the average fade duration decreases.

APPENDIX I

In this Appendix, we derive the recursive formula (11) for π_n. First, we have

p(S_n | x^n, y^n)
 (a) = p(x^n, y^n, S_n) / p(x^n, y^n)
 (b) = p(x_n, y_n | S_n, x^{n-1}, y^{n-1}) p(S_n, x^{n-1}, y^{n-1}) / p(x^n, y^n)
 (c) = p(y_n | S_n, x_n) p(x_n | x^{n-1}) p(S_n, x^{n-1}, y^{n-1}) / p(x^n, y^n)
 (d) = p(y_n | S_n, x_n) p(x_n | x^{n-1}) p(S_n | x^{n-1}, y^{n-1}) p(x^{n-1}, y^{n-1}) / p(x^n, y^n)   (55)

where a, b, and d follow from Bayes' rule, and c follows from (5). Moreover,

p(x^n, y^n) = Σ_{k∈K} p(x^n, y^n, S_n = c_k)
 = Σ_{k∈K} p(x_n, y_n | S_n = c_k, x^{n-1}, y^{n-1}) p(S_n = c_k, x^{n-1}, y^{n-1})
 = Σ_{k∈K} p(y_n | S_n = c_k, x^n, y^{n-1}) p(x_n | S_n, x^{n-1}, y^{n-1}) p(S_n = c_k, x^{n-1}, y^{n-1})
 = Σ_{k∈K} p(y_n | S_n = c_k, x_n) p(x_n | x^{n-1}) p(S_n = c_k | x^{n-1}, y^{n-1}) p(x^{n-1}, y^{n-1})   (56)

where we again use Bayes' rule and the last equality follows from (5). Substituting (56) in the denominator of (55), and canceling the common terms p(x_n | x^{n-1}) and p(x^{n-1}, y^{n-1}), yields

p(S_n | x^n, y^n) = p(y_n | S_n, x_n) p(S_n | x^{n-1}, y^{n-1}) / Σ_{k∈K} p(y_n | S_n = c_k, x_n) p(S_n = c_k | x^{n-1}, y^{n-1})   (57)

which, for a particular value of S_n, becomes

p(S_n = c_l | x^n, y^n) = p(y_n | S_n = c_l, x_n) p(S_n = c_l | x^{n-1}, y^{n-1}) / Σ_{k∈K} p(y_n | S_n = c_k, x_n) p(S_n = c_k | x^{n-1}, y^{n-1}).   (58)

Finally, from (4),

p(S_{n+1} = c_l | x^n, y^n) = Σ_{j∈K} p(S_n = c_j | x^n, y^n) P_{jl}.   (59)

Substituting (58) into (59) yields the desired result.

APPENDIX II

In this Appendix we show that for i.i.d. inputs, π_n and ρ_n are Markov chains that converge in distribution to a limit which is independent of the initial channel state, and that the resulting limit distributions are continuous functions of the input distribution p(x). We also show that the Markov property does not hold for Markov inputs.

We begin by showing the Markov property for independent inputs.

Lemma A2.1: For independent inputs, π_n is a Markov chain.

Proof:

p(π_{n+1} | π_n, ..., π_0) = Σ_{x_n, y_n} p(π_{n+1} | π_n, ..., π_0, x_n, y_n) p(x_n, y_n | π_n, ..., π_0)
 = Σ_{x_n, y_n} p(π_{n+1} | π_n, x_n, y_n) p(x_n, y_n | π_n)
 = p(π_{n+1} | π_n)   (60)

where the second equality follows from (11) and (6). Thus π_n is Markov. A similar argument using (13) and (8) shows that ρ_n is also Markov for independent inputs. □

To obtain the weak convergence of π_n and ρ_n, we also assume that the channel inputs are i.i.d., since we can then apply convergence results for partially observed Markov chains [21]. Consider the new stochastic process U_n = (S_n, y_n, x_n) defined on the state space U = C × Y × X. Since S_n is stationary and ergodic and x_n is i.i.d., U_n is stationary and ergodic. It is easily checked that U_n is Markov.

Let (S, y, x)_j denote the jth element of U, and J = |U|. To specify its individual components, we use the notation

(S(j), y(j), x(j)) = (S, y, x)_j.

The J × J probability transition matrix for U, P_U, is

P_U(k, j) = p[(S_{n+1}, y_{n+1}, x_{n+1}) = (S, y, x)_j | (S_n, y_n, x_n) = (S, y, x)_k]   (61)

independent of n. The initial distribution of U, π_0^U, is given by

p(S_0 = c_k, y_0 = y, x_0 = x) = π_0(k) p_k(y_0 | x_0) p(x_0).   (62)

Let g_{y,x}: U → Y × X and g_y: U → Y be the projections

g_{y,x}(S_n, y_n, x_n) = (y_n, x_n)

and

g_y(S_n, y_n, x_n) = (y_n).

These projections form the new processes W_n = g_{y,x}[U_n] and V_n = g_y[U_n]. We regard W_n and V_n as partial observations of the Markov chain U_n; the pairs (U_n, W_n) and (U_n, V_n) are referred to as partially observed Markov chains. The distribution of U_n conditioned on W^n and V^n, respectively, is

π_n^U = (π_n^U(1), ..., π_n^U(J))

and

ρ_n^U = (ρ_n^U(1), ..., ρ_n^U(J))

where

π_n^U(j) = p(U_n = (S, y, x)_j | W^n)   (63)

and

ρ_n^U(j) = p(U_n = (S, y, x)_j | V^n).   (64)

Note that

π_n^U(j) = p(U_n = (S(j), y(j), x(j)) | W^n)
 = π_n(k) I[x_n = x(j), y_n = y(j)]   (65)

where S(j) = c_k. Thus if π_n^U converges in distribution, π_n must also converge in distribution. Similarly, ρ_n converges in distribution if ρ_n^U does.

We will use the following definition for subrectangular matrices in the subsequent theorem.

Definition: Let D = (D_{ij}) denote a d × d matrix. If D_{i1,j1} ≠ 0 and D_{i2,j2} ≠ 0 implies that also D_{i1,j2} ≠ 0 and D_{i2,j1} ≠ 0, then D is called a subrectangular matrix.

We can now state the convergence theorem, due to Kaijser [21], for the distribution of a Markov chain conditioned on partial observations.

Theorem A2.1: Let U_n be a stationary and ergodic Markov chain with transition matrix P_U and state space U. Let g be a function with domain U and range Z. Define a new process Z_n = g(U_n). For z ∈ Z and U(j) the jth element of U, define the matrix M(z) by

M_{ij}(z) = P_U(i, j) I[g(U(j)) = z].

Suppose that P_U and g are such that there exists a finite sequence z_1, ..., z_m of elements in Z that yield a nonzero subrectangular matrix for the matrix product M(z_1)···M(z_m). Then p(U_n | Z^n) converges in distribution and moreover the limit distribution is independent of the initial distribution of U.

We first apply this theorem to π_n^U.

Assumption 1: Assume that there exists a finite sequence (y_n, x_n), n = 1, ..., m, such that the matrix product M(y_1, x_1)···M(y_m, x_m) is nonzero and subrectangular, where M(y, x) is defined as in Theorem A2.1 with g = g_{y,x}.

Then by Theorem A2.1, π_n^U converges in distribution to a limit which is independent of the initial distribution. By (65), this implies that π_n also converges in distribution, and its limit distribution is independent of π_0. We thus get the following lemma, which was stated in (15).

Lemma A2.2: For any bounded continuous function f, the following limits exist and are equal for all i:

lim_{n→∞} E[f(π_n)] = lim_{n→∞} E[f(π_n^i)].   (68)

The subrectangularity condition on M is satisfied if for some input x ∈ X there exists a y ∈ Y such that p_k(y | x) > 0 for all k. It is also satisfied if all the elements of the matrix P are nonzero.

From (11) and (12), the limit distribution of π_n is a function of the i.i.d. input distribution. Let P(X) denote the set of all possible distributions on X. The following lemma, proved below, shows that the limit distribution of π_n is continuous on P(X).

Lemma A2.3: Let μ^θ denote the limit distribution of π_n as a function of the i.i.d. distribution θ ∈ P(X). Then μ^θ is a continuous function of θ, i.e., θ_m → θ implies that μ^{θ_m} → μ^θ.

We now consider the convergence and continuity of the distribution for ρ_n. Define the matrix N by

N_{ij}(y) = P_U(i, j) if g_y[(S, y, x)_j] = y, and N_{ij}(y) = 0 otherwise   (69)

and note that for any y ∈ Y and x ∈ X

M_{ij}(y, x) = N_{ij}(y) I(x(j) = x).   (70)

To apply Theorem A2.1 to ρ_n^U, we must find a sequence y_1, ..., y_l which yields a nonzero and subrectangular matrix for the product N(y_1)···N(y_l). Consider the projection onto Y of the sequence (y_n, x_n), n = 1, ..., m, from Assumption 1. Let ŷ_n, n = 1, ..., m denote this projection. Using (70) and the fact that all the elements of M are nonnegative, it is easily shown that for M ≜ M(y_1, x_1)···M(y_m, x_m) and N ≜ N(ŷ_1)···N(ŷ_m), if for any i and j, M_{i,j} is nonnegative, then N_{i,j} is nonnegative also. From this we deduce that if M is nonzero and subrectangular, then N must also be nonzero and subrectangular.

We can now apply Theorem A2.1 to ρ_n^U, which yields the convergence in distribution of ρ_n^U and thus ρ_n. Moreover, the limit distributions of these random vectors are independent of their initial states. Thus we get the following result, which was stated in (16).

Lemma A2.4: For any bounded continuous function f, the following limits exist and are equal for all i:

lim_{n→∞} E[f(ρ_n)] = lim_{n→∞} E[f(ρ_n^i)].   (71)

From (13) and (14), the limit distribution of ρ_n is also a function of the input distribution. The following lemma shows that the limit distribution of ρ_n is continuous on P(X).

Lemma A2.5: Let ν^θ denote the limit distribution of ρ_n as a function of the i.i.d. distribution θ ∈ P(X). Then ν^θ is a continuous function of θ, so θ_m → θ implies that ν^{θ_m} → ν^θ.

Proof of Lemmas A2.3 and A2.5: We must show that for all θ_m, θ ∈ P(X), if θ_m → θ, then μ^{θ_m} → μ^θ and ν^{θ_m} → ν^θ. We first show the convergence of ν^{θ_m}. From [12, p. 346], in order to show that ν^{θ_m} → ν^θ, it suffices to show that {ν^{θ_m}} is a tight sequence of probability measures,⁵ and that any subsequence of ν^{θ_m} which converges weakly converges to ν^θ.

Tightness of the sequence {ν^{θ_m}} follows from the fact that Δ is a compact set. Now suppose there is a subsequence ν^{θ_k} which converges weakly to ψ. We must show that ψ = ν^θ, where ν^θ is the unique invariant distribution for ρ under the transformation (14) with input distribution p(x) = θ. Thus it suffices to show that for every bounded, continuous, real-valued function φ on Δ,

∫_Δ φ(α) ψ(dα) = ∫_Δ ∫_Δ φ(α) p^θ(dα | β) ψ(dβ)   (72)

where p^θ(α | β) = p(ρ_{n+1} = α | ρ_n = β) is given by (14) under the i.i.d. input distribution θ, and is thus independent of n. Applying the triangle inequality we get that for any k

⁵A sequence of probability measures {ν_m} is tight if for all ε > 0 there exists a compact set K such that ν(K) > 1 − ε for all ν ∈ {ν_m}.


the difference between the two sides of (72) is bounded by three terms (73)-(75). Since this inequality holds for all k, in order to show (72), we need only show that the three terms (73)-(75) all converge to zero as k → ∞. But (73) converges to zero since ν^{θ_k} converges weakly to ψ. Moreover, (74) equals zero for all k, since ν^{θ_k} is the invariant ρ distribution under the transformation (14) with input distribution θ_k. Substituting (14) for p^θ(α | β) in (75) yields, for each y, the terms

∫_Δ |φ(f^{θ_k}(y, β)) p^{θ_k}(y | β) − φ(f^θ(y, β)) p^θ(y | β)| ν^{θ_k}(dβ)   (79)

and

|∫_Δ φ(f^θ(y, β)) p^θ(y | β) ν^{θ_k}(dβ) − ∫_Δ φ(f^θ(y, β)) p^θ(y | β) ψ(dβ)|   (80)

where f^θ is given by (13) with p(x) = θ.

But for any fixed y and β, θ_k → θ implies that f^{θ_k}(y, β) → f^θ(y, β), since from (13), the numerator and denominator of f are linear functions of θ, and the denominator is nonzero. Similarly, θ_k → θ implies that for fixed y and β, p^{θ_k}(y | β) → p^θ(y | β), since p^θ(y | β) is linear in θ. Since φ is continuous, this implies that for fixed y and β, φ(f^{θ_k}(y, β)) p^{θ_k}(y | β) → φ(f^θ(y, β)) p^θ(y | β). Thus for any ε we can find k sufficiently large such that

∫_Δ |φ(f^{θ_k}(y, β)) p^{θ_k}(y | β) − φ(f^θ(y, β)) p^θ(y | β)| ν^{θ_k}(dβ) ≤ ∫_Δ ε ν^{θ_k}(dβ) = ε.   (81)

So (79) converges to zero. Finally, for fixed y and θ, f^θ(y, β) and p^θ(y | β) are linear in β, so φ(f^θ(y, β)) p^θ(y | β) is a bounded continuous function of β. Thus (80) converges to zero by the weak convergence of ν^{θ_k} to ψ [12, Theorem 25.8]. □

Since the {μ^{θ_m}} sequence is also tight, the proof that μ^{θ_m} → μ^θ follows if the limit of any convergent subsequence of {μ^{θ_m}} is the invariant distribution for π under (12). This is shown with essentially the same argument as above for ν^θ, using (12) instead of (14) for p(α | β), p^θ(y | x, β) instead of p^θ(y | β), and summations over X × Y instead of Y. The details are omitted.

Lemma A2.6: In general, the Markov property does not hold for π_n under Markov inputs.

Proof: We show this using a counterexample. Let C = {c_1, c_2, c_3} be the state space for S_n, with transition probability matrix P as in (82) and initial distribution π_0 = (1/3, 1/3, 1/3). This Markov chain is irreducible, aperiodic, and stationary. Each of the states corresponds to a memoryless channel, where the input alphabet is {0, 1} and the output alphabet is {0, 1, 2}. The memoryless channels c_1, c_2, and c_3 are defined as follows:

c_1: p_1(0 | 0) = p_1(2 | 1) = 1, otherwise p_1(y | x) = 0.
c_2: p_2(1 | 0) = p_2(2 | 1) = 1, otherwise p_2(y | x) = 0.
c_3: p_3(2 | 0) = 1, p_3(0 | 1) = p_3(1 | 1) = 1/2, otherwise p_3(y | x) = 0.

The stochastic process {π_n}_{n=0}^∞ then takes values on the three points α_0 = (1/3, 1/3, 1/3), α_1 = (2/3, 0, 1/3), and α_2 = (0, 2/3, 1/3).

Let the Markov input distribution be given by p(x_0 = 0) = p(x_0 = 1) = 1/2 and p(x_n = x_{n-1}) = 1 for n > 0. Then

p(π_3 = α_0 | π_2 = α_0, π_1 = α_1) = 1/3

while

p(π_3 = α_0 | π_2 = α_0) = 5/6.

So {π_n}_{n=0}^∞ is not a Markov process. □

Lemma A2.7: In general, the Markov property does not hold for ρ_n under Markov inputs.

Proof: We prove this using a counterexample similar to that of Lemma A2.6. Let the FSMC be as in Lemma A2.6 with the following change in the definition of the memoryless channels c_1, c_2, and c_3:

c_1: p_1(1 | 0) = p_1(1 | 1) = 1, otherwise p_1(y | x) = 0.
c_2: p_2(2 | 0) = p_2(2 | 1) = 1, otherwise p_2(y | x) = 0.
c_3: p_3(0 | 0) = p_3(2 | 0) = 1/2, p_3(0 | 1) = 1/4, p_3(2 | 1) = 3/4, otherwise p_3(y | x) = 0.

It is easily shown that the state space for the stochastic process {ρ_n}_{n=0}^∞ includes the points α_0 and α_1 defined in Lemma A2.6. Using the same Markov input distribution defined there, we have

p(ρ_3 = α_0 | ρ_2 = α_0, ρ_1 = α_1) = 5/36

while

p(ρ_3 = α_0 | ρ_2 = α_0) = 8/57.

So {ρ_n}_{n=0}^∞ is not a Markov process. □

APPENDIX III

In this Appendix, we prove Lemma 4.1. Consider first H(Y_n | X_n, X^{n-1}, Y^{n-1}). We have

H(Y_n | X_n, X^{n-1}, Y^{n-1})
 = E[−log p(y_n | x_n, x^{n-1}, y^{n-1})]
 = E[−log Σ_{k=1}^K p(y_n | x_n, S_n = c_k) p(S_n = c_k | x^{n-1}, y^{n-1})]
 = E[−log Σ_{k=1}^K p_k(y_n | x_n) π_n(k)]
 = E[−log p(y_n | x_n, π_n)]
 = H(Y_n | X_n, π_n).   (83)

The argument that H(Y_n | Y^{n-1}) = H(Y_n | ρ_n) is the same, with all the x terms removed and π_n replaced by ρ_n. □

APPENDIX IV

In this Appendix, we prove Lemmas 4.2-4.6.

Proof of Lemma 4.2: We first note that the conditional entropy H(W | V) = E[−log p(w | v)], where the log function is concave on [0, 1]. To show the first inequality in (25), let f denote any concave function. Then

E f(p[y_n | x_n, x^{n-1}, y^{n-1}])
 (a) = E f(p[y_{n+1} | x_{n+1}, x_2^n, y_2^n])
 (b) = E f(E(p[y_{n+1} | x_{n+1}, x^n, y^n] | x_{n+1}, x_2^n, y_2^n))
 (c) ≤ E E(f(p[y_{n+1} | x_{n+1}, x^n, y^n]) | x_{n+1}, x_2^n, y_2^n)
 (d) = E f(p[y_{n+1} | x_{n+1}, x^n, y^n])   (84)

where a follows from the stationarity of the channel and the inputs, b and d follow from properties of conditional expectation [12], and c is a consequence of Jensen's inequality.

The second inequality in (25) results from the fact that conditioning on an additional random variable, in this case the initial state S_0, always reduces the entropy [14]. The proof of the third inequality in (25) is similar to that of the first, giving the chain (85), where a and d follow from properties of conditional expectation, b follows from (4) and (5), c follows from Jensen's inequality, and e follows from the channel and input stationarity. □

Proof of Lemma 4.3: From Lemma 4.1,

H(Y_n | X_n, X^{n-1}, Y^{n-1}) = E[−log Σ_{k=1}^K p_k(y_n | x_n) π_n(k)].   (86)

Similarly,

H(Y_n | X_n, X^{n-1}, Y^{n-1}, S_0 = c_i) = E[−log Σ_{k=1}^K p_k(y_n | x_n) π_n^i(k)]   (87)

where π_n^i is π_n conditioned on S_0 = c_i. Applying (15) to (86) and (87) completes the proof. □

Proof of Lemma 4.4: The proof of this lemma is similar to that of Lemma 4.2 above. For the first inequality in (27), we have

E f(p[y_n | y^{n-1}])
 (a) = E f(p[y_{n+1} | y_2^n])
 (b) = E f(E(p[y_{n+1} | y^n] | y_2^n))
 (c) ≤ E E(f(p[y_{n+1} | y^n]) | y_2^n)
 (d) = E f(p[y_{n+1} | y^n])   (88)

where a follows from the stationarity of the inputs and channel, b and d follow from properties of conditional expectation [12], and c is a consequence of Jensen's inequality.

The second inequality results from the fact that conditioning on an additional random variable reduces entropy. Finally, for the third inequality, we have

E f(p[y_{n+1} | y^n, S_0])
 (a) = E f(E(p[y_{n+1} | y^n, S_1] | y^n, S_0))
 (b) = E f(E(p[y_{n+1} | y_2^n, S_1] | y^n, S_0))
 (c) ≤ E E(f(p[y_{n+1} | y_2^n, S_1]) | y^n, S_0)
 (d) = E f(p[y_{n+1} | y_2^n, S_1])
 (e) = E f(p[y_n | y^{n-1}, S_0])   (89)

where a and d follow from properties of conditional expectation, b follows from (6), c follows from Jensen's inequality, and e follows from the channel and input stationarity. □

Proof of Lemma 4.5: Following a similar argument as in the proof of Lemma 4.3, we have (90) and, similarly, (91), where ρ_n^i = ρ_n for some i. Applying (16) to (90) and (91) completes the proof. □

Proof of Lemma 4.6: We first consider the limiting conditional entropy H(Y_n | ρ_n^θ) as n → ∞. Let ν_n^θ denote the distribution of ρ_n^θ and ν^θ denote the corresponding limit distribution. Also, let p^θ(y | ·) explicitly denote the dependence of the (conditional) output probability on θ. Then lim_{n→∞} H(Y_n | ρ_n^θ) is evaluated through the string of equalities (92), in which the entropy is written as a sum over y^{n-1} ∈ Y^{n-1} and x^{n-1} ∈ X^{n-1} of terms involving p^θ(y_n | y^{n-1}) and p^θ(y^{n-1}, x^{n-1}). The second and fourth equalities in (92) follow from the fact that ρ_n is a function of y^{n-1}. We also use this in the fifth equality to take expectations relative to ρ_n instead of y^{n-1}. The sixth equality follows from the definition of ν_n^θ and the stationarity of the channel inputs. The last equality follows from the weak convergence of ρ_n^θ and the fact that the entropy is continuous in ρ and is bounded by log |Y| [12, Theorem 25.8].

The limiting conditional entropy H(Y_n | X_n, π_n) is obtained with a similar argument, with the distribution of π_n^θ and its limit distribution defined analogously. Then

lim_{n→∞} H(Y_n | X_n, π_n^θ) = E[ Σ_{y∈Y} Σ_{x∈X} −p(y | x, π^θ) log p(y | x, π^θ) θ(x) ]   (93)

where we use the fact that π_n is a function of x^{n-1} and y^{n-1}, and the last equality follows from the weak convergence of π_n^θ to π^θ. □
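Lemma 4.1 is the statement that the conditional state distribution π_n summarizes the past, so that the channel seen at time n is the mixture p(y_n | x_n, π_n) = Σ_k p_k(y_n | x_n) π_n(k) of (83). The sketch below propagates π_n with a standard HMM-style forward update (assumed here to have the form of the paper's recursion (13), which is not reproduced in this section) and estimates H(Y_n | X_n, π_n) by Monte Carlo; the two-state channel parameters and the uniform i.i.d. input are hypothetical.

# Sketch: pi_n = p(S_n | x^{n-1}, y^{n-1}) as a sufficient statistic, and the
# mixture output probability p(y | x, pi) used in (83).  The forward update is
# the standard HMM filter, assumed to match the form of recursion (13); all
# numerical parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.95, 0.05],          # hypothetical state transition matrix
              [0.10, 0.90]])
pk = np.array([                      # pk[k, x, y]: hypothetical per-state DMCs
    [[0.99, 0.01], [0.01, 0.99]],    # "good" BSC
    [[0.70, 0.30], [0.30, 0.70]],    # "bad" BSC
])
K = P.shape[0]

def filter_update(pi, x, y):
    """One step: pi_{n+1}(j) proportional to sum_k pi(k) p_k(y | x) P[k, j]."""
    w = pi * pk[:, x, y]             # weight each state by how well it explains (x, y)
    return (w @ P) / w.sum()

def mixture(pi, x):
    """p(y | x, pi) = sum_k p_k(y | x) pi(k), the channel seen at time n."""
    return pi @ pk[:, x, :]

# Monte Carlo estimate of H(Y_n | X_n, pi_n) for i.i.d. uniform inputs.
n_steps, H = 20000, 0.0
s = rng.integers(K)
pi = np.full(K, 1.0 / K)
for _ in range(n_steps):
    x = rng.integers(2)                                  # uniform i.i.d. input
    y = rng.choice(2, p=pk[s, x])                        # channel output
    H += -np.log2(mixture(pi, x)[y]) / n_steps           # -log p(y | x, pi_n)
    pi = filter_update(pi, x, y)                         # pi_{n+1}
    s = rng.choice(K, p=P[s])                            # next channel state
print("H(Y | X, pi) estimate (bits):", H)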


APPENDIX V

In this Appendix, we prove Lemmas 5.1 and 5.2.

Proof of Lemma 5.1: From [14],

H(Y_n | ρ_n) ≤ H(Y_n) ≤ log |Y|

and similarly

H(Y_n | ρ_n^k) ≤ H(Y_n) ≤ log |Y|

for any k. But since each c_k ∈ C is output-symmetric, for each k the columns of M_k ≜ {M_{ij}^k = p_k(j | i), i ∈ X, j ∈ Y} are permutations of each other. Thus, if the marginal p(x_n) is uniform, then p(y_n | S_n = c_k) is also uniform, i.e., p(y_n | S_n = c_k) = 1/|Y|. Hence for any ρ_n ∈ A,

p(y_n | ρ_n) = Σ_{k=1}^K p(y_n | S_n = c_k) ρ_n(k) = 1/|Y|

and similarly p(y_n | ρ_n^i) = 1/|Y| for any i. Thus

H(Y_n | ρ_n) = log |Y|   (95)

and similarly H(Y_n | ρ_n^i) = log |Y| for any i. Since (95) only requires that p(x_n) be uniform for each n, an i.i.d. uniform input distribution achieves this maximum. Substituting π for ρ in the above argument yields the result for H(Y_n | π_n) and H(Y_n | π_n^i). □

Proof of Lemma 5.2: We consider only H(Y_n | X_n, π_n), since the same argument applies for H(Y_n | X_n, π_n^i). By the output symmetry of each c_k ∈ C, the corresponding sets of conditional output probabilities are permutations of each other. Thus H(Y_n | X_n, π_n) depends only on the distribution of π_n. But by (38), this distribution depends only on the distribution of Z^{n-1}. The proof then follows from the fact that p(Z^n | X^n) = p(Z^n). □

APPENDIX VI

We consider a Q-AWN channel where the output is quantized to the nearest input symbol and the input alphabet consists of symmetric PSK symbols. We want to show that for any k, P^k ≜ {p_k(y = j | x = i)} has rows which are permutations of each other and columns which are permutations of each other. The input/output symbols are given by

x_m = A e^{j2πm/M},  m = 1, ..., M.   (97)

Define the M × M matrix Z by Z_{ij} = |y_i − x_j| and let f_{n_k}(Z_{ij}) denote the distribution of the quantized noise, which is determined by the noise density n_k and the values of A and M from (97). By symmetry of the input/output symbols and the noise, the rows of Z are permutations of each other, and the columns are also permutations of each other. If M is odd, then (98) holds, and if M is even, (99) holds. Thus P_{ij}^k depends only on the value of Z_{ij}; the rows of P^k are therefore permutations of each other, and so are the columns.
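The symmetry argument of Appendix VI can be checked numerically: build the M-PSK constellation of (97), form Z_{ij} = |y_i − x_j|, and verify that its rows and columns are permutations of each other; a transition matrix whose entries depend only on the distance inherits the same structure. In the sketch below, the complex Gaussian noise and the Monte Carlo nearest-symbol quantizer are illustrative assumptions only (the appendix allows a general noise density n_k), and the constellation size, amplitude, and noise level are hypothetical.

# Numerical check of the Appendix VI symmetry argument for M-PSK with
# nearest-symbol quantization.  Gaussian noise is an illustrative assumption.
import numpy as np

M, A, sigma = 8, 1.0, 0.4            # hypothetical constellation size and noise level
rng = np.random.default_rng(1)
sym = A * np.exp(2j * np.pi * np.arange(M) / M)   # x_m from (97); y_i on the same grid

Z = np.abs(sym[:, None] - sym[None, :])           # Z_ij = |y_i - x_j|

def rows_cols_permutable(mat, tol=1e-9):
    """True if every row is a permutation of the first row, and likewise for columns."""
    r0 = np.sort(np.round(mat[0], 9))
    c0 = np.sort(np.round(mat[:, 0], 9))
    rows = all(np.allclose(np.sort(np.round(r, 9)), r0, atol=tol) for r in mat)
    cols = all(np.allclose(np.sort(np.round(c, 9)), c0, atol=tol) for c in mat.T)
    return rows and cols

print("Z row/column-permutable:", rows_cols_permutable(Z))

# Transition matrix P[i, j] = p(y = sym_j | x = sym_i) under additive complex
# Gaussian noise with nearest-symbol quantization (Monte Carlo estimate).
n_samples = 200_000
Pq = np.zeros((M, M))
for i in range(M):
    y = sym[i] + sigma * (rng.standard_normal(n_samples) + 1j * rng.standard_normal(n_samples))
    nearest = np.argmin(np.abs(y[:, None] - sym[None, :]), axis=1)
    Pq[i] = np.bincount(nearest, minlength=M) / n_samples

# Monte Carlo noise makes the rows only approximately permutable; compare sorted rows.
print("sorted rows nearly equal:",
      np.allclose(np.sort(Pq, axis=1), np.sort(Pq, axis=1)[0], atol=5e-3))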
APPENDIX VII

We will show that the π-output channel is asymptotically memoryless as J → ∞. Indeed, since the FSMC is indecomposable and stationary,

lim_{J→∞} p(S_{n+J}, S_n) = lim_{J→∞} p(S_{n+J}) p(S_n)   (100)

for any n, and thus also (101). Therefore, since π_{jl} and π_{j(l-1)} are J iterations apart, π_{jl} and π_{j(l-1)} are asymptotically independent as J → ∞.

In order to show that the π-output channel is memoryless, we must show that for any j and L

p(y^{jL}, π^{jL} | x^{jL}) = ∏_{l=1}^{L} p(y_{jl}, π_{jl} | x_{jl}).   (102)

We can decompose p(y^{jL}, π^{jL} | x^{jL}) as in (103).


Thus we need only show that the lth factor in the right-hand side of (103) equals p(y_{jl}, π_{jl} | x_{jl}) in the limit as J → ∞. This result is proved in the following lemma.

Lemma A7.1: For asymptotically large J,

p(y_{jl}, π_{jl} | x_{jl}, y^{j(l-1)}, π^{j(l-1)}, x^{j(l-1)}) = p(y_{jl}, π_{jl} | x_{jl}).   (104)

This follows from a string of equalities in which the second equality follows from (4) and (5), the third equality follows from (4) and (11), and the fourth equality follows from (101) in the asymptotic limit of deep interleaving.

APPENDIX VIII

The π-output channels are independent if

p(y^J, π^J | x^J) = ∏_{j=1}^J p(y_j, π_j | x_j).

This is shown in the following string of equalities:

p(y^J, π^J | x^J)
  = ∏_{j=1}^J p(y_j, π_j | x_j, y^{j-1}, π^{j-1}, x^{j-1})
  = ∏_{j=1}^J p(y_j | π_j, x_j, y^{j-1}, π^{j-1}, x^{j-1}) · p(π_j | x_j, y^{j-1}, π^{j-1}, x^{j-1})
  = ∏_{j=1}^J p(y_j | π_j, x_j) p(π_j | x_j, y^{j-1}, π^{j-1}, x^{j-1})
  = ∏_{j=1}^J p(y_j | π_j, x_j) p(π_j)   (107)

where the third equality follows from (5) and the last equality follows from the fact that we ignore error propagation, so x^{j-1}, y^{j-1}, and π^{j-1} are all known constants at time j.

We now determine the average mutual information of the parallel π-output channels for a fixed input distribution p(X^J). The average mutual information of the parallel set is

I_J ≜ (1/J) I(Y^J, π^J; X^J).   (108)

From above, the parallel channels are independent, and each channel is memoryless with asymptotically deep interleaving. Thus we obtain (45) as follows:

(1/J) I(Y^J, π^J; X^J)
  = (1/J) [H(Y^J, π^J) − H(Y^J, π^J | X^J)]
  = (1/J) [H(Y^J | π^J) + H(π^J) − (H(Y^J | π^J, X^J) + H(π^J | X^J))]
  = (1/J) [H(Y^J | π^J) − H(Y^J | π^J, X^J)]
  = (1/J) Σ_{j=1}^J [H(Y_j | π_j) − H(Y_j | π_j, X_j)]   (109)

where the third equality follows from the fact that

p(π^J | x^J) = p(π^J | x^{J-1}) = p(π^J)

by definition of π and by the memoryless property of the π_j channels. The last equality follows from the fact that

H(Y_j | Y^{j-1}, π^J) = H(Y_j | ρ_j, π_j) = H(Y_j | π_j)   (110)

since the π_j channels are memoryless and ρ_j = E_{x^{j-1}} π_j.
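Equation (109) reduces the average mutual information of the interleaved system to the per-symbol terms H(Y_j | π_j) − H(Y_j | π_j, X_j). The sketch below evaluates these two entropies for a single π-output channel whose state π takes finitely many values; the per-state channels, the support and distribution of π, and the input distribution are all hypothetical.

# Sketch: the per-symbol term H(Y | pi) - H(Y | pi, X) from (109) for a single
# pi-output channel with finitely many pi values.  All numbers are hypothetical.
import numpy as np

pk = np.array([                      # pk[k, x, y]: hypothetical per-state DMCs
    [[0.98, 0.02], [0.02, 0.98]],
    [[0.60, 0.40], [0.40, 0.60]],
])
pi_points = np.array([[0.9, 0.1],    # hypothetical support of pi ...
                      [0.2, 0.8]])
pi_probs = np.array([0.7, 0.3])      # ... and its distribution
q = np.array([0.5, 0.5])             # input distribution p(x)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_y_given_pi = 0.0          # H(Y | pi)    = E_pi[ H( sum_x q(x) p(y | x, pi) ) ]
H_y_given_pi_x = 0.0        # H(Y | pi, X) = E_pi[ sum_x q(x) H( p(y | x, pi) ) ]
for pi, w in zip(pi_points, pi_probs):
    p_y_given_x = np.einsum('k,kxy->xy', pi, pk)   # p(y | x, pi) = sum_k pi(k) p_k(y | x)
    H_y_given_pi += w * H(q @ p_y_given_x)
    H_y_given_pi_x += w * sum(q[x] * H(p_y_given_x[x]) for x in range(len(q)))

print("I(X; Y | pi) =", H_y_given_pi - H_y_given_pi_x, "bits/use")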

APPENDIX IX

In this Appendix we examine the cutoff rate for uniformly symmetric variable-noise channels. The first three lemmas show that for these channels, the maximizing distribution of (52) is uniform and i.i.d. We then determine that R_j, as given by (52), is monotonically increasing in j, and use this to get a simplified formula for R_df in terms of the limiting value of R_j.

Lemma A9.1: For all j, R_j depends only on p(x_j).

Proof: From the proof of Lemma 5.2, π_j is a function of Z^{j-1}, and is independent of X^{j-1}. So p(π_j) does not depend on the input distribution. The result then follows from the definition of R_j. □

Corollary: An independent input distribution achieves the maximum of R_df.

Lemma A9.2: For a fixed input distribution p(X^J), the J corresponding π-output channels are all symmetric [13, p. 94].

Proof: We must show that for any j ≤ J, the set of outputs for the jth π-output channel can be partitioned into subsets such that the corresponding submatrices of transition probabilities have rows which are permutations of each other and columns which are permutations of each other. We will call such a matrix row/column-permutable.

Let n_j ≤ |X|^j |Y|^j be the number of points δ ∈ A with p(π_j = δ) > 0, and let {δ_i}_{i=1}^{n_j} explicitly denote this set. Then we can partition the output into n_j sets, where the ith set consists of the pairs {(y, δ_i): y ∈ Y}. We want to show that the transition probability matrix associated with each of these output partitions is row/column-permutable, i.e., that for all i, 1 ≤ i ≤ n_j, the |X| × |Y| matrix

P_i ≜ p(y_j = y, π_j = δ_i | x_j = x),  x ∈ X, y ∈ Y   (111)

has rows which are permutations of each other, and columns which are permutations of each other.


Since the FSMC is a variable-noise channel, there is a function f such that p_k(y | x) depends only on z = f(x, y) for all k, 1 ≤ k ≤ K. Therefore, if for some k′, p_{k′}(y | x) = p_{k′}(y′ | x′), then f(x, y) = f(x′, y′). But since z = f(x, y) is the same for all k, this implies that

p_k(y | x) = p_k(y′ | x′)  for all k, 1 ≤ k ≤ K.   (112)

Fix k′. Then by definition of uniform symmetry, p_{k′}(y | x) is row/column-permutable. Using (112), we get that the |X| × |Y| matrix

P^c ≜ Σ_{k=1}^K p_k(y | x),  x ∈ X, y ∈ Y   (113)

is also row/column-permutable. Moreover, multiplying a matrix by any constant will not change the permutability of its rows and columns, hence the matrix

P_i^c ≜ Σ_{k=1}^K p_k(y | x) δ_i(k) p(π_j = δ_i),  x ∈ X, y ∈ Y   (114)

is also row/column-permutable. But this completes the proof, since

p(y_j = y, π_j = δ_i | x_j = x) = Σ_{k=1}^K p_k(y_j = y | x_j = x) δ_i(k) p(π_j = δ_i).   (115)
□
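The structure behind (113)–(115) is that all of the per-state matrices p_k(y | x) depend on the same z = f(x, y), so any weighted sum of them keeps the same row/column permutation pattern. The sketch below builds such a family from a hypothetical common noise function f and hypothetical per-state noise laws g_k, forms the matrix of (114)–(115) for a hypothetical point δ_i, and checks permutability; none of these parameters come from the paper.

# Sketch of (113)-(115): per-state matrices sharing a common z = f(x, y) give a
# row/column-permutable weighted sum.  The alphabets, f, g_k, delta_i, and
# p(pi_j = delta_i) below are hypothetical.
import numpy as np

Xs, Ys = 4, 4

def f(x, y):
    return (y - x) % Ys                        # hypothetical common noise function z = f(x, y)

gk = np.array([[0.85, 0.05, 0.05, 0.05],       # hypothetical per-state distributions of z
               [0.55, 0.15, 0.15, 0.15],
               [0.25, 0.25, 0.25, 0.25]])
K = gk.shape[0]

# p_k(y | x) = g_k(f(x, y)): every row/column of each p_k is a permutation of g_k.
pk = np.array([[[gk[k, f(x, y)] for y in range(Ys)] for x in range(Xs)] for k in range(K)])

delta_i = np.array([0.5, 0.3, 0.2])            # hypothetical point delta_i in the simplex A
p_delta = 0.1                                  # hypothetical p(pi_j = delta_i)
P_i = np.einsum('k,kxy->xy', delta_i, pk) * p_delta   # the matrix of (114)/(115)

def rows_cols_permutable(mat):
    r0, c0 = np.sort(mat[0]), np.sort(mat[:, 0])
    return (all(np.allclose(np.sort(r), r0) for r in mat)
            and all(np.allclose(np.sort(c), c0) for c in mat.T))

print("each p_k permutable:", all(rows_cols_permutable(pk[k]) for k in range(K)))
print("P_i permutable:     ", rows_cols_permutable(P_i))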
Lemma A9.3: For i.i.d. uniform inputs, R_j is monotonically increasing in j.

Proof: For uniform i.i.d. inputs, R_j takes the form given in (116). Let f(π_j) be as defined in (117). Then R_j is a decreasing function of E f(π_j), so we want to show that R_{j+1} ≥ R_j or, equivalently, that E f(π_{j+1}) ≤ E f(π_j). Following an argument similar to that of Lemma 4.2, we have a string of relations establishing E f(π_{j+1}) ≤ E f(π_j), where a follows from stationarity and b follows from Jensen's inequality. □

Lemma A9.4: For uniformly symmetric variable-noise channels, a uniform i.i.d. input distribution maximizes R_df. Moreover, R_df is given by the limiting value of R_j.

Proof: From the Corollary to Lemma A9.1, the maximizing distribution for R_df is independent. Moreover, from Lemma A9.2, each of the π-output channels is symmetric; therefore, from [13, p. 144], a uniform distribution for p(X_j) maximizes R_j for all j, and therefore it maximizes R_df. By Lemma A9.3, R_j is monotonically increasing in j for i.i.d. uniform inputs. Finally, by Lemma A2.2, for f(π_j) as defined in (117), E f(π_j) converges to a limit which is independent of the initial channel state, and thus so does R_j. Therefore R_df equals the limiting value of R_j. □

ACKNOWLEDGMENT

The authors wish to thank V. Borkar for suggesting the proof of Lemma A2.3. They also wish to thank the reviewers for their helpful comments and suggestions, and for providing the counterexample of Lemma A2.6.


REFERENCES

[1] M. Mushkin and I. Bar-David, "Capacity and coding for the Gilbert-Elliot channels," IEEE Trans. Inform. Theory, vol. 35, no. 6, pp. 1277-1290, Nov. 1989.
[2] A. J. Goldsmith, "The capacity of time-varying multipath channels," Masters thesis, Dept. of Elec. Eng. Comput. Sci., Univ. of California at Berkeley, May 1991.
[3] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Channels. New York: Academic Press, 1981.
[4] A. J. Viterbi and J. K. Omura, Principles of Digital Communication and Coding. New York: McGraw-Hill, 1979.
[5] H. S. Wang and N. Moayeri, "Modeling, capacity, and joint source/channel coding for Rayleigh fading channels," Tech. Rep. WINLAB-TR-32, Wireless Information Network Lab., Rutgers Univ., New Brunswick, NJ, May 1992. Also "Finite-state Markov channel—A useful model for radio communication channels," IEEE Trans. Vehic. Technol., vol. 44, no. 1, pp. 163-171, Feb. 1995.
[6] K. Leeuwin-Boullé and J. C. Belfiore, "The cutoff rate of time correlated fading channels," IEEE Trans. Inform. Theory, vol. 39, no. 2, pp. 612-617, Mar. 1993.
[7] N. Seshadri and C.-E. W. Sundberg, "Coded modulations for fading channels—An overview," European Trans. Telecommun. Related Technol., vol. ET-4, no. 3, pp. 309-324, May-June 1993.
[8] L.-F. Wei, "Coded M-DPSK with built-in time diversity for fading channels," IEEE Trans. Inform. Theory, vol. 39, no. 6, pp. 1820-1839, Nov. 1993.
[9] D. Divsalar and M. K. Simon, "The design of trellis coded MPSK for fading channels: Set partitioning for optimum code design," IEEE Trans. Commun., vol. 36, no. 9, pp. 1013-1021, Sept. 1988.
[10] W. C. Dam and D. P. Taylor, "An adaptive maximum likelihood receiver for correlated Rayleigh-fading channels," IEEE Trans. Commun., vol. 42, no. 9, pp. 2684-2692, Sept. 1994.
[11] J. H. Lodge and M. L. Moher, "Maximum-likelihood sequence estimation of CPM signals transmitted over Rayleigh flat-fading channels," IEEE Trans. Commun., vol. 38, no. 6, pp. 787-794, June 1990.
[12] P. Billingsley, Probability and Measure. New York: Wiley, 1986.
[13] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[14] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[15] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. Englewood Cliffs, NJ: Prentice-Hall, 1986.
[16] J. G. Proakis, Digital Communications, 2nd ed. New York: McGraw-Hill, 1989.
[17] M. V. Eyuboglu, "Detection of coded modulation signals on linear, severely distorted channels using decision-feedback noise prediction with interleaving," IEEE Trans. Commun., vol. 36, no. 4, pp. 401-409, Apr. 1988.
[18] J. C. S. Cheung and R. Steele, "Soft-decision feedback equalizer for continuous phase modulated signals in wideband mobile radio channels," IEEE Trans. Commun., vol. 42, no. 2/3/4, pp. 1628-1638, Feb.-Apr. 1994.
[19] A. Duel-Hallen and C. Heegard, "Delayed decision-feedback sequence estimation," IEEE Trans. Commun., vol. 37, no. 5, pp. 428-436, May 1989.
[20] M. V. Eyuboglu and S. U. H. Qureshi, "Reduced-state sequence estimation with set partitioning and decision feedback," IEEE Trans. Commun., vol. 36, no. 1, pp. 13-20, Jan. 1988.
[21] T. Kaijser, "A limit theorem for partially observed Markov chains," Ann. Probab., vol. 3, no. 4, pp. 677-696, 1975.
