Abstract—Polar coding was conceived originally as a technique for boosting the cutoff rate of sequential decoding, along the lines of earlier schemes of Pinsker and Massey. The key idea in boosting the cutoff rate is to take a vector channel (either given or artificially built), split it into multiple correlated subchannels, and employ a separate sequential decoder on each subchannel. Polar coding was originally designed to be a low-complexity recursive channel combining and splitting operation of this type, to be used as the inner code in a concatenated scheme with outer convolutional coding and sequential decoding. However, the polar inner code turned out to be so effective that no outer code was actually needed to achieve the original aim of boosting the cutoff rate to channel capacity. This paper explains the cutoff rate considerations that motivated the development of polar coding.

Index Terms—Channel polarization, polar codes, cutoff rate, sequential decoding.

Manuscript received April 3, 2015; revised September 5, 2015; accepted October 23, 2015. Date of publication December 1, 2015; date of current version January 14, 2016. This work was supported by the FP7 Network of Excellence NEWCOM# under Grant 318306. The author is with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey (e-mail: arikan@ee.bilkent.edu.tr). Digital Object Identifier 10.1109/JSAC.2015.2504300

I. INTRODUCTION

THE most fundamental parameter regarding a communication channel is unquestionably its capacity C, a concept introduced by Shannon [1] that marks the highest rate at which information can be transmitted reliably over the channel. Unfortunately, Shannon's methods that established capacity as an achievable limit were non-constructive in nature, and the field of coding theory came into being with the agenda of turning Shannon's promise into practical reality. Progress in coding theory was very rapid initially, with the first two decades producing some of the most innovative ideas in that field, but no truly practical capacity-achieving coding scheme emerged in this early period. A satisfactory solution of the coding problem had to await the invention of turbo codes [2] in the 1990s. Today, there are several classes of capacity-achieving codes, among them a refined version of Gallager's LDPC codes from the 1960s [3]. (The fact that LDPC codes could approach capacity with feasible complexity was not realized until after their rediscovery in the mid-1990s.) A story of coding theory from its inception until the attainment of the major goal of the field can be found in the excellent survey article [4].

A recent addition to the class of capacity-achieving coding techniques is polar coding [5]. Polar coding was originally conceived as a method of boosting the channel cutoff rate R_0, a parameter that appears in two main roles in coding theory. First, in the context of random coding and maximum likelihood (ML) decoding, R_0 governs the pairwise error probability 2^{−N R_0}, which leads to the union bound

P_e < 2^{−N(R_0 − R)}    (1)

on the probability of ML decoding error P_e for a randomly selected code with 2^{NR} codewords. Second, in the context of sequential decoding, R_0 emerges as the "computational cutoff rate" beyond which sequential decoding (a decoding algorithm for convolutional codes) becomes computationally infeasible. While R_0 has a fundamental character in its role as an error exponent as in (1), its significance as the cutoff rate of sequential decoding is a fragile one. It has long been known that the cutoff rate of sequential decoding can be boosted by designing variants of sequential decoding that rely on various channel combining and splitting schemes to create correlated subchannels on which multiple sequential decoders are employed to achieve a sum cutoff rate that goes beyond the sum of the cutoff rates of the original memoryless channels used in the construction. An early scheme of this type is due to Pinsker [6], who used a concatenated coding scheme with an inner block code and an outer sequential decoder to get arbitrarily high reliability at constant complexity per bit at any rate R < C; however, this scheme was not practical. Massey [7] subsequently described a scheme to boost the cutoff rate by splitting a nonbinary erasure channel into correlated binary erasure channels. We will discuss both of these schemes in detail, developing the insights that motivated the formulation of polar coding as a practical scheme for boosting the cutoff rate to its ultimate limit, the channel capacity C.

The account of polar coding given here is not intended to be the shortest or the most direct introduction to the subject. Rather, the goal is to give a historical account, highlighting the ideas that were essential in the course of developing polar codes, but have fallen aside as these codes took their final form. On a personal note, my acquaintance with sequential decoding began in the 1980s during my doctoral work [8], which was about sequential decoding for multiple access channels. Early on in this period, I became aware of the "anomalous" behavior of the cutoff rate, as exemplified in the papers by Pinsker and Massey cited above, and the resolution of the paradox surrounding the boosting of the cutoff rate has been a central theme of my research over the years. Polar coding is the end result of such efforts.

The rest of this paper is organized as follows. We discuss the role of R_0 in the context of ML decoding of block codes in Section II and its role in the context of sequential decoding of tree codes in Section III. In Section IV, we discuss the two methods by Pinsker and Massey mentioned above for boosting the cutoff rate of sequential decoding. In Section V,
0733-8716 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
210 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 34, NO. 2, FEBRUARY 2016
we examine the successive-cancellation architecture as an alternative method for boosting the cutoff rate, and in Section VI introduce polar coding as a special instance of that architecture. The paper concludes in Section VII with a summary.

Throughout the paper, we use the notation W : X → Y to denote a discrete memoryless channel W with input alphabet X, output alphabet Y, and channel transition probabilities W(y|x) (the conditional probability that y ∈ Y is received given that x ∈ X is transmitted). We use the notation a^N to denote a vector (a_1, ..., a_N) and a_i^j to denote a subvector (a_i, ..., a_j). All logarithms in the paper are to the base two.

where Q is a channel input distribution. We will refer to this ensemble as a "(N, R, Q) ensemble". We will use the upper-case notation X^N(m) to denote the random codeword for message m, viewing x^N(m) as a realization of X^N(m). The product-form nature of the probability assignment (4) signifies that the array of MN symbols {X_n(m); 1 ≤ n ≤ N, 1 ≤ m ≤ M}, constituting the code, are sampled in i.i.d. manner.

The probability of error averaged over the (N, R, Q) ensemble is given by

P̄_e(N, R, Q) = Σ_C Pr(C) P_e(C),
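The i.i.d. sampling that defines the (N, R, Q) ensemble is easy to mimic in code. The sketch below is purely illustrative (the function name and interface are mine, not from the paper): it draws M = 2^{NR} codewords of length N, each symbol sampled independently from Q:

```python
import random

def sample_code(N, R, Q, seed=0):
    """Draw one code from the (N, R, Q) ensemble: M = 2^(N*R) codewords,
    each an i.i.d. length-N sample from the input distribution Q."""
    rng = random.Random(seed)
    M = 2 ** round(N * R)                 # number of messages
    symbols = list(Q)                     # channel input alphabet
    weights = [Q[x] for x in symbols]
    return [rng.choices(symbols, weights=weights, k=N) for _ in range(M)]

# A small binary example: N = 8, R = 1/2, uniform Q.
code = sample_code(8, 0.5, {0: 0.5, 1: 0.5})
```

Every entry of the M x N code array is drawn independently, which is exactly the product-form probability assignment described above.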
"genie" tells the decoder when to stop. More specifically, after the first guess m̂_1(y^N) is submitted, the genie tells the decoder if m̂_1(y^N) = m; if so, decoding is completed; otherwise, the second guess m̂_2(y^N) is submitted; and so on. The operation continues until the decoder produces the correct message. We assume that the decoder never repeats an earlier guess, so the task is completed after at most M guesses.

An appropriate "score" for such a guessing decoder is the guesswork, which we define as the number of incorrect guesses G_0(C) until completion of the decoding task. The guesswork G_0(C) is a random variable taking values in the range 0 to M − 1. It should be clear that the optimal strategy for minimizing the average guesswork is to use the ML order: namely, to set the first guess m̂_1(y^N) equal to a most likely message given y^N (as in (2)), the second guess m̂_2(y^N) equal to a second most likely message given y^N, etc. We call a guessing decoder of this type an ML-type guessing decoder.

Let E[G_0(C)] denote the average guesswork for an ML-type guessing decoder for a specific code C. We observe that an incorrect message m′ ≠ m precedes the correct message m in the ML guessing order only if a channel output y^N is received such that W^N(y^N | x^N(m′)) ≥ W^N(y^N | x^N(m)); thus, m′ contributes to the guesswork only if a pairwise error event takes place between the correct message m and the incorrect message m′. Hence, we have

E[G_0(C)] = (1/M) Σ_m Σ_{m′ ≠ m} P_{m,m′}(C)    (16)

where P_{m,m′}(C) is the pairwise error probability under ML decoding as defined in (12). We observe that the right side of (16) is the same as the union bound on the probability of ML decoding error for code C. As in the union bound, rather than trying to compute the guesswork for a specific code, we consider the ensemble average over all codes in a (N, R, Q) code ensemble, and obtain

Ḡ_0(N, R, Q) = Σ_C Pr(C) E[G_0(C)] = (1/M) Σ_m Σ_{m′ ≠ m} P̄_{m,m′}(N, Q),

which in turn simplifies by (14) to

Ḡ_0(N, R, Q) ≤ 2^{N[R − R_0(Q)]}.

The bound on the guesswork is minimized if we use an ensemble (N, R, Q*) for which Q* achieves the maximum of R_0(Q) over all Q; in that case, the bound becomes

Ḡ_0(N, R, Q*) ≤ 2^{N[R − R_0]}.    (17)

In [13], the following converse result was provided for any code C of rate R and block length N:

E[G_0(C)] ≥ max{0, 2^{N(R − R_0 − o(N))} − 1},    (18)

where o(N) is a quantity that goes to 0 as N becomes large. Viewed together, (17) and (18) state that R_0 is a rate threshold that separates two very distinct regimes of operation for an ML-type guessing decoder on channel W: for R > R_0, the average guesswork is exponentially large in N regardless of how the code is chosen; for R < R_0, it is possible to keep the average guesswork close to 0 by an appropriate choice of the code. In this sense, R_0 acts as a computational cutoff rate, beyond which guessing decoders become computationally infeasible.

Although a guessing decoder with a genie is an artificial construct, it provides a valid computational model for studying the computational complexity of the sequential decoding algorithm, as we will see in Sect. III. The interpretation of R_0 as a computational cutoff rate in guessing will carry over directly to sequential decoding.

III. SEQUENTIAL DECODING

The random-coding results show that a code picked at random is likely to be a very good code with an ML decoder error probability exponentially small in the code block length. Unfortunately, randomly-chosen codes do not solve the coding problem, because such codes lack structure, which makes them hard to encode and decode. For a practically acceptable balance between performance and complexity, there are two broad classes of techniques. One is the algebraic-coding approach that eliminates random elements from code construction entirely; this approach has produced many codes with low-complexity encoding and decoding algorithms, but so far none that is capacity-achieving with a practical decoding algorithm. The second is the probabilistic-coding approach that retains a certain amount of randomness while imposing a significant degree of structure on the code so that low-complexity encoding and decoding are possible. A tree code with sequential decoding is an example of this second approach.

A. Tree Codes

A tree code is a code in which the codewords conform to a tree structure. A convolutional code is a tree code in which the codewords are closed under vector addition. These codes were introduced by Elias [14] with the motivation to reduce the complexity of ML decoding by imposing a tree structure on block codes. In the discussion below, we will be considering tree codes with infinite length and infinite memory in order to avoid distracting details; although, in practice, one would use a finite-length finite-memory convolutional code.

The encoding operation for tree codes can be described with the aid of Fig. 2, which shows the first four levels of a tree code with rate R = 1/2. Initially, the encoder is situated at the root of the tree and the codeword string is empty. During each unit of time, one new data bit enters the encoder and causes it to move one level deeper into the tree, taking the upper branch if the input bit is 0, the lower one otherwise. As the encoder moves from one level to the next, it puts out the two-bit label on the traversed branch as the current segment of the codeword. An example of such encoding is shown in the figure, where in response to the input string 0101 the encoder produces 00111000.
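The encoding walk through the tree can be sketched concretely. The branch labels in the example are consistent with the standard rate-1/2 convolutional code with generator polynomials (1 + D + D², 1 + D²), i.e. (7, 5) in octal; that this is the code behind Fig. 2 is my assumption, but the sketch below does reproduce the stated example, mapping the input 0101 to 00111000:

```python
def conv_encode(bits):
    """Rate-1/2 convolutional (tree) encoder with generators (7, 5) octal:
    first label bit  = u[t] ^ u[t-1] ^ u[t-2]
    second label bit = u[t] ^ u[t-2]
    """
    s1 = s2 = 0                    # shift register: u[t-1], u[t-2]
    out = []
    for u in bits:
        out.append(u ^ s1 ^ s2)    # first bit of the branch label
        out.append(u ^ s2)         # second bit of the branch label
        s1, s2 = u, s1             # move one level deeper into the tree
    return out

print(conv_encode([0, 1, 0, 1]))   # -> [0, 0, 1, 1, 1, 0, 0, 0]
```

Because the code is linear, the all-zero input traces the all-zero codeword, matching the "upper branch on input 0" convention described above.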
ARIKAN: ON THE ORIGIN OF POLAR CODING 213
Fig. 4. Splitting a QEC into two fully-correlated BECs by input-relabeling.
Fig. 5. Capacity and cutoff rates with and without splitting a QEC.
To summarize, this section has explained why R_0 appears as a cutoff rate in sequential decoding by linking sequential decoding to guessing. While R_0 has a firm meaning in its role as part of the random-coding exponent, it is a fragile parameter as the cutoff rate in sequential decoding. By devising variants of sequential decoding, it is possible to break the R_0 barrier, as examples in the next section will demonstrate.

IV. BOOSTING THE CUTOFF RATE IN SEQUENTIAL DECODING

In this section, we discuss two methods for boosting the cutoff rate in sequential decoding. The first method, due to Pinsker [6], was introduced in the context of a theoretical analysis of the tradeoff between complexity and performance in coding. The second method, due to Massey [7], had more immediate practical goals and was introduced in the context of the design of a coding and modulation scheme for an optical channel. We present these schemes in reverse chronological order since Massey's scheme is simpler and contains the prototypical idea for boosting the cutoff rate.

which is the same as the capacity of the original QEC. Even more surprisingly, the achievable sum cutoff rate after splitting is 2R_{0,BEC}(ε) = 2 log(2/(1 + ε)), which is strictly larger than R_{0,QEC}(ε) for any 0 < ε < 1. The capacity and cutoff rates for the two coding alternatives are sketched in Fig. 5, showing that substantial gains in the cutoff rate are obtained by splitting.

The above example demonstrates in very simple terms that just by splitting a composite channel into its constituent subchannels one may be able to obtain a net gain in the cutoff rate without sacrificing capacity. Unfortunately, it is not clear how to generalize Massey's idea to other channels. For example, if the channel has a binary input alphabet, it cannot be split. Even if the original channel is amenable to splitting, ignoring the correlations among the subchannels created by splitting may be costly in terms of capacity. So, Massey's scheme remains an interesting isolated instance of cutoff-rate boosting by channel splitting. Its main value lies in its simplicity and the suggestion that building correlated subchannels may be the key to achieving cutoff rate gains. In closing, we refer to [23] for an alternative discussion of Massey's scheme from the viewpoint of multi-access channels.
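Massey's gain can be checked numerically. The computation below uses Gallager's general cutoff rate expression R_0 = −log Σ_y (Σ_x Q(x)√W(y|x))² with uniform inputs (this formula is assumed here rather than quoted from the excerpt above); for a q-ary erasure channel with erasure probability ε it reduces to log q − log(1 + (q − 1)ε):

```python
import math

def r0_erasure(q, eps):
    """Cutoff rate (bits) of a q-ary erasure channel under uniform inputs.
    Each unerased output contributes (1-eps)/q^2 to the inner sum of squares
    (q such outputs); the erasure output contributes eps."""
    total = q * (1 - eps) / q**2 + eps
    return -math.log2(total)

eps = 0.25
r0_qec = r0_erasure(4, eps)    # the quaternary erasure channel
r0_bec = r0_erasure(2, eps)    # one of the two fully-correlated BECs
# Splitting the QEC (Fig. 4) boosts the sum cutoff rate for any 0 < eps < 1:
print(r0_qec, 2 * r0_bec)      # 2*r0_bec strictly exceeds r0_qec
```

The strict gain follows from (1 + ε)² < 1 + 3ε for 0 < ε < 1, so two BEC decoders together beat one QEC decoder at every nontrivial erasure rate.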
well-known data-processing theorem of information theory [11, p. 50] states that

C(V′) ≤ C(V) ≤ N C(W),

where C(V′), C(V), and C(W) denote the capacities of V′, V, and W, respectively. There is also a data-processing theorem that applies to the cutoff rate, stating that

R_0(V′) ≤ R_0(V) ≤ N R_0(W).    (19)

The data-processing result for the cutoff rate follows from a more general result given in [11, pp. 149-150] in the context of "parallel channels." In words, inequality (19) states that it is impossible to boost the cutoff rate of a channel W if one employs a single sequential decoder on the channels V or V′ derived from W by any kind of pre-processing and post-processing operations.

On the other hand, there are cases where it is possible to split the derived channel V′ into K memoryless channels V_i : A → B, 1 ≤ i ≤ K, so that the normalized cutoff rate after splitting shows a cutoff rate gain in the sense that

(1/N) Σ_{i=1}^{K} R_0(V_i) > R_0(W).    (20)

Both Pinsker's scheme and Massey's scheme are examples where (20) is satisfied. In Pinsker's scheme, the alphabets are A = B = {0, 1}, f is an encoder for a binary block code of rate K_2/N_2, g is an ML decoder for the block code, and the bit channel V_i is the channel between u_i and z_i, 1 ≤ i ≤ K_2. In Massey's scheme, with the QEC labeled as in Fig. 4-(a), the length parameters are K = 2 and N = 1, f is the identity map on A = {0, 1}², and g is the identity map on B = {0, 1, ?}².

As we conclude this section, a word of caution is necessary about the application of the above framework for cutoff rate gains. The coordinate channels {V_i} created by the above scheme are in general not memoryless; they interfere with each other in complex ways depending on the specific f and g employed. For a channel V_i with memory, the parameter R_0(V_i) loses its operational meaning as the cutoff rate of sequential decoding. Pinsker avoids such technical difficulties in his construction by using a linear code and restricting the discussion to a symmetric channel. In designing systems that target cutoff rate gains as promised by (20), these points should not be overlooked.

V. SUCCESSIVE-CANCELLATION ARCHITECTURE

In this section, we examine the successive-cancellation (SC) architecture, shown in Fig. 8, as a general framework for boosting the cutoff rate. The SC architecture is more flexible than Pinsker's architecture in Fig. 7, and may be regarded as a generalization of it. This greater flexibility provides significant advantages in terms of building practical coding schemes, as we will see in the rest of the paper. As usual, we will assume that the channel in the system is a binary-input channel W : X = {0, 1} → Y.

A. Channel Combining and Splitting

As seen in Fig. 8, the transmitter in the SC architecture uses a 1-1 mapper f_N that combines N independent copies of W to synthesize a channel

W_N : u^N ∈ {0, 1}^N → y^N ∈ Y^N

with transition probabilities

W_N(y^N | u^N) = Π_{i=1}^{N} W(y_i | x_i),    x^N = f_N(u^N).

The SC architecture has room for N CEs, but these encoders do not have to operate at the same rate, which is one difference between the SC architecture and Pinsker's scheme. The intended mode of operation in the SC architecture is to set the rate of the ith encoder CE_i to a value commensurate with the capability of that channel.

The receiver side in the SC architecture consists of a soft-decision generator (SDG) and a chain of SDs that carry out SC decoding. To discuss the details of the receiver operation, let us
index the blocks in the system by t. At time t, the tth code block x^N, denoted x^N(t), is transmitted (over N copies of W) and y^N(t) is delivered to the receiver. Assume that each round of transmission lasts for T time units, with {x^N(1), ..., x^N(T)} being sent and {y^N(1), ..., y^N(T)} received. Let us write {x^N(t)} to denote {x^N(1), ..., x^N(T)} briefly. Let us use similar time-indexing for all other signals in the system; for example, let d_i(t) denote the data at the input of the ith encoder CE_i at time t.

Decoding in the SC architecture is done layer-by-layer, in N layers: first, the data sequence {d_1(t) : 1 ≤ t ≤ T} is decoded, then {d_2(t)} is decoded, and so on. To decode the first layer of data {d_1(t)}, the SDG computes the soft-decision variables {ℓ_1(t)} as a function of {y^N(t)} and feeds them into the first sequential decoder SD_1. Given {ℓ_1(t)}, SD_1 calculates two sequences: the estimates {d̂_1(t)} of {d_1(t)}, which it sends out as its final decisions about {d_1(t)}; and the estimates {û_1(t)} of {u_1(t)}, which it feeds back to the SDG. Having received {û_1(t)}, the SDG proceeds to compute the soft-decision sequence {ℓ_2(t)} and feeds it into SD_2, which, in turn, computes the estimates {d̂_2(t)} and {û_2(t)}, sends out {d̂_2(t)}, and feeds {û_2(t)} back into the SDG. In general, at the ith layer of SC decoding, the SDG computes the sequence {ℓ_i(t)} and feeds it to SD_i, which in turn computes a data decision sequence {d̂_i(t)}, which it sends out, and a second decision sequence {û_i(t)}, which it feeds back to the SDG. The operation is completed when the Nth decoder SD_N computes and sends out the data decisions {d̂_N(t)}.

B. Capacity and Cutoff Rate Analysis

For capacity and cutoff rate analysis of the SC architecture, we need to first specify a probabilistic model that covers all parts of the system. As usual, we will use upper-case notation to denote the random variables and vectors in the system. In particular, we will write X^N to denote the random vector at the output of the 1-1 mapper f_N; likewise, we will write Y^N to denote the random vector at the input of the SDG g_N. We will assume that X^N is uniformly distributed,

p_{X^N}(x^N) = 1/2^N,  for all x^N ∈ {0, 1}^N.

Since X^N and Y^N are connected by N independent copies of W, we will have

p_{Y^N|X^N}(y^N | x^N) = Π_{i=1}^{N} W(y_i | x_i).

The ensemble (X^N, Y^N), thus specified, will serve as the core of the probabilistic analysis. Next, we expand the probabilistic model to cover other signals of interest in the system. We define U^N as the random vector that appears at the output of the CEs in Fig. 8. Since U^N is in 1-1 correspondence with X^N, it is uniformly distributed. We define Û^N as the random vector that the SDs feed back to the SDG as the estimate of U^N. Ordinarily, any practical system has some non-zero probability that Û^N ≠ U^N. However, modeling such decision errors and dealing with the consequent error propagation effects in the SC chain is a difficult problem. To avoid such difficulties, we will assume that the outer code in Fig. 8 is perfect, so that

Pr(Û^N = U^N) = 1.    (21)

This assumption eliminates the complications arising from error propagation; however, the capacity and cutoff rates calculated under this assumption will be optimistic estimates of what can be achieved by any real system. Still, the analysis will serve as a roadmap and provide benchmarks for practical system design. (In the case of polar codes, we will see that the estimates obtained under the above ideal system model are in fact achievable.) Finally, we define the soft-decision random vector L^N at the output of the SDG so that its ith coordinate is given by

L_i = (Y^N, U^{i−1}),  1 ≤ i ≤ N.

(If it were not for the modeling assumption (21), it would be appropriate to use Û^{i−1} in the definition of L_i instead of U^{i−1}.) This completes the specification of the probabilistic model for all parts of the system.

We first focus on the capacity of the channel W_N created by combining N copies of W. Since we have specified a uniform distribution for channel inputs, the applicable notion of capacity in this analysis is the symmetric capacity, defined as

C_sym(W) = C(W, Q_sym),

where Q_sym is the uniform distribution, Q_sym(0) = Q_sym(1) = 1/2. Likewise, the applicable cutoff rate is now the symmetric one, defined as

R_{0,sym}(W) = R_0(W, Q_sym).

In general, the symmetric capacity may be strictly less than the true capacity, so there is a penalty for using a uniform distribution at channel inputs. However, since we will be dealing with linear codes, the uniform distribution is the only appropriate distribution here. Fortunately, for many channels of practical interest, the uniform distribution is actually the optimal one for achieving the channel capacity and the cutoff rate. These are the class of symmetric channels. A binary-input channel is called symmetric if for each output letter y there exists a "paired" output letter y′ (not necessarily distinct from y) such that W(y|0) = W(y′|1). Examples of symmetric channels include the BSC, the BEC, and the additive Gaussian noise channel with binary inputs. As shown in [11, p. 94], for a symmetric channel, the symmetric versions of channel capacity and cutoff rate coincide with the true ones.

We now turn to the analysis of the capacities of the bit-channels created by the SC architecture. The SC architecture splits the vector channel W_N into N bit-channels, which we will denote by W_N^{(i)}, 1 ≤ i ≤ N. The ith bit-channel connects the output U_i of CE_i to the input L_i of SD_i,

W_N^{(i)} : U_i → L_i = (Y^N, U^{i−1}),

and has symmetric capacity

C_sym(W_N^{(i)}) = I(U_i; L_i) = I(U_i; Y^N, U^{i−1}).
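To put concrete numbers behind the two symmetric quantities, the closed forms for the BSC are the standard expressions C_sym = 1 − H_2(p) and R_0,sym = log(2/(1 + Z)) with Bhattacharyya parameter Z = 2√(p(1 − p)); these formulas are assumed here, not derived in the text above:

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_sym_capacity(p):
    return 1.0 - h2(p)

def bsc_sym_cutoff(p):
    z = 2.0 * math.sqrt(p * (1 - p))    # Bhattacharyya parameter of BSC(p)
    return math.log2(2.0 / (1.0 + z))

p = 0.11
print(bsc_sym_capacity(p), bsc_sym_cutoff(p))   # cutoff rate is the smaller
```

For 0 < p < 1/2 the cutoff rate sits strictly below the capacity; that gap is exactly what the combining-and-splitting schemes of this paper aim to close.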
Σ_{i=1}^{N} C_sym(W_N^{(i)}) = Σ_{i=1}^{N} I(U_i; Y^N U^{i−1})
                            =^{(1)} Σ_{i=1}^{N} I(U_i; Y^N | U^{i−1})
                            =^{(2)} I(U^N; Y^N)
                            =^{(3)} I(X^N; Y^N)
                            =^{(4)} Σ_{i=1}^{N} I(X_i; Y_i) = N C_sym(W)    (22)

where the equality (1) is due to the fact that U_i and U^{i−1} are independent, (2) is by the chain rule, (3) by the 1-1 property of f_N, (4) by the memoryless property of the channel W. Thus, the aggregate symmetric capacity of the underlying N copies of W is preserved by the combining and splitting operations.

Our main interest in using the SC architecture is to obtain a gain in the aggregate cutoff rate. We define the normalized symmetric cutoff rate under SC decoding as

R̄_{0,sym}(W_N) = (1/N) Σ_{i=1}^{N} R_{0,sym}(W_N^{(i)}).    (23)

The objective in applying the SC architecture may be stated as devising schemes for which

R̄_{0,sym}(W_N) > R_{0,sym}(W)    (24)

holds by a significant margin. The ultimate goal would be to have R̄_{0,sym}(W_N) approach C_sym(W) as N increases.

For specific examples of schemes that follow the SC architecture and achieve cutoff rate gains in the sense of (24), we refer to [26] and the references therein. We must mention in this connection that the multilevel coding scheme of Imai and Hirakawa [25] is perhaps the first example of the SC architecture in the literature, but the focus there was not to boost the cutoff rate. In the next section, we discuss polar coding as another scheme that conforms to the SC architecture and provides the type of cutoff rate gains envisaged by (24).

VI. POLAR CODING

Polar coding is an example of a coding scheme that fits into the framework of the preceding section and has the property that

R̄_{0,sym}(W_N) → C_sym(W)  as N → ∞.    (25)

The name "polar" refers to a phenomenon called "polarization" that will be described later in this section. We begin by describing the channel combining and splitting operations in polar coding.

A. Channel Combining and Splitting for Polar Codes

The channel combining and splitting operations in polar coding follow the general principles already described in detail in Sect. V-A. We only need to describe the particular 1-1 transformation f_N that is used for constructing a polar code of size N. We will begin this description starting with N = 2.

The basic module of the channel combining operation in polar coding is shown in Fig. 9, in which two independent copies of W : {0, 1} → Y are combined into a channel W_2 : {0, 1}² → Y² using a 1-1 mapping f_2 defined by

f_2(u_1, u_2) = (u_1 ⊕ u_2, u_2)    (26)

where ⊕ denotes modulo-2 addition in the binary field F_2 = {0, 1}. We call the basic transform f_2 the kernel of the construction. (We defer the discussion of how to find a suitable kernel until the end of this subsection.)

Fig. 9. Basic polar code construction.

Polar coding extends the above basic combining operation recursively to constructions of size N = 2^n, for any n ≥ 1. For N = 4, the polar code construction is shown in Fig. 10, where 4 independent copies of W are combined by a 1-1 mapping f_4 into a channel W_4.

Fig. 10. Size-4 polar code construction.

The general form of the recursion in polar code construction is illustrated in Fig. 11 and can be expressed algebraically as

f_{2N}(u^{2N}) = (f_N(u^N) ⊕ f_N(u_{N+1}^{2N}), f_N(u_{N+1}^{2N})),    (27)

where ⊕ denotes the componentwise mod-2 addition of two vectors of the same length over F_2.
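The recursion (27) translates directly into a few lines of code. The sketch below is an illustrative implementation (the function name is mine); it computes x^N = f_N(u^N) for any N = 2^n:

```python
def polar_transform(u):
    """Apply the polar mapping x^N = f_N(u^N) of (27) recursively.

    len(u) must be a power of two; entries are 0/1 integers.
    Base case f_1 is the identity; f_2(u1, u2) = (u1 XOR u2, u2) as in (26).
    """
    n = len(u)
    if n == 1:
        return list(u)
    a = polar_transform(u[:n // 2])   # f_N applied to the first half
    b = polar_transform(u[n // 2:])   # f_N applied to the second half
    return [ai ^ bi for ai, bi in zip(a, b)] + b

print(polar_transform([1, 1]))          # [0, 1], the kernel (26)
print(polar_transform([1, 0, 1, 1]))    # [1, 1, 0, 1]
```

Since the kernel matrix squares to the identity over F_2, f_N is its own inverse, so the same routine also maps x^N back to u^N.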
Z (W − ) + Z (W + ) ≤ 2 Z (W ), (31) Fig. 14. A second size-2 polar code construction inside a size-4 construction.
Likewise, we have, from (30), assumption (21) to any desired degree of accuracy by using con-
volutional codes of sufficiently long constraint lengths. Luckily,
R0,sym (W −− ) + R0,sym (W −+ ) ≥ 2 R0,sym (W − ), it turns out using such convolutional codes is unnecessary to
R0,sym (W +− ) + R0,sym (W ++ ) ≥ 2 R0,sym (W + ). have a practically viable scheme. The polarization phenomenon
creates sufficient number of sufficiently good channels fast
Here, we extended the notation and used W −− to denote enough that the validity of (25) can be maintained without any
(W − )− , and similarly for W −+ , etc. help from an outer convolutional code and sequential decoder.
If we normalize the aggregate cutoff rates for N = 4 and The details of this last step of polar code construction are as
compare with the normalized cutoff rate for N = 2, we obtain follows.
Let us reconsider the scheme in Fig. 8. At the outset, the
R 0,sym (W4 ) ≥ R 0,sym (W2 ) ≥ R0,sym (W ). plan was to operate the ith convolutional encoder CEi at a rate
(i)
These inequalities are strict unless W is extreme. just below the symmetric cutoff rate R0,sym (W N ). However, in
The recursive argument given above can be applied to the light of the polarization phenomenon, we know that almost all
situtation in Fig. 11 to show that for any N = 2n , n ≥ 1, and the cutoff rates R0,sym (W N(i) ) are clustered around 0 or 1 for N
1 ≤ i ≤ N , the following relations hold large. This suggests rounding off the rates of all convolutional
encoders to 0 or 1, effectively eliminating the outer code. Such a
(2i−1) (i) (2i) (i)
W_2N^(2i−1) ≡ (W_N^(i))^− ,   W_2N^(2i) ≡ (W_N^(i))^+ ,

from which it follows that

R_0,sym(W_2N) ≥ R_0,sym(W_N).

These results establish that the sequence of normalized cutoff rates {R_0,sym(W_N)} is monotone non-decreasing in N. Since R_0,sym(W_N) ≤ C_sym(W) for all N, the sequence must converge to a limit. It turns out, as might be expected, that this limit is the symmetric capacity C_sym(W). We examine the asymptotic behavior of the polar code construction process in the next subsection.

C. Polarization and Elimination of the Outer Code

As the construction size in the polar transform is increased, a "polarization" phenomenon gradually takes hold. All channels {W_N^(i) : 1 ≤ i ≤ N} created by the polar transform, except for a vanishing fraction, approach extreme limits (becoming near perfect or useless) with increasing N. One form of expressing polarization more precisely is the following. For any fixed δ > 0, the channels created by the polar transform satisfy

   (1/N) |{i : R_0,sym(W_N^(i)) > 1 − δ}| → C_sym(W)   (34)

and

   (1/N) |{i : R_0,sym(W_N^(i)) < δ}| → 1 − C_sym(W)   (35)

as N increases. (|A| denotes the number of elements in a set A.) A proof of this result using martingale theory can be found in [5]; for a recent simpler proof that avoids martingales, we refer to [28].

As an immediate corollary to (34), we obtain (25), establishing the main goal of this analysis. While this result is very reassuring, there are many remaining technical details that have to be taken care of before we can claim to have a practical coding scheme. First of all, we should not forget that the validity of (25) rests on the assumption (21) that there are no errors in the SC decoding chain. We may argue that we can satisfy this assumption by retaining the outer code; the scheme adopted in [5] instead eliminates the outer code entirely. The revised scheme is highly attractive due to its simplicity, but dispensing with the outer code exposes the system to unmitigated error propagation in the SC chain.

To analyze the performance of the scheme that has no protection by an outer code, let A denote the set of indices i ∈ {1, ..., N} of input variables U_i that will carry data at rate 1. We call A the set of "active" variables. Let A^c denote the complement of A, and call this set the set of "frozen" variables. We will denote the active variables collectively by U_A = (U_i : i ∈ A) and the frozen ones by U_A^c = (U_i : i ∈ A^c), each vector regarded as a subvector of U^N. Let K denote the size of A.

Encoding is done by setting U_A = D^K and U_A^c = b^(N−K), where D^K is user data equally likely to take any value in {0,1}^K and b^(N−K) ∈ {0,1}^(N−K) is a fixed pattern. The user data may change from one block to the next, but the frozen pattern remains the same and is known to the decoder. This system carries K bits of data in each block of N channel uses, for a transmission rate of R = K/N.

At the receiver, we suppose that there is an SC decoder that computes its decision Û^N by calculating the likelihood ratio

   L_i = Pr(U_i = 0 | Y^N, Û^(i−1)) / Pr(U_i = 1 | Y^N, Û^(i−1)),

and setting

   Û_i = U_i,  if i ∈ A^c;
         0,    if i ∈ A and L_i > 1;
         1,    if i ∈ A and L_i ≤ 1,

successively, starting with i = 1. Since the variables U_A^c are fixed to b^(N−K), this decoding rule can be implemented at the decoder. The probability of frame error for this system is given by

   P_e(A, b^(N−K)) = P(Û_A ≠ U_A | U_A^c = b^(N−K)).

For a symmetric channel, the error probability P_e(A, b^(N−K)) does not depend on b^(N−K) [5]. A convenient choice in that case may be to set b^(N−K) to the zero vector.
222 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 34, NO. 2, FEBRUARY 2016

For a general channel, we consider the average of the error probability over all possible choices for b^(N−K), namely,

   P_e(A) = (1/2^(N−K)) Σ_{b^(N−K) ∈ {0,1}^(N−K)} P(Û_A ≠ U_A | U_A^c = b^(N−K)).

It is shown in [5] that

   P_e(A) ≤ Σ_{i ∈ A} Z(W_N^(i)),   (36)

where Z(W_N^(i)) is the Bhattacharyya parameter of W_N^(i).

The bound (36) suggests that A should be chosen so as to minimize the upper bound on P_e(A). The performance attainable by such a design rule can be calculated directly from the following polarization result from [29]. For any fixed β < 1/2, the Bhattacharyya parameters created by the polar transform satisfy

   (1/N) |{i : Z(W_N^(i)) < 2^(−N^β)}| → C_sym(W).   (37)

In particular, if we fix β = 0.49, the fraction of channels W_N^(i) in the population {W_N^(i) : 1 ≤ i ≤ N} satisfying

   Z(W_N^(i)) < 2^(−N^0.49)   (38)

approaches the symmetric capacity C_sym(W) as N becomes large. So, if we fix the rate R < C_sym(W), then for all N sufficiently large, we will be able to select an active set A_N of size K = NR such that (38) holds for each i ∈ A_N. With A_N selected in this way, the probability of error is bounded as

   P_e(A_N) ≤ Σ_{i ∈ A_N} Z(W_N^(i)) ≤ N 2^(−N^0.49) → 0.

This establishes the feasibility of constructing polar codes that operate at any rate R < C_sym(W) with a probability of error going to 0 exponentially, roughly as 2^(−√N).

This brings us to the end of our discussion of polar codes. We will close by mentioning two important facts that relate to complexity. The SC decoding algorithm for polar codes can be implemented in complexity O(N log N) [5]. The construction of polar codes, namely, the selection of an optimal (subject to numerical precision) set A of active channels (either by computing the Bhattacharyya parameters {Z(W_N^(i))} or some related set of quality parameters), can be done with complexity O(N), as shown in the sequence of papers [30], [31], and [32].

In summary, the key idea for boosting the cutoff rate is to split the vector channel into correlated subchannels. With proper combining and splitting, it is possible to obtain an improvement in the aggregate cutoff rate. Polar coding is a recursive implementation of this basic idea. The recursiveness renders polar codes analytically tractable, leading to an explicit code construction algorithm, and also makes it possible to encode and decode these codes at low complexity.

Although polar coding was originally intended to be the inner code in a concatenated scheme, it turned out (to our pleasant surprise) that the inner code was so reliable that there was no need for the outer convolutional code or the sequential decoder. However, to further improve polar coding, one could still consider adding an outer coding scheme, as originally planned.

APPENDIX
DERIVATION OF THE PAIRWISE ERROR BOUND

This appendix provides a proof of the pairwise error bound (14). The proof below is standard textbook material. It is reproduced here for completeness and to demonstrate the simplicity of the basic idea underlying the cutoff rate.

Let C = {x^N(1), ..., x^N(M)} be a specific code, and let R = (1/N) log M be its rate. Fix two distinct messages m ≠ m′, 1 ≤ m, m′ ≤ M, and define P_m,m′(C) as in (12). Then,

   P_m,m′(C) = Σ_{y^N ∈ E_m,m′(C)} W^N(y^N | x^N(m))

   (1) ≤ Σ_{y^N} W^N(y^N | x^N(m)) √( W^N(y^N | x^N(m′)) / W^N(y^N | x^N(m)) )

       = Σ_{y^N} √( W^N(y^N | x^N(m)) W^N(y^N | x^N(m′)) )

   (2) = Π_{n=1}^{N} Σ_{y_n} √( W(y_n | x_n(m)) W(y_n | x_n(m′)) )

   (3) = Π_{n=1}^{N} Z_m,m′(n),

where the inequality (1) follows by the simple observation that

   √( W^N(y^N | x^N(m′)) / W^N(y^N | x^N(m)) ) ≥ 1 if y^N ∈ E_m,m′(C), and ≥ 0 otherwise,

equality (2) follows by the memoryless channel assumption, and (3) by the definition

   Z_m,m′(n) = Σ_y √( W(y | x_n(m)) W(y | x_n(m′)) ).   (39)
Averaging this bound over the ensemble of codes whose symbols are drawn independently from a common distribution Q gives

   P_m,m′(N, Q) ≤ E[ Π_{n=1}^{N} Z_m,m′(n) ] (1)= Π_{n=1}^{N} E[Z_m,m′(n)] (2)= ( E[Z_m,m′(1)] )^N,

where E[·] denotes averaging with respect to the code ensemble, the equality (1) is due to the independence of the random variables {Z_m,m′(n) : 1 ≤ n ≤ N}, and (2) is by the fact that the same random variables are identically distributed. Finally, we note that

   E[Z_m,m′(1)] = Σ_y E[ √( W(y | X_1(m)) W(y | X_1(m′)) ) ]

                = Σ_y E[ √W(y | X_1(m)) ] · E[ √W(y | X_1(m′)) ]

                = Σ_y ( Σ_x Q(x) √W(y | x) )²,

where the second equality uses the independence of X_1(m) and X_1(m′). In view of the definition (10), this proves that P_m,m′(N, Q) ≤ 2^(−N R_0(Q)), for any m ≠ m′.

ACKNOWLEDGMENT

The author would like to thank Senior Editor M. Pursley and the anonymous reviewers for extremely helpful suggestions to improve the paper.

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379–423, 623–656, Jul./Oct. 1948.
[2] C. Berrou and A. Glavieux, "Near optimum error correcting coding and decoding: Turbo-codes," IEEE Trans. Commun., vol. 44, no. 10, pp. 1261–1271, Oct. 1996.
[3] R. G. Gallager, "Low-density parity-check codes," IRE Trans. Inf. Theory, vol. 8, pp. 21–28, Jan. 1962.
[4] D. J. Costello and G. D. Forney Jr., "Channel coding: The road to channel capacity," Proc. IEEE, vol. 95, no. 6, pp. 1150–1177, Jun. 2007.
[5] E. Arıkan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
[6] M. S. Pinsker, "On the complexity of decoding," Probl. Peredachi Inf., vol. 1, no. 1, pp. 84–86, 1965.
[7] J. L. Massey, "Capacity, cutoff rate, and coding for a direct-detection optical channel," IEEE Trans. Commun., vol. COM-29, no. 11, pp. 1615–1621, Nov. 1981.
[8] E. Arıkan, "Sequential decoding for multiple access channels," Lab. Inf. Dec. Syst., MIT, Cambridge, MA, USA, Tech. Rep. LIDS-TH-1517, 1985.
[9] R. M. Fano, Transmission of Information, 4th ed. Cambridge, MA, USA: MIT Press, 1961.
[10] R. G. Gallager, "A simple derivation of the coding theorem and some applications," IEEE Trans. Inf. Theory, vol. IT-11, no. 1, pp. 3–18, Jan. 1965.
[11] R. G. Gallager, Information Theory and Reliable Communication. Hoboken, NJ, USA: Wiley, 1968.
[12] R. G. Gallager, "The random coding bound is tight for the average code," IEEE Trans. Inf. Theory, vol. IT-19, no. 2, pp. 244–246, Mar. 1973.
[13] E. Arıkan, "An inequality on guessing and its application to sequential decoding," IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 99–105, Jan. 1996.
[14] P. Elias, "Coding for noisy channels," IRE Conv. Rec., pt. 4, pp. 37–46, Mar. 1955. Reprinted in Key Papers in the Development of Information Theory, D. Slepian, Ed. Piscataway, NJ, USA: IEEE Press, 1974, pp. 102–111.
[15] J. M. Wozencraft, "Sequential decoding for reliable communication," Res. Lab. Elect., MIT, Cambridge, MA, USA, Tech. Rep. 325, Aug. 1957.
[16] J. M. Wozencraft and B. Reiffen, Sequential Decoding. Cambridge, MA, USA: MIT Press, 1961.
[17] R. M. Fano, "A heuristic discussion of probabilistic decoding," IEEE Trans. Inf. Theory, vol. IT-9, no. 2, pp. 64–74, Apr. 1963.
[18] J. W. Layland and W. A. Lushbaugh, "A flexible high-speed sequential decoder for deep space channels," IEEE Trans. Commun. Technol., vol. 19, no. 5, pp. 813–820, Oct. 1971.
[19] G. D. Forney Jr., "1995 Shannon lecture—Performance and complexity," IEEE Inf. Theory Soc. Newslett., vol. 46, pp. 3–4, Mar. 1996.
[20] B. Reiffen, "Sequential encoding and decoding for the discrete memoryless channel," Res. Lab. Elect., MIT, Cambridge, MA, USA, Tech. Rep. 374, 1960.
[21] I. M. Jacobs and E. R. Berlekamp, "A lower bound to the distribution of computation for sequential decoding," IEEE Trans. Inf. Theory, vol. IT-13, no. 2, pp. 167–174, Apr. 1967.
[22] E. Arıkan, "An upper bound on the cutoff rate of sequential decoding," IEEE Trans. Inf. Theory, vol. 34, no. 1, pp. 55–63, Jan. 1988.
[23] R. G. Gallager, "A perspective on multiaccess channels," IEEE Trans. Inf. Theory, vol. IT-31, no. 2, pp. 124–142, Mar. 1985.
[24] P. Elias, "Error-free coding," IRE Trans. Inf. Theory, vol. 4, pp. 29–37, Sep. 1954.
[25] H. Imai and S. Hirakawa, "A new multilevel coding method using error-correcting codes," IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 371–377, May 1977.
[26] E. Arıkan, "Channel combining and splitting for cutoff rate improvement," IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 628–639, Feb. 2006.
[27] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ, USA: Wiley, 2006.
[28] M. Alsan and E. Telatar, "A simple proof of polarization and polarization for non-stationary channels," in Proc. IEEE Int. Symp. Inf. Theory, 2014, pp. 301–305.
[29] E. Arıkan and E. Telatar, "On the rate of channel polarization," in Proc. IEEE Int. Symp. Inf. Theory, Seoul, South Korea, Jun. 28/Jul. 3, 2009, pp. 1493–1495.
[30] R. Mori and T. Tanaka, "Performance and construction of polar codes on symmetric binary-input memoryless channels," in Proc. IEEE Int. Symp. Inf. Theory, 2009, pp. 1496–1500.
[31] I. Tal and A. Vardy, "How to construct polar codes," IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6562–6582, Oct. 2013.
[32] R. Pedarsani, S. H. Hassani, I. Tal, and E. Telatar, "On the construction of polar codes," in Proc. IEEE Int. Symp. Inf. Theory, 2011, pp. 11–15.

Erdal Arıkan (S'84–M'79–SM'94–F'11) was born in Ankara, Turkey, in 1958. He received the B.S. degree from the California Institute of Technology, Pasadena, CA, USA, and the S.M. and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge, MA, USA, all in electrical engineering, in 1981, 1982, and 1985, respectively. Since 1987, he has been with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey, where he works as a Professor. He was the recipient of the 2010 IEEE Information Theory Society Paper Award and the 2013 IEEE W.R.G. Baker Award, both for his work on polar coding.