
Implementing LDPC Decoding on Network-On-Chip

T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin


Penn State University
{theochar, link, vijay, mji}@cse.psu.edu

Proceedings of the 18th International Conference on VLSI Design held jointly with the 4th International Conference on Embedded Systems Design (VLSID'05)
1063-9667/05 $20.00 © 2005 IEEE

Abstract

Low-Density Parity Check (LDPC) codes are a form of error correcting codes used in various wireless communication applications and in disk drives. While LDPC codes are desirable due to their ability to achieve near Shannon-limit communication channel capacity, the computational complexity of the decoder is a major concern. LDPC decoding consists of a series of iterative computations derived from a message-passing bipartite graph. In order to efficiently support the communication intensive nature of this application, we present an LDPC decoder architecture based on a network-on-chip communication fabric that provides a 1.2Gbps decoded throughput rate for a 3/4 code rate, 1024-bit block LDPC code. The proposed architecture can be reconfigured to support other LDPC codes of different block sizes and code rates. We also propose two novel power-aware optimizations that reduce the power consumption by up to 30%.

1. Introduction

Low Density Parity Check (LDPC) codes are a form of error correcting codes using iterative decoding, with the advantage that they can achieve near Shannon-limit communication channel capacity [1, 2]. They offer excellent decoding performance as well as good block error performance. There is almost no error floor, and they are decodable in linear time. Consequently, LDPC codes are used for various wireless communications and for reliable high-throughput disk data transfers [2]. Design of power-efficient LDPC encoders and decoders with a low bit-error rate (BER) in low signal-to-noise ratio (SNR) channels is critical for these environments. This paper presents a power-efficient reconfigurable architecture for an LDPC decoder which achieves a high throughput even in noisy channels.

An LDPC code is a linear message encoding technique, defined by a set of two very sparse parity check matrices, G and H. The message to be sent is encoded using the G matrix. When it reaches its destination, it is decoded using the H matrix. In contrast to the encoder, the LDPC decoder is computationally intensive and consists of iterative computations derived from a message-passing bipartite graph (shown in Figure 1). Message passing iterations are performed by the two computation units - the bit node and the check node [2]. The LDPC algorithm is communication intensive, and thus the selection of the interconnect structure of the decoder architecture is critical.

[Figure 1: H-Matrix - Bipartite Graph - Network on Chip. The example parity check matrix H has check rows a-f and bit columns A-I:
  a: 1 0 0 1 0 0 1 0 0
  b: 0 1 0 0 1 0 0 1 0
  c: 0 0 1 0 0 1 0 0 1
  d: 1 0 1 0 0 0 0 1 0
  e: 0 0 0 1 0 0 1 0 1
  f: 0 1 0 0 1 1 0 0 0
The figure shows this matrix, its bipartite graph of bit nodes and check nodes, and the corresponding network-on-chip mapping.]

LDPC codes offer excellent communication over noisy channels, hence the design of LDPC decoders has been heavily researched. Researchers in [3-7] conclude that the primary problems in hardware implementations are the high memory requirements per node to store computed results, and the complicated interconnect structures. Attempted implementations addressing these problems are either limited in the number of LDPC code types supported, or constrained by the hardware architecture (FPGA implementation).

In this paper, we propose an LDPC decoder architecture based on a network-on-chip interconnect [9] to efficiently support the communication intensive nature of the application. This architecture exploits the inherent parallelism of the decoding algorithm and has been implemented using 160nm CMOS technology. The resulting decoder operating at 500MHz is shown to provide an effective decoded throughput of 1.2Gbps for a 1024-bits per block, 3/4 code rate LDPC code. The architecture can be reconfigured to support other code rates and block sizes based on the desired error protection demanded by the application. Similarly, the architecture supports both regular and irregular LDPC codes through the use of reprogrammable configuration memories. The architecture also includes power reduction optimizations that exploit the intrinsic properties of the decoding algorithm. We demonstrate that these techniques reduce the power consumption of the decoder by 30.11% in the above configuration.

This work was supported in part by grants from NSF CAREER 0093085, NSF 0130143, NSF 0202007 and MARCO/DARPA GSRC-PAS.

2. LDPC Decoder Architecture

2.1 Algorithm Overview

We now briefly review the operations performed by an LDPC decoder. Details of the algorithm can be found in [1-3, 8]. The decoder consists of two major computation units, the bit node and the check node, which iteratively communicate with each other with the common objective of decoding a message bit. When a message is about to be decoded, for every message bit, the corresponding bit node receives an initial bit likelihood ratio, defined as the ratio of the probability of the bit being 1 to the probability of the bit being 0. The computation starts from the bit node. Each bit node receives the initial bit likelihood ratio, passing it to every check node that the current bit is associated with. Each check node receives a set of likelihood ratios (lr_i_old) from the various bit nodes it is associated with. Once the entire set arrives, the check node computes and sends new likelihood ratios (lr_i_new) to each bit node. Computing the operation in the log domain results in significant advantages [8], hence the likelihood ratio becomes a logarithmic likelihood ratio (llr). The check node is responsible for computing the llr for each bit participating in a check. The new llr depends on the llrs of the other bits participating in that check. The check node operation is shown in Eq. 1. Details of the operation are described in [3].

llr_i_new = B[ B(llr_0_old) + B(llr_1_old) + ... + B(llr_n_old) - B(llr_i_old) ]   (Eq. 1 - Check Node Operation)

The function B(llr) is the logarithmic bilinear transform,

B(llr_i_old) = ln( (1 + e^llr_i_old) / (1 - e^llr_i_old) ).

This function is implemented using a ROM based look-up table (LUT) for the check node computation, with sign-magnitude data representation for the entire computation, based on our prior
work described in [8]. The check node computes the new llrs for all bits participating in each check, and sends each result back to the originating bit node. The bit node in turn accumulates the new llrs it receives from all checks that the bit participates in, excluding the current one, until the llr converges towards either negative or positive infinity. Convergence to positive/negative infinity translates to the decoded bit being either 0 or 1, respectively. The operation (log domain) is outlined in Eq. 2:

llr_i_new = stored_llr_i + llr_0 + llr_1 + ... + llr_n - llr_i   (Eq. 2 - Bit Node Operation)

The term stored_llr describes the previously stored logarithmic likelihood ratio for that bit, or the initial logarithmic likelihood ratio if this is the first iteration. The rest are the results from each check node that the bit participates in. The new llr is stored in the bit node and also sent back to the participating checks to complete the computation. The bit node is responsible for controlling the entire operation on each bit, as it controls the number of iterations necessary for convergence.

A critical design parameter which affects both the performance and the power consumption of the decoder is the bit width of the data word used in the computation. A large data word results in a lower BER even in noisy channels with an SNR of 0.5dB, but at the cost of increased power consumption. Power consumption is therefore dependent on the amount of noise expected in the channel, as achieving a low BER requires both a more precise number representation and an increased number of iterations of the algorithm to reach convergence. Note that the power optimizations we present here can be applied to all designs, regardless of bit width. Our test case uses a 16-bit signed-magnitude fixed-point representation.

2.2 Network on Chip (NoC) Architecture

The decoder is implemented as an on-chip network, where the bit and check nodes act as processing elements (PEs), which communicate via on-chip network routers, as shown in Figure 1. In the example shown, the network consists of 25 nodes, 16 bit nodes and 9 check nodes, communicating in a 2D mesh topology. By providing configuration memory to each PE, we can map more than one node onto a single PE. In the case that there are virtual nodes that cannot all be mapped at once on a single chip, we use the serial approach outlined in [3].

The architecture is initialized to reflect the mapping of the nodes of the bipartite graph onto the physical nodes. For the given LDPC code, the H matrix is used to generate the configuration data, which includes the number of virtual nodes mapped onto a PE and the connectivity information capturing the communication pattern between the different nodes. Each PE has a dedicated memory to store this configuration information. The configuration information is routed from the I/O to the individual PEs using packet based communication.

After initialization, computation enters the decoding stage, where data comes into the network in the form of logarithmic likelihood ratios. The bit nodes then load the incoming data into their memories and initialize the computation, which develops into the iterative message passing between checks and bit nodes. When the specified number of iterations has completed (and hence the values of individual bits have converged), the bit node outputs the bit value to the I/O. The on-chip communication is detailed in the next subsection.

2.3 On-Chip Communication

Inter-PE communication is handled by an on-chip network consisting of a number of small on-chip routers. In our implementation of the network, a simple 2D mesh topology is used, with deterministic XY routing as the routing algorithm. As losing a single data value is catastrophic to LDPC computation, our network sends on/off backpressure information, preventing packet drop. This inability to drop packets removes the need for complicated ACK/NACK based protocols, and reduces storage and energy requirements throughout the system. Each packet contains not only a physical destination address in the network (which requires 5 bits in the 25-node example network), but also 6 bits of virtual identification information that identifies which of the 64 virtual nodes in a physical node the packet is destined for. In addition, the packet contains virtual identification information that identifies from which particular partnered virtual node the packet was sent, allowing return address information to be extracted. As we assume that a given node will never communicate with more than 16 partners, this requires an additional 4 bits. Finally, a single bit is reserved to mark configuration packets, for a total header size of 16 bits per packet.

Each packet consists of a single 48-bit phit, and as such, there is no distinction between wormhole and packet switching. The routers themselves are not pipelined, and as such, a full packet of data moves one hop per clock cycle. The distance the packets must move across the network is minimized by intelligently mapping the bipartite graph onto the physical network, resulting in an average network distance half of that obtained with random placement.

In a direct implementation of LDPC decoding on NoC, as the analog signal arrives and is converted into llr values after ADC conversion, the llr values are grouped into message blocks. The llr values from each message block are then packetized and sent into the bit nodes. Each output value during the decoding process is also contained in its own unique packet, for a one-to-one ratio of header bits to data bits. In our architecture, two blocks are decoded in parallel, allowing 66% of the network traffic to be useful data. Both message blocks are decoded using the same H-matrix (and thus the same communication pattern), allowing for an increased device throughput at the cost of a slightly increased decode latency for every other packet. This behavior is shown in Figure 2 below.

[Figure 2: Message Decoding Behavior. Two messages from the incoming stream are ADC-converted, packetized (header H, data values D1, D2), sent to the bit nodes, and decoded in an interleaved fashion, producing decoded message 1 and decoded message 2.]

2.4 Bit Node PE Architecture

[Figure 3: Bit Node Architecture. A four-stage pipeline: a packet decoder (PD) reads the header and data; a data concentrator (DC) steers the two 16-bit data values; an execution unit (EU) backed by the configuration memory (CM) computes the new llrs; and a header flow table (HFT) feeds the packet generator (PG), which emits the outgoing packet.]

The bit node architecture is shown in detail in Figure 3. The interface to the network consists of a single 48-bit input port and a 48-bit output port, each capable of transmitting an entire packet per clock cycle. As each packet arrives, the data concentrator directs values to the appropriate accumulator, based on the virtual identification information in the packet header. Once all input values for a given virtual bit node are received, the computation proceeds to the execution unit, where the individual llr values are
subtracted from the accumulated sum, as in Equation 2. Finally, the source information contained in the header is used to index the header flow table, generating the header for the outgoing packet containing the new llr. Note the parallel 16-bit datapath throughout the bit node, supporting the lockstep computation of two messages simultaneously.

2.5 Check Node PE Architecture

The check node architecture is shown in Figure 4. The check node has to compute the new llrs based on Equation 1. The data flow through the check node is similar to that of the bit node, with the most obvious difference being the addition of four computational units (CUs), which perform the bilinear transform in the logarithmic domain. The first two CUs, located just as the packet arrives, perform the operation on each individual incoming data value, allowing the accumulators in stage four to perform the summation Σ B(llr_i). After all values have been received for a given virtual check node, the execution unit subtracts away the individual B(llr) values and performs the bilinear transform in the logarithmic domain once more, completing the operations found in Equation 1. Since each output llr_x is generated in the same order as the packet associated with x arrived, the stored headers are scrolled out and used as inputs to the header flow table, allowing each packet to be addressed properly to return to the appropriate bit node. Again, note that the node supports the simultaneous decoding of two independent message blocks, allowing for lockstep operation.

[Figure 4: Check Node Architecture. An eight-stage pipeline: a packet decoder and data concentrator receive packets; computational units CU1 and CU2 apply the bilinear transform to each incoming value; per-virtual-node data and header FIFOs (d1/h1 through dn/hn) accumulate the transformed values; the execution unit, with CU3 and CU4 and an arbiter, subtracts individual terms and re-applies the transform; and a header flow table feeds the packet generator. Control logic manages clocking and data flow.]

3. Experimental Platform and Results

The first step in evaluating our architecture was to obtain the H matrices for the desired LDPC codes; for this we used the LDPC software implementation of [10] to generate multiple H matrices based on inputs such as the desired block size, regular/irregular codes, and code rate. Given that our architecture is designed to accommodate multiple types of LDPC codes, we generated matrices which consisted of both regular and irregular LDPC codes. Next, we designed a 5x5 2D mesh network, consisting of 16 physical bit nodes and 9 physical check nodes as the PEs, in Verilog, and synthesized it using a commercial 160nm cell library. Each PE supports up to 64 virtual nodes, allowing simultaneous on-chip decoding of 1024 bits. The chosen H matrices, however, were of various column sizes, therefore the block decoding size varied from 512 to 1024 bits. The synthesized design operates on a 500MHz clock at 1.8V Vdd. It occupies an area of ~110mm2.

In order to verify the functionality of our design, we used a cycle-accurate NoC simulator, NOCsim [11], to model our design and perform cycle-accurate simulation that captures both the initialization and computation phases of the decoder using the input H-matrix. We verified the decoded output from our simulated design against that obtained using the software implementation of [10] to obtain the bit error rates. This simulation also provides us with the number of cycles required for completing the decoding. When combined with the clock frequency of our synthesized designs, the proposed architecture is determined to provide a decoded bit throughput of 750 Mbps when a maximum number of iterations is not set (instead, the bit nodes had to converge to positive or negative infinity). This is however a pessimistic approach; setting a maximum iteration count and determining the bit value from the ratio at the point where the maximum iterations are completed gives a higher throughput. The number of iterations is related to the accuracy of the computation, as convergence is not guaranteed with a lower number of iterations. For low noise channels, it is desirable to limit the number of iterations, as it will result in a higher throughput, with the BER not as affected as in high noise channels.

Next, we used the binary data patterns captured from the cycle-accurate NOCsim simulation as input patterns to our synthesized design and simulated it using Power Compiler to obtain the power consumption values. For the irregular LDPC code, the average power consumption in decoding 1024 bits is 34.8 Watts, a significant portion (43%) of which is consumed in the on-chip interconnect. The check nodes (23%) also consume more power than the bit nodes (22%), as expected given their larger pipeline length and computation complexity. Approximately 12% of the chip power is consumed as leakage power.

4. Power Optimization

4.1 Power Driven Message Encoding

Represented Value | Binary data to be transmitted (Bits 0-15) | Transmitted Data (Bits S1-S2, Bits 0-15)
Zero              | 0000000000000000                          | 00 <previous data value>
+infinity         | 0111111111111111                          | 01 <previous data value>
All others        | XXXXXXXXXXXXXXXX                          | 10 <new data value>
-infinity         | 1111111111111111                          | 11 <previous data value>

Figure 5a: Message Encoding Scheme.

[Figure 5b: Encoding scheme as applied in NoC routers. The S1/S2 lines travel alongside the 16-bit data payload and header through the router crossbar (X-BAR) from PE to network, under control logic that manages clocking and data flow.]

The LDPC algorithm involves a series of messages passed between each node, eventually resulting in each computed log likelihood ratio converging at either positive or negative infinity. Implementing this in hardware results in a large number of values passed between nodes truncated to either zero, negative infinity or positive infinity. Between 25% and 40% of the total data values passed between each node are either zero or +/- infinity. This implies a high switching activity on the interconnect, as the most commonly transferred values are the complement of each other. We can, however, take advantage of this LDPC behavior by encoding these values using 2 additional bits and avoiding switching activity when these values are transmitted. We propose the introduction of two additional bit lines within the data bits, which we label S1 and S2. Leaving the data bit lines untouched, at the output of each node (the packet generator), we check to see if the result is either zero or +/- infinity. If this is the case, we do not touch the message to be transmitted - instead we only set S1 and S2 to the corresponding value as shown in Figure 5a, leaving the output buffers and thus the interconnect at the previous value. The node output port only changes two of the bits, so the switching activity is limited to those two bits. The scheme is applied across the network; routers which propagate the messages are responsible for also applying the scheme. Each router - in parallel with determining the destination PE (i.e. the routing operation) - looks at the two bits. In the case of an encoded message, the router
simply forwards the two bits to the destination, leaving the rest of the data bit lines unaffected. Figure 5b shows the router hardware implication.

We simulated the proposed scheme using the same procedure detailed in Section 3, and found that this optimization reduces the total power consumption from 34.8W to 30.36W, constituting a power saving of 12.75%.

4.2 Early Termination

The LDPC algorithm is based on the iterative convergence of a bit probability towards zero or one [1]. A particular property of the LDPC algorithm is the convergence pattern. Convergence patterns have been studied extensively, and there are several proposals that deal with theoretical limits and suggest particular ways to terminate [12]. The convergence patterns can be more deterministic due to the finite precision used in our hardware implementation, and can provide useful ground for optimization.

[Figure 6: LDPC Convergence patterns. Bit likelihood ratio (0 to 1, starting from the initial ratios) plotted against iteration count (1 to 17) for four cases: convergence to 1 (Cases 1 and 2) and convergence to 0 (Cases 1 and 2). The patterns separate after a number of iterations and settle at the final ratios.]

Figure 6 shows the convergence patterns for four different cases, where the initial probability leads to a convergence towards zero or towards one. After a set of initial iterations where the outcome is not determined (approximately 10), the convergence patterns start to separate either towards zero or towards one. This particular property can be exploited in our implementation, allowing a PE to produce its result sooner than it otherwise would. The iteration count up to which convergence cannot be determined depends on the amount of noise present in the communications channel.

A low noise channel results in the convergence patterns separating faster, whereas higher noise levels result in a larger iteration count. Determining the optimal iteration for a pattern to converge can be done by keeping track of the number of iterations for the previously computed bits. The hardware to do this operation is simple. For each virtual bit node mapped on a physical bit node, we need to detect a pattern whose outcome can be determined by looking at the previous ratios in that pattern. We do that by counting the number of incoming llr values which fall under (or over) a certain threshold llr. Our simulations have shown that when five consecutive incoming llr values are under 0.25 or over 0.75, we can set the convergence outcome immediately without the need to iterate further. When convergence has been determined, the bit node pipeline can be bypassed, as the bit node simply needs to send the determined bit value to all appropriate check nodes. Reducing the number of iterations (and hence computations) results in both savings in power consumption and increased throughput, with the only drawback being a slight (less than 1%) increase in the BER. Also, as the converged bit values fall within the values described in Section 4.1 (0, ±infinity), the message encoding scheme discussed in Section 4.1 applies, resulting in further savings. When we combine this technique with the message encoding scheme, there is a 30% reduction in system power consumption (24.32W) as compared to the non-optimized implementation of Section 3 (34.8W). The hardware overhead to achieve both optimizations is less than 5%, while the BER remains largely unaffected. Note that while the BER and system power figures are based on a specific channel with an SNR of 0.5, a particular H matrix, and a particular incoming message, the optimizations we discuss apply to all LDPC codes and incoming signals. Using the early termination approach, we can also reduce leakage power consumption by turning off the power supply to the computation pipelines of check nodes and bit nodes which are not computing.

5. Conclusions

The paper presents the design of a high throughput LDPC decoder architecture based on a scalable network-on-chip interconnect. The decoder maintains a low BER even through noisy channels, and has the ability to support multiple LDPC codes of varying types. The use of NoC design allows for a high performance scalable interconnect that offers dynamic communication in highly parallel architectures, and provides a convenient platform for LDPC decoding. As the decoding of other graph based codes, such as turbo codes, is also interconnect driven, we plan to extend our work into the design of other reconfigurable low-power decoders. By developing methods of increasing the efficiency of interconnect structures, such as those presented in this work, highly advanced error protection can be extended into a large number of application domains.

6. References

[1] R. G. Gallager, "Low-Density Parity-Check Codes", IEEE Transactions on Information Theory, Jan. 1962, pp. 21-28.
[2] D. MacKay and R. Neal, "Near Shannon limit performance of low density parity check codes", Electronics Letters, Vol. 33, No. 6, March 1997, pp. 457-458.
[3] E. Yeo, B. Nikolic and V. Anantharam, "Architectures and Implementations of Low-Density Parity-Check Decoding Algorithms", invited paper, IEEE IMSCS, Aug. 4-7, 2002.
[4] M. M. Mansour and N. R. Shanbhag, "Low-Power VLSI Decoder Architectures for LDPC Codes", Proc. of the 2002 ISLPED, pp. 284-289.
[5] L. W. Lee and A. Wu, "VLSI implementation for low density parity check decoder", Proc. of the IEEE ICECS 2001, Vol. 3, 2-5 Sept. 2001, pp. 1223-1226.
[6] C. Howland and A. Blanksby, "A 690mW 1Gb/s 1024-Bit Rate-1/2 Low Density Parity Check Code Decoder", Proc. of the 2001 IEEE CICC, San Diego, CA, May 2001, pp. 293-296.
[7] B. Levine et al., "Implementation of Near Shannon Limit Error-Correcting Codes using Reconfigurable Hardware", Proc. of the IEEE FCCM, 2000, pp. 217-226.
[8] T. Theocharides et al., "Evaluating Alternative Implementations for the LDPC Check Node Function", Proc. of the IEEE ISVLSI, February 2004.
[9] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm", IEEE Computer, Vol. 35, pp. 70-78, January 2002.
[10] R. M. Neal, "Software for Low Density Parity Check Codes", ftp://ftp.cs.utoronto.ca/pub/radford/LDPC-2001-11-18/index.html
[11] NOCsim network-on-chip simulator, http://www.ece.cmu.edu/~djw2/NOCsim/
[12] S. ten Brink, "Convergence Behavior of Iteratively Decoded Parallel Concatenated Codes", IEEE Transactions on Communications, Vol. 49, No. 10, October 2001, pp. 1727-1737.
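Appendix: Decoding Sketch

As a companion to the algorithm description in Section 2.1, the following Python sketch runs floating-point message passing on the bipartite graph of an H matrix. It is an illustrative reconstruction, not the paper's fixed-point hardware: the function names are ours, phi() is the standard sign-magnitude form of the check-node transform (realized in the hardware as a ROM-based LUT for the bilinear transform of Eq. 1), and we adopt the llr = ln(P(0)/P(1)) sign convention, so convergence towards positive values decodes to bit 0. The early-termination threshold of Section 4.2 is omitted.

```python
import math

def phi(x):
    # phi(x) = -ln(tanh(x/2)); self-inverse, applied to message magnitudes
    # (the magnitude part of the check node transform of Eq. 1).
    x = max(x, 1e-9)  # clamp to keep tanh/log well-defined near zero
    return -math.log(math.tanh(x / 2.0))

def decode(H, channel_llrs, max_iters=20):
    """Message-passing decoding on the bipartite graph of H.
    Convention (ours): llr = ln(P(bit=0)/P(bit=1)), so llr > 0 -> bit 0."""
    m, n = len(H), len(H[0])
    checks = [[j for j in range(n) if H[i][j]] for i in range(m)]  # bits per check
    bits = [[i for i in range(m) if H[i][j]] for j in range(n)]    # checks per bit
    # Bit-to-check messages start as the initial channel llrs (Section 2.1).
    b2c = {(i, j): channel_llrs[j] for i in range(m) for j in checks[i]}
    for _ in range(max_iters):
        # Check node: sign product and phi-sum over the *other* bits (cf. Eq. 1).
        c2b = {}
        for i in range(m):
            for j in checks[i]:
                others = [b2c[(i, k)] for k in checks[i] if k != j]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                c2b[(i, j)] = sign * phi(sum(phi(abs(v)) for v in others))
        # Bit node: channel llr plus all check messages but the current one (cf. Eq. 2).
        for i in range(m):
            for j in checks[i]:
                b2c[(i, j)] = channel_llrs[j] + sum(
                    c2b[(k, j)] for k in bits[j] if k != i)
    totals = [channel_llrs[j] + sum(c2b[(i, j)] for i in bits[j]) for j in range(n)]
    return [0 if t > 0 else 1 for t in totals]
```

For a toy H = [[1, 1, 0], [0, 1, 1]] (codewords 000 and 111), a noisy observation such as [2.0, -0.5, 1.5] is corrected to the all-zero codeword, since the two strong bits outvote the flipped middle one. The same routine applies unchanged to the example H matrix of Figure 1.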
