Figure 1: H-Matrix – Bipartite Graph – Network on Chip (the parity check matrix H, its bipartite graph of bit and check nodes, and the corresponding network on chip)

An LDPC code is a linear message encoding technique, defined by two matrices: a generator matrix G and a very sparse parity check matrix H. The message to be sent is encoded using the G matrix. When it reaches its destination, it is decoded using the H matrix. In contrast to the encoder, the LDPC decoder is computationally intensive and consists of iterative computations derived from a message-passing bipartite graph (shown in Figure 1). Message passing iterations are performed by the two computation units - the bit node and the check node [2]. The LDPC algorithm is communication intensive, and the selection of the interconnect structure of the decoder architecture is therefore critical. LDPC codes offer excellent communication over noisy channels; hence, the design of LDPC decoders has been heavily studied.

Each check node receives a set of likelihood ratios (lr_i_old) from the various bit nodes it is associated with. Once the entire set arrives, the check node computes and sends new likelihood ratios (lr_i_new) to each bit node. Computing the operation in the log domain results in significant advantages [8]; hence the likelihood ratio now becomes a logarithmic likelihood ratio (llr). The check node is responsible for computing the llr for each bit participating in a check. The new llr depends on the llrs of the other bits participating in that check. The check node operation is shown in Eq. 1; details of the operation are described in [3].

  llr_i_new = B[ B(llr_0_old) + B(llr_1_old) + ... + B(llr_n_old) - B(llr_i_old) ]
  (Eq. 1 – Check Node Operation)

The function B(llr) is the logarithmic bilinear transform,

  B(llr_i_old) = ln( (e^llr_i_old + 1) / (e^llr_i_old - 1) ).

This function is implemented using a ROM based look-up table (LUT) for the check node computation, with an overall sign-magnitude data representation for the entire computation, based on our prior work described in [8].

This work was supported in part by grants from NSF CAREER 0093085, NSF 0130143, NSF 0202007 and MARCO/DARPA GSRC-PAS.

Proceedings of the 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design (VLSID'05)
1063-9667/05 $20.00 © 2005 IEEE
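As a concrete illustration of Eq. 1, the following is a minimal floating-point reference sketch of the check node update. This is only a model under stated assumptions: the actual hardware uses a ROM-based LUT with 16-bit sign-magnitude fixed-point arithmetic, and the explicit sign handling shown (output sign equals the product of the other bits' signs) is the standard sign-magnitude convention, not taken verbatim from the paper. Note that B is its own inverse, which is why the same function appears on both sides of Eq. 1.

```python
import math

def B(x):
    # Logarithmic bilinear transform, B(x) = ln((e^x + 1) / (e^x - 1)).
    # Defined for x > 0; it is an involution, so B(B(x)) == x.
    return math.log((math.exp(x) + 1.0) / (math.exp(x) - 1.0))

def check_node_update(llrs_old):
    """Eq. 1 for every bit of one check: sum B(|llr_j|) over all bits,
    subtract the bit's own term, and apply B once more to the result.
    Signs are processed separately (sign-magnitude convention)."""
    signs = [1 if v >= 0 else -1 for v in llrs_old]
    mags = [B(abs(v)) for v in llrs_old]   # assumes no llr is exactly 0
    total = sum(mags)
    sign_prod = 1
    for s in signs:
        sign_prod *= s
    # Output sign for bit i is the product of the *other* signs, i.e. sign_prod * s_i.
    return [(sign_prod * s) * B(total - m) for s, m in zip(signs, mags)]
```

For instance, with three equal inputs of +2.0 the update returns about +1.33 for each bit, and flipping the sign of one input flips the output signs of the other two bits, as parity requires.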
The check node computes the new llrs for all bits participating in each check, and sends each result back to the originating bit node. The bit node, in turn, accumulates the new llrs it receives from all checks that the bit participates in, excluding the current bit, until the llr converges towards either negative or positive infinity. Convergence to positive or negative infinity translates to the decoded bit being either 0 or 1, respectively. The operation (in the log domain) is outlined in Eq. 2:

  llr_i_new = stored_llr_i + llr_0 + llr_1 + ... + llr_n - llr_i
  (Eq. 2 – Bit Node Operation)

The term stored_llr describes the previously stored logarithmic likelihood ratio for that bit, or the initial logarithmic likelihood ratio if this is the first iteration. The rest are the results from each check node that the bit participates in. The new llr is stored in the bit node and also sent back to the participating checks to complete the computation. The bit node is responsible for controlling the entire operation on each bit, as it controls the number of iterations necessary for convergence.

A critical design parameter which affects both the performance and the power consumption of the decoder is the bit width of the data word used in the computation. A large data word results in a lower BER even in noisy channels with an SNR of 0.5 dB, but at the cost of increased power consumption. Power consumption is therefore dependent on the amount of noise expected in the channel, as achieving a low BER requires both a more precise number representation and an increased number of iterations of the algorithm to reach convergence. Note that the power optimizations we present here can be applied to all designs, regardless of bit width. Our test case uses a 16-bit signed-magnitude fixed-point representation.

2.2 Network on Chip (NoC) Architecture

The decoder is implemented as an on-chip network, where the bit and check nodes act as processing elements (PEs), which communicate via on-chip network routers, as shown in Figure 1. In the example shown, the network consists of 25 nodes, 16 bits and 9 checks, communicating in a 2D mesh topology. By providing configuration memory to each PE, we can map more than one node on a single PE. In the case that there are virtual nodes that cannot all be mapped at once on a single chip, we use the serial approach outlined in [3].

The architecture is initialized to reflect the mapping of the nodes of the bipartite graph onto the physical nodes. For the given LDPC code, the H matrix is used to generate the configuration data, which includes the number of virtual nodes mapped onto a PE and the connectivity information capturing the communication pattern between the different nodes. Each PE has a dedicated memory to store this configuration information. The configuration information is routed from the I/O to the individual PEs using packet-based communication.

After initialization, computation enters the decoding stage, where data comes into the network in the form of logarithmic likelihood ratios. The bit nodes then load the incoming data into their memories and initialize the computation, which develops into the iterative message passing between check and bit nodes. When the specified number of iterations has completed (and hence the values of individual bits have converged), the bit node outputs the bit value to the I/O. The on-chip communication is detailed in the next subsection.

2.3 On-Chip Communication

Inter-PE communication is handled by an on-chip network consisting of a number of small on-chip routers. In our implementation of the network, a simple 2D mesh topology is used, with deterministic XY routing as the routing algorithm. As losing a single data value is catastrophic to the LDPC computation, our network sends on/off backpressure information, preventing packet drops. This inability to drop packets removes the need for complicated ACK/NACK based protocols, and reduces storage and energy requirements throughout the system. Each packet contains not only a physical destination address in the network (which requires 5 bits in the 25-node example network), but also 6 bits of virtual identification information that identifies which of the 64 virtual nodes in a physical node the packet is destined for. In addition, the packet contains virtual identification information that identifies from which particular partnered virtual node the packet was sent, allowing return address information to be extracted. As we assume that a given node will never communicate with more than 16 partners, this requires an additional 4 bits. Finally, a single bit is reserved to mark configuration packets, for a total header size of 16 bits per packet.

Each packet consists of a single 48-bit phit, and as such, there is no distinction between wormhole and packet switching. The routers themselves are not pipelined, and as such, a full packet of data moves one hop per clock cycle. The distance the packets must move across the network is minimized by intelligently mapping the bipartite graph onto the physical network, resulting in an average network distance half of that obtained with random placement.

In a direct implementation of LDPC decoding on the NoC, as the analog signal arrives and is converted into llr values after ADC conversion, the llr values are grouped into message blocks. The llr values from each message block are then packetized and sent to the bit nodes. Each output value during the decoding process is also contained in its own unique packet, for a one-to-one ratio of header bits to data bits. In our architecture, two blocks are decoded in parallel, allowing 66% of the network traffic to be useful data. Both message blocks are decoded using the same H-matrix (and thus the same communication pattern), allowing for increased device throughput at the cost of a slightly increased decode latency for every other packet. This behavior is shown in Figure 2 below.

Figure 2: Message Decoding Behavior (two incoming message streams are ADC-converted, packetized, and sent to the bit nodes, where messages 1 and 2 are decoded in interleaved fashion using the same H matrix)

2.4 Bit Node PE Architecture

Figure 3: Bit Node Architecture (a four-stage pipeline: packet decoder (PD), data concentrator (DC) with configuration memory (CM), execution unit (EU), and header flow table (HFT) with packet generator (PG))

The bit node architecture is shown in detail in Figure 3. The interface to the network consists of a single 48-bit input port and a 48-bit output port, each capable of transmitting an entire packet per clock cycle. As each packet arrives, the data concentrator directs values to the appropriate accumulator, based on the virtual identification information in the packet header. Once all input values for a given virtual bit node are received, the computation proceeds to the execution unit, where the individual llr values are
subtracted from the accumulated sum, as in Equation 2. Finally, the source information contained in the header is used to index the header flow table, generating the header for the outgoing packet containing the new llr. Note the parallel 16-bit datapath throughout the bit node, supporting the lockstep computation of two messages simultaneously.

2.5 Check Node PE Architecture

The check node architecture is shown in Figure 4. The check node has to compute the new llrs based on Equation 1. The data flow through the check node is similar to that of the bit node, with the most obvious difference being the addition of four computational units (CUs), which perform the bilinear transform in the logarithmic domain. The first two CUs, located just as the packet arrives, perform the operation on each individual incoming data value, allowing the accumulators in stage four to perform the summation Σ B(llr_i). After all values have been received for a given virtual check node, the execution unit subtracts away the individual B(llr) values and performs the bilinear transform in the logarithmic domain once more, completing the operations found in Equation 1. Since each output llr_x is generated in the same order as the packet associated with x arrived, the stored headers are scrolled out and used as inputs to the header flow table, allowing each packet to be addressed properly to return to the appropriate bit node. Again, note that the node supports the simultaneous decoding of two independent message blocks, allowing for lockstep operation.

Figure 4: Check Node Architecture (an eight-stage pipeline: packet decoder, data concentrator, computational units CU1–CU4, data and header FIFOs, execution unit, header flow table, and packet generator)

…to obtain the bit error rates. This simulation also provides us with the number of cycles required for completing the decoding. When combined with the clock frequency of our synthesized designs, the proposed architecture is determined to provide a decoded bit throughput of 750 Mbps when a maximum number of iterations is not set (instead, the bit nodes had to converge to positive or negative infinity). This is, however, a pessimistic approach; setting a maximum iteration count and determining the bit value from the ratio at the point where the maximum iterations are completed gives a higher throughput. The number of iterations is related to the accuracy of the computation, as convergence is not guaranteed with a lower number of iterations. For low-noise channels, it is desirable to limit the number of iterations, as doing so results in a higher throughput, with the BER not as affected as in high-noise channels.

Next, we used the binary data patterns captured from the cycle-accurate NOCSim simulation as input patterns to our synthesized design and simulated it using Power Compiler to obtain the power consumption values. For the irregular LDPC code, the average power consumption in decoding 1024 bits is 34.8 Watts, a significant portion (43%) of which is consumed in the on-chip interconnect. The check nodes (23%) also consume more power than the bit nodes (22%), as expected given their larger pipeline length and computation complexity. Approximately 12% of the chip power is consumed as leakage.

4. Power Optimization

4.1 Power Driven Message Encoding

Represented Value | Binary data to be transmitted (Bits 0-15) | Transmitted Data (Bits S1-S2 / Bits 0-15)
Zero              | 0000000000000000                          | 00 / <previous data value>
- infinity        | 1111111111111111                          | 11 / <previous data value>

Figure 5a: Message Encoding Scheme.
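The idea behind the encoding of Figure 5a can be sketched as follows: frequently occurring special values (zero and - infinity in the recoverable table rows) are signalled on the two sideband bits S1-S2, while the 16 data bits simply hold the previous value so that the bus lines do not toggle, saving switching power. This is a hypothetical reference model: only the codes 00 (zero) and 11 (- infinity) appear in the source table, and the code 01 used here for ordinary data is an assumption.

```python
ZERO = 0x0000
NEG_INF = 0xFFFF  # 16-bit sign-magnitude: sign bit set, maximum magnitude

def encode_stream(values):
    """Return (s_bits, data_bits) pairs per transfer; on special values
    the data bus is frozen at its previous value so no lines toggle."""
    prev = 0
    out = []
    for v in values:
        if v == ZERO:
            out.append((0b00, prev))   # code from Figure 5a
        elif v == NEG_INF:
            out.append((0b11, prev))   # code from Figure 5a
        else:
            out.append((0b01, v))      # assumed code for ordinary data
            prev = v
    return out

def toggles(words):
    """Count toggled data-bus lines across consecutive transfers."""
    t, prev = 0, 0
    for w in words:
        t += bin(prev ^ w).count("1")
        prev = w
    return t
```

Comparing `toggles([0x1234, 0x0000, 0x1234])` on the raw stream against the data words produced by `encode_stream` shows the reduction in bus activity for a stream that revisits the zero value.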