IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 43, NO. 9, SEPTEMBER 1995
Algebraic Survivor Memory Management Design for Viterbi Detectors
Gerhard Fettweis, Member, IEEE
Abstract—The problem of survivor memory management of a Viterbi detector is classically solved either by a register-exchange implementation, which has minimal latency but large hardware complexity and power consumption, or by a traceback scheme with small power consumption but larger latency. Here an algebraic formulation of the survivor memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. This allows for solutions to be designed with greatly reduced latency and/or complexity, as well as for achieving a tradeoff between latency and complexity. VLSI case studies of specific new solutions have shown that at minimal latency more than 50% savings are possible in hardware complexity as well as power consumption.
I. INTRODUCTION
DYNAMIC PROGRAMMING is a well-established approach for a large variety of problems concerning multistage decision processes [1]. One specific application of dynamic programming is the search for the best path through a graph of weighted branches. These branch weights in the following will be referred to as branch metrics. The path through the graph which is to be found is the one with the maximum (or minimum) cost, i.e., the maximum value of accumulated branch metrics. An example of such a graph is the trellis (the state-transition diagram) of a discrete-time finite state machine. The state sequence of the finite state machine marks a path through the trellis. If this path is to be estimated with the help of noisy measurements of the output of the finite state machine, and if this is solved by dynamic programming, then in communications this is called the "Viterbi algorithm" (VA) [2]. The VA was introduced in 1967 as a method to decode convolutional codes [3]. In the meantime the VA has found widespread applications in communications as, e.g., in digital transmission, magnetic recording, and speech recognition. A comprehensive tutorial on the VA is given in [4]. The VA can be divided into three functional units, the branch metric unit (BMU), the add-compare-select unit (ACSU), and the survivor memory unit (SMU). Whereas the BMU and ACSU perform arithmetic operations such as addition, multiplication, and maximum/minimum selection, the SMU has to trace the course of a path with the help of decision pointers
that were generated in the ACSU. Two basic methods for implementing the SMU are known, the register-exchange and the traceback SMU, of which the first has minimal latency but large hardware complexity, and the latter has a smaller hardware complexity but longer latency. The focus of this paper is on providing a novel algebraic framework for describing the survivor memory management problem. This enables the easy design of new SMU architectures, tailored to the desired latency/complexity optimization goal. Following, a brief introduction to the VA is given in Section II. Section III describes the survivor memory problem, and furthermore its algebraic formulation is introduced [5]. Based on this, the following two sections outline architectural alternatives, i.e., continuous-flow processing in Section IV, and block processing in Section V.
II. THE VITERBI ALGORITHM
Assume a discrete-time finite state machine with N states. Without loss of generality we assume that the transition diagram and the transition rate 1/T are constant in time. The trellis, which shows the transition dynamics, is a two-dimensional graph which is described in vertical direction by N states and in horizontal direction by time instants kT (T = 1). The states of time instant k are connected with those of time k+1 by the branches of time interval (k, k+1). Below we refer to a specific state i at time instant k as "node" s_{i,k}. A simple example of a trellis is given in Fig. 1(a) for N = 2 states. The notation used can be summarized as

N        number of states
k        time instant
s_{i,k}  node: ith possible state of time instant k
The finite state machine chooses a path through the trellis, and with the help of the observed state transitions (over a noisy channel) the branch metrics of time interval (k, k+1) are computed. The best path through the trellis is calculated recursively by the VA, where best can mean, e.g., the "most likely". This is done recursively by computing N paths, i.e., the optimum path to each of the N nodes of time k. The N new optimum paths of time k+1 are calculated with the help of the old paths of time k and the branch metrics of time step (k, k+1). This shall be explained for the simple trellis shown in Fig. 1(a). As indicated in Fig. 1(b), each of the optimum paths of time k, i.e., each node s_{i,k}, has a path metric γ_{i,k} which is the accumulation of its branch metrics. Now the new optimum path leading to

Paper approved by I. Treng, the Editor for VLSI in Communications of the IEEE Communications Society. Manuscript received June 9, 1992; revised May 20, 1993. This paper was presented in part at the Proceedings of the IEEE International Conference on Communications (ICC'92), Chicago, IL, June 1992.
The author is with the Dresden University of Technology, D-01062 Dresden, Germany.
IEEE Log Number 9413168.
0090-6778/95$04.00 © 1995 IEEE
Fig. 3. Example of traceback decision pointers for N = 4.
Fig. 1. Example of trellis, add-compare-select, and detected paths. (a) Trellis with N = 2 states. (b) Decoding the optimum path to node s_{1,k+1} at time k+1. The paths merge when they are traced back D time steps.
memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. VLSI case studies of specific new solutions have shown that more than 50% savings are possible in hardware complexity as well as power consumption.
III. THE SURVIVOR MEMORY UNIT
The unit of the VD which is of concern in this paper is the SMU. Generally, two basic methods have been proposed for solving the problem of processing the decisions made in the ACSU to receive the detected path: the register-exchange (RE) and the traceback (TB) SMU [6]. In case of an RE-SMU the new decisions of each iteration k are used to compute and store all N paths recursively, one to every state. Then the state of time k−D is simply determined by reading out the state of time k−D of one of the paths. In case of a TB-SMU the decisions are stored in a RAM, and then one path is traced back recursively D steps by using the stored decisions to determine the state of time k−D. At a first glance this might seem not to be well suited for VLSI, since at each time step one new decision is written to the RAM and D decisions are read during the traceback, making this a bottleneck for the iteration speed of the VD. However, by blockwise tracing back more than D steps at a time, a block of more than one state is determined per traceback. Combining this with multiple traceback pointers operating on multiple RAMs in parallel has allowed for the derivation of many efficient hardware solutions [7]–[9].
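The RE-SMU recursion described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's hardware description; the dictionary representation of paths and the 1-based state labels are assumptions made for readability.

```python
def register_exchange_step(paths, decision):
    """One RE-SMU update: the survivor path of every state i is rebuilt by
    copying the path of its predecessor decision[i] and appending i.
    After D such steps, the oldest entry of any path is the decoded state
    of time k - D."""
    return {i: paths[decision[i]] + [i] for i in decision}

# N = 2 states; every path register is copied/exchanged at each step
paths = {1: [1], 2: [2]}
paths = register_exchange_step(paths, {1: 1, 2: 1})  # both point to state 1
```

Note how both survivors now share the oldest entry 1: once all N paths agree on their oldest state, the decision is unique, which is exactly the merging after D steps exploited by both SMU types.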
Fig. 2. Block diagram of the Viterbi detector.
node s_{1,k+1} is the path with maximum metric leading to this node. Therefore the new path metric γ_{1,k+1} of node s_{1,k+1} is

γ_{1,k+1} = max(λ_{11,k} + γ_{1,k}, λ_{12,k} + γ_{2,k})
and the path metric of node s_{2,k+1} is computed in analogy. This is referred to as the add-compare-select (ACS) recursion of the VA. The problem which needs to be solved is to determine the best (unique) path with the help of the decisions of the ACS recursion. If all N paths are traced back in time then they merge into a unique path, and this is exactly the best one which is to be found. The number of time steps that have to be traced back for the paths to have merged with high probability is called the survivor depth D. Therefore, in a practical implementation of the VA the latency of decoding is at least D time steps. An implementation of the VA, referred to as Viterbi detector (VD), can be divided into three basic units, as shown in Fig. 2. The input data is used in the branch metric unit (BMU) to calculate the set of branch metrics for each new time step. These are then fed to the add-compare-select unit (ACSU) which accumulates the branch metrics recursively as path metrics according to the ACS recursion. The survivor memory unit (SMU) processes the decisions which are being made in the ACSU, and outputs the estimated path with a latency of at least D. The problem solved by the SMU can therefore be stated as: find the state of time k−D. This is classically solved either by a register-exchange implementation which has minimal latency, but large hardware complexity and power consumption, or by a traceback scheme with small power consumption, but larger latency. Here an algebraic formulation of the survivor
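The ACS recursion above can be sketched as a small function. The dictionary-based branch-metric table, the predecessor lists, and the state labels are assumptions for illustration, not the paper's notation.

```python
def acs_step(gamma, branch, predecessors):
    """One add-compare-select step: for every state i, add the branch
    metrics to the predecessor path metrics, compare the candidates,
    and select the maximum; the argmax is the decision pointer b(i)."""
    new_gamma, decision = {}, {}
    for i, preds in predecessors.items():
        cand = {j: gamma[j] + branch[(j, i)] for j in preds}  # add
        decision[i] = max(cand, key=cand.get)                 # compare/select
        new_gamma[i] = cand[decision[i]]
    return new_gamma, decision

# two-state trellis as in Fig. 1(a): both states reach both states
gamma = {1: 0.0, 2: 1.0}
branch = {(1, 1): 2.0, (2, 1): 0.0, (1, 2): 0.0, (2, 2): 2.0}
gamma, decision = acs_step(gamma, branch, {1: [1, 2], 2: [1, 2]})
```

The returned decision pointers are exactly what the SMU must process.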
A. The Trace-Back SMU
A more detailed description of the traceback scheme is as follows. At time k the current decision of state i points to its preceding state, for which we will use the notation b_k(i), with the value of b_k(i) ∈ {1, ..., N} pointing to the state preceding state i. Hence, a set of N pointers {b_k(1), ..., b_k(N)} makes up the decisions of time k. For ease of understanding see the example for N = 4 shown in Fig. 3. Now the traceback procedure works by starting at an arbitrary state b at time k. Its decision b_k(b) determines the preceding state of time k−1, and the decision of this state determines the state of time k−2, as b_{k−1}(b_k(b)), etc.; until by looking up D decisions in this traceback manner
b_{k−D+1}(··· b_{k−1}(b_k(b)) ···)   (1)

the state of time k−D is determined. As can be seen by the nature of this decision traceback, the usual way of implementation is by using multiplexers to pick the next decision pointer in the scheme. However, this traceback can also be formulated algebraically by introducing another notation for b_k(i). For b_k(i) = j the N-dimensional vector δ_k(i) is defined as the all-zero vector except for a 1 entry at the jth position:

δ_k(i) := (0, ..., 0, 1, 0, ..., 0),  with the 1 at the jth position for b_k(i) = j.   (2)

Fig. 4. Linear look-ahead pipeline-interleaving architecture (register exchange).
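Expression (1) can be sketched directly as nested pointer lookups. The list-of-dicts storage and the repetition of the Fig. 3 decisions over several time steps are assumptions made for illustration.

```python
def traceback(pointers, b, D):
    """Evaluate (1): look up D decisions, newest first, starting at state b.
    pointers[-1] holds the decisions b_k, pointers[-2] holds b_{k-1}, etc."""
    state = b
    for b_t in reversed(pointers[-D:]):
        state = b_t[state]  # one step further into the past
    return state

# decisions of Fig. 3 (N = 4), assumed identical over three time steps
b_k = {1: 2, 2: 1, 3: 2, 4: 3}
state = traceback([b_k, b_k, b_k], b=4, D=3)  # 4 -> 3 -> 2 -> 1
```

Each loop iteration is one multiplexer stage of the conventional implementation.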
Now the set of N decisions at time k form the square matrix

A_k := (δ_k(1), δ_k(2), ..., δ_k(N))^T,   (3)

i.e., row i of A_k is the pointer vector δ_k(i). Hence, if the starting state of the traceback b is written as a vector b [in analogy to (2)], then δ_k(b) can be written as

δ_k(b) = b · A_k.   (4)

Example: Assume a 4-state trellis where the decisions of time k are as shown in Fig. 3,

b_k(1) = 2,  b_k(2) = 1,  b_k(3) = 2,  b_k(4) = 3,

then

        ( 0 1 0 0 )
A_k =   ( 1 0 0 0 )
        ( 0 1 0 0 )
        ( 0 0 1 0 ).

If we now multiply this matrix by (0,0,1,0) this reads out the third row of A_k, i.e., it determines the preceding state of state 3 as b_k(3) = 2.

The significant result of the algebraic formulation is that the D-fold traceback can now be written as

b · A_k · A_{k−1} ··· A_{k−D+1}.   (5)

This is a D-fold vector-matrix product. It is to be noticed that this is just an algebraic formulation of the traceback procedure. Hence, conventionally used multiplexer architectures for traceback decoding of course can be applied here for the implementation of the vector-matrix multiplications. Furthermore, due to the simplicity of the matrix operations it is clear that this can also be done by simple gate logic. The most important aspect of (5) is that the multiplication operation is associative. Therefore it can be carried out not only from left to right, but also in an arbitrary order as, e.g., in a faster tree-like manner. In the following we shall now make use of this algebraic feature. For notational ease the shorthand notation

Π_{l=k−D+1}^{k} A_l := A_k · A_{k−1} ··· A_{k−D+1}

shall be introduced.

IV. PIPELINE INTERLEAVING LOOK-AHEAD ARCHITECTURES

Since the traceback decoding of the decisions principally has to take place at every new time instant, it is clear that the multiplication given in (5) is to be viewed upon as a sliding-window operation over the sequence {A_k}. Hence, at time k+1

b · A_{k+1} · A_k ··· A_{k−D+2}   (6)

has to be evaluated, and so on. It is to be noticed that, due to the fact that the associative law holds, the (D−1)-fold matrix-matrix multiplication of (5),

A_k · A_{k−1} ··· A_{k−D+1},   (7)

can be carried out first, and then the row of interest can be picked by applying b. The continuous "sliding window" computation of the expression (7) is analogous to the type of operation which is referred to as "pipeline interleaving look-ahead computation" for the parallelization of linear feedback loops¹ [10], [11]. Hence, all pipeline interleaving architectures known for look-ahead computation can be applied for the continuous (sliding) evaluation of (7).

A. The Register-Exchange SMU

The architecture known as "linear look-ahead" [10] for the sliding-window evaluation of (7) is shown in Fig. 4. In this case the current A_k is multiplied with D stored values in parallel, to obtain the following D results

A_k,  A_k · A_{k−1},  ...,  A_k · A_{k−1} ··· A_{k−D+1}.   (8)

As can be seen, the first element, A_k, indicates the preceding states of the N current paths. The next element, A_k · A_{k−1}, determines the state of two time steps back of every current path. By carrying this on, it can be seen that (8) yields exactly the state sequence of all N current paths of time k over the whole interval (k−D+1, k). Thus, it can easily be seen that the linear look-ahead architecture of Fig. 4 is the algebraic formulation of the RE-SMU.
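The one-hot decision matrices and the associativity that (5) and (7) rely on can be checked with a short script. The pure-Python matrix representation and the reuse of one decision matrix over all steps are illustrative assumptions.

```python
from functools import reduce

def onehot_matrix(b, N):
    """Decision matrix A_k of (3): row i is delta_k(i), i.e., all zeros
    except a 1 in column b[i] (states numbered 1..N)."""
    return [[1 if b[i + 1] == j + 1 else 0 for j in range(N)]
            for i in range(N)]

def matmul(A, B):
    """Plain 0/1 matrix product; products of one-hot-row matrices
    again have one-hot rows."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

b_k = {1: 2, 2: 1, 3: 2, 4: 3}       # decisions of Fig. 3
A = onehot_matrix(b_k, 4)
# (7): the left-to-right product equals any other association order
left = reduce(matmul, [A, A, A])
tree = matmul(A, matmul(A, A))
assert left == tree
# (5): picking row b of the product performs the whole traceback at once
row = left[4 - 1]                    # start the traceback in state b = 4
state = row.index(1) + 1             # decoded state of time k - 3
```

A tree-shaped evaluation order shortens the critical path of the product, which is precisely the freedom the look-ahead architectures of this section exploit.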
¹One other very important application of such a D-fold multiplication is the carry computation of a binary adder, for which different algorithms are known as, e.g., carry-ripple, carry-skip, carry-select, and carry-lookahead [10]. Now these architectures can all be transferred to derive SMU realizations.
TABLE I
COMPARISON OF THE RE-SMU, THE D-BLOCK TB-SMU, AND THE NEW SMU ARCHITECTURES OF FIG. 5(a) AND FIG. 6, IN TERMS OF TOTAL MEMORY (REGISTERS AND RAM), R/W OPERATIONS, VECTOR-MATRIX MULTIPLICATIONS, AND LATENCY
†Note, as mentioned in Section IV-B, these multiplications can be more complex, especially for large N.
algorithms and architectures that can be derived, two directions of further research shall be pointed out in the following. If a multiplier feedback loop is used for computing the L-fold products of decision matrices, then the coarse-grain sequence {Π_{k=nL}} can be used to perform tracebacks in the larger stepsize of L, to cut down the latency of the SMU. If expression (10) is examined more closely, it can be seen that it can also be written as the product of the factor (b · Π A_l) with a vector
Fig. 6. Single multiplier feedback loop followed by block-traceback.
(b · Π A_l) · (A_{k−D}, A_{k−D} · A_{k−D−1}, ..., A_{k−D} ··· A_{k−D−M+1}).   (12)
of time interval (nD, (n+1)D), and has given out m out of D decisions for the traceback. The total RAM size therefore is only M = D pointers, each of complexity N. It is to be noticed that a blockwise traceback always leads to giving out blocks of the detected path, which internally are in time-reversed order. This can be corrected by a second RAM of size D, which again is operating in the same block-by-block LIFO manner [8]. Hence, the total latency is D + M = 2D. The additional RAM size is only D state indexes. This results in a total size of both LIFO RAMs of D × (N + 1). Compared to the analogous conventional 2-pointer traceback schemes [7], [8], this amounts to at least 50% savings in hardware as well as latency. Of course this method can be generalized to the case where M is a divisor of D, D = f · M. Then a number of f multiplier feedback loops operate in parallel on the computation of {Π_{k=nM}}. In this case the total latency, comprised of traceback and of the time-ordering block-LIFO, is reduced from 2D to D + M, and the RAM size is reduced to D × N + M × 1 = D × (N + 1/f). In comparison to conventional traceback methods this new class of algorithms has a substantially reduced latency, RAM size, and traceback pointer logic, at the cost of one, or in general f ≥ 1, additional matrix multipliers. Since these multipliers operate sequentially on one single decision matrix at a time, their complexity is exactly that of one stage of a conventional RE-SMU.
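The latency and RAM figures quoted above can be sanity-checked numerically. The function name and the example parameter values are illustrative assumptions; the formulas follow the text.

```python
def smu_cost(D, N, f=1):
    """Latency and total LIFO-RAM size of the multiplier-feedback-loop
    traceback scheme, following the formulas in the text (M = D / f)."""
    M = D // f
    latency = D + M        # traceback latency plus time-ordering block-LIFO
    ram = D * N + M * 1    # D pointers of complexity N plus M state indexes
    return latency, ram

# f = 1: latency 2D and RAM D(N + 1), as stated for the scheme of Fig. 6
assert smu_cost(32, 4) == (64, 160)
# f = 4 parallel loops: latency D + D/f and RAM D(N + 1/f)
assert smu_cost(32, 4, f=4) == (40, 136)
```

The asserted values show the claimed trend: more parallel loops shrink both the latency and the re-ordering RAM.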
The contents of the vector in expression (12) is exactly what is computed by a register-exchange SMU of length M, see Section IV-A. Hence, combinations of register-exchange and traceback promise to yield further solutions of interest. In addition note that the algebraic formulation can also lead to simplified software implementations. For example, for an N = 2 state problem it can easily be seen that the logarithmic look-ahead RE-SMU of Fig. 5 can be much more efficient to implement than any other solution.

VI. DISCUSSION

Due to the variety of possible different technologies that may be used for implementing the architectures discussed in this paper, it is difficult to find an objective measure to compare them. To allow for some objective comparisons to be made, the total amount of memory must be divided into memory which can be realized by RAM and memory that must be realized by registers. In addition, the multiplications can be divided into vector-matrix and matrix-matrix multiplications, where the latter is N times as complex as the former since it comprises N vector-matrix multiplications. A basic measure of power consumption is the number of vector-matrix multiplications and the number of read and write (R/W) operations that are necessary. Therefore, for power consumption comparisons, the number of R/W operations must be added as a measure. Using these more detailed measures, the solutions which are compared in Table I are the RE-SMU, the TB-SMU with block traceback of block length D, and the new SMU architectures of Figs. 5(a) and 6. It can be seen that the algebraic formulation of the SMU problem allowed for an easy design of new architectures which are sample points in the large space of solutions with differing latency, memory complexity, and arithmetic complexity. The
B. Further Methods
The description of all possible algorithms and architectures for survivor memory management would by far exceed the scope of this paper. The intention here lies in showing that the algebraic notation provides a framework for finding a large variety of new solutions. To point out the large span of new
algebraic formulation enables solutions to be designed with greatly reduced latency and/or complexity, as well as it allows for achieving a tradeoff between latency, hardware complexity, and power consumption.

[7] R. Cypher and B. Shung, "Generalized trace back techniques for survivor memory management in the Viterbi algorithm," in IEEE GLOBECOM, San Diego, CA, Dec. 1990, vol. 2, pp. 1318–1322 (707A.1).
[8] G. Feygin and P. G. Gulak, "Survivor memory management in Viterbi decoders," IEEE Trans. Commun., vol. 39, 1991.
VII. CONCLUSION

In this paper an algebraic formulation of the survivor memory management of Viterbi detectors is introduced. This reveals the fact that the problem of survivor memory implementation is analogous to the realization of look-ahead in parallelized linear feedback loops. Hence, next to finding new solutions, a wide range of known solutions can be transferred and adapted from this well-known problem. They mainly present novel approaches for survivor memory realization. VLSI case studies of novel algorithms and architectures have shown that 50% savings in hardware and/or latency can be achieved. The algebraic formulation introduced here is related to the algebraic formulation of the add-compare-select recursion of the Viterbi detector, introduced in [14], [15]. Hence, it now is easy to derive well-matched survivor memory realizations also for all parallelized Viterbi detectors.
[9] T. K. Truong, M.-T. Shih, I. S. Reed, and E. H. Satorius, "A VLSI design for a trace-back Viterbi decoder," IEEE Trans. Commun., vol. 40, pp. 616–624, Mar. 1992.
[10] G. Fettweis, L. Thiele, and H. Meyr, "Algorithm transformations for unlimited parallelism," in Proc. IEEE Int. Symp. Circuits and Syst., New Orleans, LA, May 1990, vol. 2, pp. 1756–1759.
[11] K. K. Parhi and D. G. Messerschmitt, "Block digital filtering via incremental block-state structures," in Proc. IEEE Int. Symp. Circuits and Syst., Philadelphia, PA, 1987, pp. 645–648.
[12] ——, "Pipelined VLSI recursive filter architectures using scattered look-ahead and decomposition," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, New York, 1988, pp. 2120–2123.
[13] L. Thiele and G. Fettweis, "Algorithm transformations for unlimited parallelism," Electron. and Commun. (AEÜ), vol. 44, no. 2, pp. 83–91, Apr. 1990.
[14] G. Fettweis and H. Meyr, "High-speed Viterbi processor: A systolic array solution," IEEE J. Select. Areas Commun., vol. 8, pp. 1520–1534, Oct. 1990.
[15] ——, "High-speed parallel Viterbi decoding," IEEE Commun. Mag., pp. 46–55, May 1991.
REFERENCES
[1] R. E. Bellman and S. E. Dreyfus, Applied Dynamic Programming. Princeton, NJ: Princeton University Press, 1962.
[2] J. K. Omura, "On the Viterbi decoding algorithm," IEEE Trans. Inform. Theory, pp. 177–179, Jan. 1969.
[3] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. IT-13, pp. 260–269, Apr. 1967.
[4] G. D. Forney, Jr., "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268–278, Mar. 1973.
[5] G. Fettweis, "Algebraic survivor memory management for Viterbi detectors," in IEEE Int. Conf. Commun. (ICC'92), Chicago, IL, June 1992, pp. 313.4.1–313.4.5.
[6] C. M. Rader, "Memory management in a Viterbi decoder," IEEE Trans. Commun., vol. COM-29, pp. 1399–1401, Sept. 1981.
Gerhard Fettweis (S'84–M'90) received the Dipl.-Ing. and the Ph.D. degrees in electrical engineering from the Aachen University of Technology, Aachen, Germany, in 1986 and 1990, respectively. He is a scientist at TCSI Corporation, Berkeley, CA. In 1986 he worked at the ABB research laboratory, Baden, Switzerland, on his Diploma thesis. During 1991 he was a visiting scientist at the IBM Almaden Research Center, San Jose, CA. His interests are in microelectronics and digital wireless communications, especially the interaction between algorithm and architecture design for high-performance VLSI processor implementations. Dr. Fettweis is a member of the IEEE Solid-State Circuits Council as representative of the IEEE Communications Society, and is Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II.