
Algebraic Survivor Memory Management Design for Viterbi Detectors

Gerhard Fettweis, Member, IEEE

Abstract—The problem of survivor memory management of a Viterbi detector is classically solved either by a register-exchange implementation, which has minimal latency but large hardware complexity and power consumption, or by a trace-back scheme with small power consumption but larger latency. Here an algebraic formulation of the survivor memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. This allows for solutions to be designed with greatly reduced latency and/or complexity, as well as for achieving a tradeoff between latency and complexity. VLSI case studies of specific new solutions have shown that at minimal latency more than 50% savings are possible in hardware complexity as well as power consumption.

Paper approved by I. Treng, the Editor for VLSI in Communications of the IEEE Communications Society. Manuscript received June 9, 1992; revised May 20, 1993. This paper was presented in part at the IEEE International Conference on Communications (ICC'92), Chicago, IL, June 1992.
The author is with the Dresden University of Technology, D-01062 Dresden, Germany.
IEEE Log Number 9413168.
0090-6778/95$04.00 © 1995 IEEE

I. INTRODUCTION

DYNAMIC PROGRAMMING is a well-established approach for a large variety of problems concerning multistage decision processes [1]. One specific application of dynamic programming is the search for the best path through a graph of weighted branches. These branch weights will be referred to in the following as branch metrics. The path through the graph which is to be found is the one with the maximum (or minimum) cost, i.e., the maximum value of accumulated branch metrics. An example of such a graph is the trellis (the state transition diagram) of a discrete-time finite state machine. The state sequence of the finite state machine marks a path through the trellis. If this path is to be estimated with the help of noisy measurements of the output of the finite state machine, and if this is solved by dynamic programming, then in communications this is called the "Viterbi algorithm" (VA) [2]. The VA was introduced in 1967 as a method to decode convolutional codes [3]. In the meantime the VA has found widespread applications in communications, e.g., in digital transmission, magnetic recording, and speech recognition. A comprehensive tutorial on the VA is given in [4].

The VA can be divided into three functional units: the branch metric unit (BMU), the add-compare-select unit (ACSU), and the survivor memory unit (SMU). Whereas the BMU and ACSU perform arithmetic operations such as addition, multiplication, and maximum/minimum selection, the SMU has to trace the course of a path with the help of decision pointers that were generated in the ACSU. Two basic methods for implementing the SMU are known, the register-exchange and the trace-back SMU, of which the first has minimal latency but large hardware complexity, and the latter has a smaller hardware complexity but longer latency. The focus of this paper is on providing a novel algebraic framework for describing the survivor memory management problem. This enables the easy design of new SMU architectures, tailored to the desired latency/complexity optimization goal.

In the following, a brief introduction to the VA is given in Section II. Section III describes the survivor memory problem, and furthermore its algebraic formulation is introduced [5]. Based on this, the following two sections outline architectural alternatives, i.e., continuous-flow processing in Section IV, and block processing in Section V.

II. THE VITERBI ALGORITHM

Assume a discrete-time finite state machine with N states. Without loss of generality we assume that the transition diagram and the transition rate 1/T are constant in time. The trellis, which shows the transition dynamics, is a two-dimensional graph which is described in vertical direction by N states and in horizontal direction by time instants kT (T = 1). The states of time instant k are connected with those of time k+1 by the branches of time interval (k, k+1). Below we refer to a specific state i at time instant k as "node" s_{i,k}. A simple example of a trellis is given in Fig. 1(a) for N = 2 states. The notation used can be summarized as

    N        number of states
    k        time
    s_{i,k}  node: ith possible state of time instant k

The finite-state machine chooses a path through the trellis, and with the help of the observed state transitions (over a noisy channel) the branch metrics λ_{ij,k} of time interval (k, k+1) are computed.

The best path through the trellis is calculated recursively by the VA, where "best" can mean, e.g., the most likely. This is done recursively by computing N paths, i.e., the optimum path to each of the N nodes of time k. The N new optimum paths of time k+1 are calculated with the help of the old paths and the branch metrics of time step (k, k+1). This shall be explained for the simple trellis shown in Fig. 1(a). As indicated in Fig. 1(b), each of the optimum paths of time k, i.e., each node s_{i,k}, has a path metric γ_{i,k} which is the accumulation of its branch metrics. Now the new optimum path leading to

node s_{1,k+1} is the path with maximum metric leading to this node. Therefore the new path metric γ_{1,k+1} of node s_{1,k+1} is

    γ_{1,k+1} = max( λ_{11,k} + γ_{1,k},  λ_{12,k} + γ_{2,k} )              (1)

and the path metric of node s_{2,k+1} is computed in analogy. This is referred to as the add-compare-select (ACS) recursion of the VA.

The problem which needs to be solved is to determine the best (unique) path with the help of the decisions of the ACS-recursion. If all N paths are traced back in time then they merge into a unique path, and this is exactly the best one which is to be found. The number of time steps that have to be traced back for the paths to have merged with high probability is called the survivor depth D. Therefore, in a practical implementation of the VA the latency of decoding is at least D time steps.

Fig. 1. Example of trellis, add-compare-select, and detected paths. (a) Trellis with N = 2 states. (b) Decoding the optimum path to node s_{1,k+1} at time k+1. The paths merge when they are traced back D time steps.

An implementation of the VA, referred to as a Viterbi detector (VD), can be divided into three basic units, as shown in Fig. 2. The input data is used in the branch metric unit (BMU) to calculate the set of branch metrics {λ_{ij,k}} for each new time step. These are then fed to the add-compare-select unit (ACSU), which accumulates the branch metrics recursively as path metrics according to the ACS-recursion. The survivor memory unit (SMU) processes the decisions which are being made in the ACSU, and outputs the estimated path with a latency of at least D.

Fig. 2. Block diagram of the Viterbi detector.

The problem solved by the SMU can therefore be stated as: find the state of time k − D. This is classically solved either by a register-exchange implementation which has minimal latency, but large hardware complexity and power consumption, or by a trace-back scheme with small power consumption, but larger latency. Here an algebraic formulation of the survivor memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. VLSI case studies of specific new solutions have shown that more than 50% savings are possible in hardware complexity as well as power consumption.

III. THE SURVIVOR MEMORY UNIT

The unit of the VD which is of concern in this paper is the SMU. Generally, two basic methods have been proposed for solving the problem of processing the decisions made in the ACSU to receive the detected path: the register-exchange (RE) and the trace-back (TB) SMU [6].

In case of an RE-SMU the new decisions of each iteration k are used to compute and store all N paths recursively, one to every state. Then the state of time k − D is simply determined by reading out the state of time k − D of one of the paths.

In case of a TB-SMU the decisions are stored in a RAM, and then one path is traced back recursively D steps by using the stored decisions to determine the state of time k − D. At first glance this might seem not to be well suited for VLSI, since at each time step one new decision is written to the RAM and D decisions are read during the trace-back, making this a bottleneck for the iteration speed of the VD. However, by block-wise tracing back more than D steps at a time, a block of more than one state is determined per trace-back. Combining this with multiple trace-back pointers operating on multiple RAMs in parallel has allowed for the derivation of many efficient hardware solutions [7]–[9].

A. The Trace-Back SMU

A more detailed description of the trace-back scheme is as follows. At time k the current decision of state i points to its preceding state, for which we will use the notation b_k(i), with the value of b_k(i) ∈ {1, ..., N} pointing to the state preceding state i. Hence, a set of N pointers {b_k(1), ..., b_k(N)} makes up the decisions of time k. For ease of understanding see the example for N = 4 shown in Fig. 3.

Fig. 3. Example of trace-back decision pointers for N = 4.

Now the trace-back procedure works by starting at an arbitrary state b at time k. Its decision b_k(b) determines the preceding state of time k − 1, and the decision of this state determines the state of time k − 2, as b_{k−1}(b_k(b)), etc., until by looking up D decisions in this trace-back manner

    b_{k−D+1}( ··· b_{k−1}( b_k(b) ) ··· )

the state of time k − D is determined.

As can be seen from the nature of this decision trace-back, the usual way of implementation is by using multiplexers to

pick the next decision pointer in the scheme. However, this trace-back can also be formulated algebraically by introducing another notation for b_k(i).

For b_k(i) = j the N-dimensional vector b̄_k(i) is defined as the all-zero vector except for a 1 entry at the jth position

    b̄_k(i) := (0, ..., 0, 1, 0, ..., 0)                                    (2)
                        ↑
                   jth position

Now the set of N decisions at time k forms the square matrix

    A_k := ( b̄_k(1), b̄_k(2), ..., b̄_k(N) )^T                              (3)

i.e., the ith row of A_k is b̄_k(i). Hence, if the starting state of the trace-back b is written as a vector b̄ [in analogy to (2)], then b_k(b) can be written as

    b̄ · A_k .                                                             (4)

Example: Assume a 4-state trellis where the decisions of time k are as shown in Fig. 3,

    b_k(1) = 2,  b_k(2) = 1,  b_k(3) = 2,  b_k(4) = 3

so that

          ( 0 1 0 0 )
    A_k = ( 1 0 0 0 )
          ( 0 1 0 0 )
          ( 0 0 1 0 )

If we now multiply this matrix by (0, 0, 1, 0), this reads out the third row of A_k, i.e., it determines the preceding state of state 3 as b_k(3) = 2.

For notational ease the short-hand notation

    a_{k·D} := A_k · A_{k−1} ··· A_{k−D+1}

shall be introduced. The significant result of the algebraic formulation is that the D-fold trace-back can now be written as

    b̄ · A_k · A_{k−1} ··· A_{k−D+1} = b̄ · a_{k·D} .                        (5)

This is a D-fold vector-matrix product. It is to be noticed that this is just an algebraic formulation of the trace-back procedure. Hence, conventionally used multiplexer architectures for trace-back decoding can of course be applied here for the implementation of the vector-matrix multiplications. Furthermore, due to the simplicity of the matrix operations it is clear that this can also be done by simple gate logic.

The most important aspect of (5) is that the multiplication operation is associative. Therefore it can be carried out not only from left to right, but also in an arbitrary order, e.g., in a faster tree-like manner. In the following we shall now make use of this algebraic feature.

Since the trace-back decoding of the decisions principally has to take place at every new time instant, it is clear that the multiplication given in (5) is to be viewed as a sliding-window operation over the sequence {A_k}. Hence, at time k + 1

    b̄ · A_{k+1} · A_k ··· A_{k−D+2}                                        (6)

has to be evaluated, and so on. It is to be noticed that, due to the fact that the associative law holds, the (D − 1)-fold matrix-matrix multiplication of (5)

    A_k · A_{k−1} ··· A_{k−D+1}                                            (7)

can be carried out first, and then the row of interest can be picked by applying b̄.

The continuous "sliding window" computation of the expression (7) is analogous to the type of operation which is referred to as "pipeline interleaving look-ahead computation" for the parallelization of linear feedback loops¹ [10], [11]. Hence, all pipeline interleaving architectures known for look-ahead computation can be applied for the continuous (sliding) evaluation of (7).

¹One other very important application of such a D-fold multiplication is the carry computation of a binary adder, for which different algorithms are known, e.g., carry-ripple, carry-skip, carry-select, and carry-look-ahead [10]. These architectures can all be transferred to derive SMU realizations.

IV. CONTINUOUS-FLOW PROCESSING

A. The Register-Exchange SMU

The architecture known as "linear look-ahead" [10] for the sliding-window evaluation of (7) is shown in Fig. 4. In this case the current A_k is multiplied with D stored values in parallel, to obtain the following D results

    ( A_k,  A_k·A_{k−1},  ...,  A_k·A_{k−1} ··· A_{k−D+1} ) .              (8)

As can be seen, the first element, A_k, indicates the preceding states of the N current paths. The next element, A_k·A_{k−1}, determines the state of two time steps back of every current path. By carrying this on, it can be seen that (8) yields exactly the state sequence of all N current paths of time k over the whole interval (k − D + 1, k). Thus, it can easily be seen that the linear look-ahead architecture of Fig. 4 is the algebraic formulation of the RE-SMU.

Fig. 4. Linear look-ahead pipeline-interleaving architecture (register exchange).
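To make the algebra of (2)–(5) concrete, the following Python sketch (ours, not from the paper; states are indexed 0 to N−1 and the decision pointers are drawn at random) checks that the D-fold vector-matrix product of (5) returns the same state as conventional multiplexer-style pointer chasing:

```python
import random

def trace_back(b, decisions):
    # Conventional multiplexer-style trace-back: chase D decision pointers,
    # newest decision set first, returning the state of time k - D.
    s = b
    for b_t in decisions:
        s = b_t[s]
    return s

def as_matrix(b_t, N):
    # A_k of (3): row i is the unit vector of (2) with its 1 at position b_t[i].
    return [[1 if b_t[i] == j else 0 for j in range(N)] for i in range(N)]

def vec_mat(v, A):
    # Vector-matrix product; for a unit vector this simply selects a row of A.
    return [sum(v[i] * A[i][j] for i in range(len(A))) for j in range(len(A[0]))]

N, D = 4, 8
random.seed(1)
# decisions[0] is the newest pointer set b_k, decisions[1] is b_{k-1}, and so on.
decisions = [[random.randrange(N) for _ in range(N)] for _ in range(D)]

b = 2                                         # arbitrary starting state at time k
v = [1 if i == b else 0 for i in range(N)]    # b written as a unit vector, cf. (2)
for b_t in decisions:                         # the D-fold product of (5)
    v = vec_mat(v, as_matrix(b_t, N))

assert v.index(1) == trace_back(b, decisions)  # both yield the state of time k - D
```

Since each A_k has exactly one 1 per row, the running vector v stays a unit vector throughout, which is why the product can be realized by simple multiplexers or gate logic.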
IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 43, NO. 9, SEPTEMBER 1995
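The linear look-ahead evaluation of (8) described in Section IV-A can likewise be sketched in Python (our illustration; the class name and 0-based state indexing are assumptions): at every time step the newest decision matrix is multiplied into all stored partial products, which is exactly the sliding-window refresh of Fig. 4 and hence a register-exchange SMU.

```python
import random

def mat_mul(A, B):
    # 0/1 matrix product; the product of pointer matrices is again a pointer matrix.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

class LinearLookaheadSMU:
    # Keeps the D partial products (A_k, A_k A_{k-1}, ..., A_k ... A_{k-D+1}) of (8).
    def __init__(self, D):
        self.D = D
        self.products = []

    def step(self, A_new):
        # Sliding-window refresh: left-multiply the newest decision matrix into
        # every stored product, prepend it alone, and drop the oldest product.
        self.products = ([A_new] + [mat_mul(A_new, P) for P in self.products])[:self.D]

    def survivor(self, state):
        # Row `state` of the deepest product is a unit vector marking the state
        # of time k - D on the survivor path of `state` (valid after D steps).
        return self.products[-1][state].index(1)

random.seed(7)
N, D = 4, 6
smu = LinearLookaheadSMU(D)
history = []                                   # pointer sets, oldest first
for _ in range(D):
    b_t = [random.randrange(N) for _ in range(N)]
    history.append(b_t)
    smu.step([[1 if b_t[i] == j else 0 for j in range(N)] for i in range(N)])

s = 0                                          # cross-check by direct pointer chasing
for b_t in reversed(history):                  # newest decision first
    s = b_t[s]
assert smu.survivor(0) == s
```

The D stored products correspond to the D register stages of an RE-SMU; each `step` touches all of them, which reflects the large register and power cost listed for the RE-SMU in Table I.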

TABLE I
COMPARISON OF SMU ARCHITECTURES

    architecture         total memory   registers   RAM        R/W          vector-matrix mult.   latency
    RE-SMU               D·N            D·N         —          D·N          D·N                   D
    example Fig. 5(a)    D·N            2·N         (D−2)·N    log2(D)·N†   log2(D)·N†            D
    D-block TB-SMU       D·(4N+1)       —           D·(4N+1)   N+4          2                     4D
    example Fig. 6       D·(N+1)        —           D·(N+1)    N+3          N+1                   2D

†Note, as mentioned in Section IV-B, these multiplications can be more complex, especially for large N.
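The block trace-back with LIFO time reversal that underlies the TB-SMU rows of Table I can be illustrated as follows (a Python sketch of ours; the function name and 0-based states are assumptions, and a plain list stands in for the second LIFO RAM):

```python
import random

def block_traceback(block, start_state):
    # Trace back through one block of D pointer sets (stored oldest..newest).
    # The decoded states appear newest-first, i.e. in time-reversed order; the
    # stack plays the role of the block-by-block LIFO RAM that restores forward
    # time order before the block is given out.
    s = start_state
    stack = []
    for b_t in reversed(block):
        s = b_t[s]
        stack.append(s)                               # states of times k-1 down to k-D
    return [stack.pop() for _ in range(len(stack))]   # oldest state first

random.seed(3)
N, D = 4, 8
block = [[random.randrange(N) for _ in range(N)] for _ in range(D)]
path = block_traceback(block, start_state=0)   # one decoded block of D states
```

One trace-back pass decodes a whole block of D states, which is why the RAM is read only once per output block rather than D times per output symbol.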

Fig. 6. Single-multiplier feedback loop for starting pointer at k = nD, followed by block trace-back.

of time interval (nD, (n+1)D), and has given out m out of D decisions for the trace-back. The total RAM-size therefore is only M = D pointers, each of complexity N.

It is to be noticed that a block-wise trace-back always leads to giving out blocks of the detected path, which internally are in time-reversed order. This can be corrected by a second RAM of size D, which again is operating in the same block-by-block LIFO manner [8]. Hence, the total latency is D + M = 2D. The additional RAM-size is only D state indexes. This results in a total size of both LIFO RAMs of D × (N + 1). Compared to the analogous conventional 2-pointer trace-back schemes [7], [8], this amounts to at least 50% savings in hardware as well as latency.

Of course this method can be generalized to the case where M is a divisor of D, D = f·M. Then a number of f multiplier feedback loops operate in parallel on the computation of {a_{nM·M}}. In this case the total latency comprised of trace-back and of the time-ordering block-LIFO is reduced from 2D to (1 + 1/f)·D, and the RAM size is reduced to D × N + M × 1 = D × (N + 1/f).

In comparison to conventional trace-back methods this new class of algorithms has a substantially reduced latency, RAM-size, and trace-back pointer logic, at the cost of one, or in general f ≥ 1, additional matrix multipliers. Since these multipliers operate sequentially on one single decision matrix at a time, their complexity is exactly that of one stage of a conventional RE-SMU.

B. Further Methods

The description of all possible algorithms and architectures for survivor memory management would by far exceed the scope of this paper. The intention here lies in showing that the algebraic notation provides a framework for finding a large variety of new solutions. To point out the large span of new algorithms and architectures that can be derived, two directions of further research shall be pointed out in the following.

If a feedback loop is used for computing

    a_{k·L} = ∏_{l=k−L+1}^{k} A_l ,

then the coarse-grain sequence {a_{nL·L}} can be used to perform trace-backs in the larger step-size of L, to cut down the latency of the SMU.

If expression (10) is examined more closely, it can be seen that it can also be written as the product of the factor

    ( b̄ · ∏_{l=k−D+1}^{k} A_l )

with the vector

    ( a_{k−D},  a_{k−D}·a_{k−D−1},  ...,  a_{k−D} ··· a_{k−D−M+1} ) .      (12)

The contents of the vector in expression (12) is exactly what is computed by a register-exchange SMU of length M, see Section IV-A. Hence, combinations of register-exchange and trace-back promise to yield further solutions of interest.

In addition, note that the algebraic formulation can also lead to simplified software implementations. For example, for an N = 2 state problem it can easily be seen that the logarithmic look-ahead RE-SMU of Fig. 5 can be much more efficient to implement than any other solution.

VI. DISCUSSION

Due to the variety of possible different technologies that may be used for implementing the architectures discussed in this paper, it is difficult to find an objective measure to compare them. To allow for some objective comparisons to be made, the total amount of memory must be divided into memory which can be realized by RAM and memory that must be realized by registers. In addition, the multiplications can be divided into vector-matrix and matrix-matrix multiplications, where the latter is N times as complex as the former since it comprises N vector-matrix multiplications.

A basic measure of power consumption is the number of vector-matrix multiplications and the number of read and write (R/W) operations that are necessary. Therefore, for power consumption comparisons, the number of R/W operations must be added as a measure.

Using these more detailed measures, the solutions which are compared in Table I are the RE-SMU, the TB-SMU with block trace-back of block length D, and the new SMU architectures of Figs. 5(a) and 6.

It can be seen that the algebraic formulation of the SMU problem allowed for an easy design of new architectures which are sample points in the large space of solutions with differing latency, memory complexity, and arithmetic complexity. The

algebraic formulation enables solutions to be designed with greatly reduced latency and/or complexity, as well as allowing for a tradeoff between latency, hardware complexity, and power consumption.

VII. CONCLUSION

In this paper an algebraic formulation of the survivor memory management of Viterbi detectors is introduced. This reveals the fact that the problem of survivor memory implementation is analogous to the realization of look-ahead in parallelized linear feedback loops. Hence, next to finding new solutions, a wide range of known solutions can be transferred and adapted from this well-known problem. They mainly present novel approaches for survivor memory realization. VLSI case studies of novel algorithms and architectures have shown that 50% savings in hardware and/or latency can be achieved.

The algebraic formulation introduced here is related to the algebraic formulation of the add-compare-select recursion of the Viterbi detector, introduced in [14], [15]. Hence, it now is easy to derive well matched survivor memory realizations also for all parallelized Viterbi detectors.

REFERENCES

[1] R. E. Bellman and S. E. Dreyfus, Applied Dynamic Programming. Princeton, NJ: Princeton University Press, 1962.
[2] J. K. Omura, "On the Viterbi algorithm," IEEE Trans. Inform. Theory, pp. 177-179, Jan. 1969.
[3] A. J. Viterbi, "Error bounds for convolutional coding and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. IT-13, pp. 260-269, Apr. 1967.
[4] G. D. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268-278, Mar. 1973.
[5] G. Fettweis, "Algebraic survivor memory management for Viterbi detectors," in Proc. IEEE Int. Conf. Commun. (ICC'92), Chicago, IL, June 1992, pp. 313.4.1-313.4.5.
[6] C. M. Rader, "Memory management in a Viterbi decoder," IEEE Trans. Commun., vol. COM-29, pp. 1399-1401, Sept. 1981.
[7] R. Cypher and B. Shung, "Generalized trace back techniques for survivor memory management in the Viterbi algorithm," in Proc. IEEE GLOBECOM, San Diego, CA, Dec. 1990, vol. 2, pp. 1318-1322.
[8] G. Feygin and P. G. Gulak, "Survivor memory management in Viterbi decoders," IEEE Trans. Commun., vol. 39, 1991.
[9] T. K. Truong, M.-T. Shih, I. S. Reed, and E. H. Satorius, "A VLSI design for a trace-back Viterbi decoder," IEEE Trans. Commun., vol. 40, pp. 616-624, Mar. 1992.
[10] G. Fettweis, L. Thiele, and H. Meyr, "Algorithm transformations for unlimited parallelism," in Proc. IEEE Int. Symp. Circuits and Syst., New Orleans, LA, May 1990, vol. 2, pp. 1756-1759.
[11] K. K. Parhi and D. G. Messerschmitt, "Block digital filtering via incremental block-state structures," in Proc. IEEE Int. Symp. Circuits and Syst., Philadelphia, PA, 1987, pp. 645-648.
[12] ——, "Pipelined VLSI recursive filter architectures using scattered look-ahead and decomposition," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Processing, New York, 1988, pp. 2120-2123.
[13] L. Thiele and G. Fettweis, "Algorithm transformations for unlimited parallelism," Electron. and Commun. (AEÜ), vol. 2, pp. 83-91, Apr. 1990.
[14] G. Fettweis and H. Meyr, "High-speed Viterbi processor: A systolic array solution," IEEE J. Select. Areas Commun., vol. 8, pp. 1520-1534, Oct. 1990.
[15] ——, "High-speed parallel Viterbi decoding," IEEE Commun. Mag., pp. 46-55, May 1991.

Gerhard Fettweis (S'84-M'90) received the Dipl.-Ing. and the Ph.D. degrees in electrical engineering from the Aachen University of Technology, Aachen, Germany, in 1986 and 1990, respectively.
He is a scientist at TCSI Corporation, Berkeley, CA. In 1986 he worked at the ABB research laboratory, Baden, Switzerland, on his Diplom thesis. During 1991 he was a visiting scientist at the IBM Almaden Research Center, San Jose, CA. His interests are in microelectronics and digital wireless communications, especially the interaction between algorithm and architecture design for high-performance VLSI processor implementations.
Dr. Fettweis is a member of the IEEE Solid State Circuits Council as representative of the IEEE Communications Society, and is Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II.