__________________________________________________________

Convolutional Coding on Xtensa® Processors

Application Note














Tensilica, Inc.
3255-6 Scott Blvd.
Santa Clara, CA 95054
(408) 986-8000
Fax (408) 986-8919
www.tensilica.com

January 2009 Doc Number: AN01-123-04


© 2005-2008 Tensilica, Inc.
All Rights Reserved
Printed in the United States of America

This publication is provided “AS IS.” Tensilica, Inc. (hereafter “Tensilica”) does not make any warranty of any kind, either expressed or implied, including, but not
limited to, the implied warranties of merchantability and fitness for a particular purpose. Information in this document is provided solely to enable system and
software developers to use Tensilica processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual
property rights or licenses granted hereunder to design or fabricate Tensilica integrated circuits or integrated circuits based on the information in this document.
Tensilica does not warrant that the contents of this publication, whether individually or as one or more groups, meets your requirements or that the publication
is error-free. This publication could include technical inaccuracies or typographical errors. Changes may be made to the information herein, and these changes
may be incorporated in new editions of this publication.

The following terms are trademarks of Tensilica, Inc.: OSKit, Tensilica, Vectra, and Xtensa. All other trademarks and registered trademarks are the property of
their respective companies.























Document Change History:


September 1998 (Revised January, 2001; February, 2005)
January 2009



Contents
1 Communication System Challenges............................................................................1
2 A Simple Encoder.......................................................................................................1
3 The Encoding Process................................................................................................3
4 Viterbi Decoding.........................................................................................................6
5 Details of the Viterbi Algorithm....................................................................................7
6 Distance Metric Calculation........................................................................................7
7 The Trellis Decode Butterfly........................................................................................9
8 Implementation on Base Xtensa...............................................................................11
9 Full Optimization with TIE.........................................................................................12
10 Demonstration Instructions.......................................................................................16
11 Summary..................................................................................................................16
Appendix A – VTB2.TIE Code........................................................................................17






Figures
Figure 1: Communication System Block Diagram.............................................................1
Figure 2: Simple Convolutional Encoder...........................................................................2
Figure 3: Convolutional Encoder State Diagram...............................................................2
Figure 4: Trellis Diagram Showing Most-Likely Path Through States................................6
Figure 5: Distance Metric Graph.......................................................................................8
Figure 6: Four Butterflies in a Trellis Time Step (K=4).....................................................9
Figure 7: Butterfly with Distance Metric.......................................................................... 10
Figure 8: Adding State and Branch Distance Metrics..................................................... 10
Figure 9: Selecting Smallest Accumulated Distance Metric............................................ 10
Figure 10: Butterfly Operation Diagram.......................................................................... 11


Tables

Table 1: Distance Metric Values........................................................................9

Abstract
This application note looks briefly at popular techniques for convolutional encoding and
decoding, especially Viterbi decoding, and illustrates the power of a configurable processor in
handling the performance-intensive signal processing demands of coding and decoding.
Application-specific processors are quickly designed, simulated, built in silicon, and offer
significantly better programmability, performance, and power-efficiency than most popular
digital signal processors (DSPs). In particular, this paper describes user-defined TIE (Tensilica
Instruction Extension) instructions which accelerate distance metric calculations, the most
performance-critical task in Viterbi decoding, by 32x over most popular DSPs and 155x over
most popular 32-bit RISC cores.
This application note assumes that the reader is familiar with Viterbi decoding,
the Xtensa Instruction Set Architecture, and the Tensilica Instruction Extension description
language. Please refer to the Xtensa ISA Reference Manual and the Tensilica Instruction
Extension (TIE) Language User’s Guide for additional information.


1 Communication System Challenges
One of the greatest challenges in communication system design is the efficient transmission and
reception of information in the presence of errors introduced by the communication channel.
The problem is especially pronounced in radio communication, because of the variety of
noise sources in the channel. Designers have adopted coding methods that add
redundancy in the encoding of information before transmission. Although the addition of
redundant data reduces the overall throughput of the channel, forward error correction
improves performance by using the redundant data to correct errors during decoding at the
receiver, as shown in Figure 1.
[Figure: original data stream -> encoder -> encoded data -> noisy channel -> encoded data + noise -> decoder -> recovered data stream]

FIGURE 1: COMMUNICATION SYSTEM BLOCK DIAGRAM
Convolutional coding, that is, coding based on time-invariant finite state machines, is widely
used in wireless communications. This application note looks briefly at popular techniques for
convolutional encoding and decoding, especially Viterbi decoding. It illustrates the power of a
configurable processor in handling the performance-intensive signal processing demands of
coding and decoding. Specifically, user-defined instructions in the Tensilica Instruction
Extension Language (TIE) will be described which accelerate distance metric calculations, the
most performance-critical task in Viterbi decoding, by more than 32 times over most popular
digital signal processors and 155 times over most popular 32-bit RISC cores.

2 A Simple Encoder
In convolutional encoding, each new coded bit for transmission is generated by a convolution of
the current input bit with some number of earlier input bits and a masking polynomial. The
ability of the decoder to detect and correct errors in transmission depends on the number of
input bits used in the convolution. That number of bits is called the constraint length.
Redundancy is added to the bit stream by the generation of more than one bit of encoded
output for each input bit. This ratio of input bits to output bits is called the coding rate. For
example, a coding rate of 1/2 will generate 2 output bits from 1 input bit. Popular wireless
communication standards (GSM, IS-95, IS-136) use constraint lengths from 5 to 9 and coding
rates from 1/2 to 1/4.
A simple convolutional encoder, with a constraint length of 4 and a coding rate of 1/2, is shown in
Figure 2. For each new input x(i), two new outputs, G0 and G1, are generated for
transmission.
[Figure: input Xi feeds a chain of three one-sample delay elements (D); the current and delayed bits are combined with exclusive-OR gates to form the outputs G0,i and G1,i]

FIGURE 2: SIMPLE CONVOLUTIONAL ENCODER
This example implements the convolutional code represented by the polynomials:

    G0 = 1 + x + x^3    and    G1 = 1 + x + x^2 + x^3
The polynomial formulas listed above are a convenient way to represent which inputs, the
current bit (X0) and the delayed bits (X1, X2, X3), feed the XOR logic that forms each output.
For example, output G0 (1 + x + x^3) is calculated by XORing the current bit (X0), the previous
bit (X1), and the third previous bit (X3). Output G1 (1 + x + x^2 + x^3) is calculated by XORing
the current bit (X0), the previous bit (X1), the second previous bit (X2), and the third previous
bit (X3).
This encoder can also be expressed as a state diagram, as shown in Figure 3. Each of the
states is labeled with a state number corresponding to the state of the three delay elements of
the circuit above. Note that the most recent bit is assigned to the LSB, while the third previous
bit is assigned to the MSB. Each of the arcs is labeled x, G0, G1 (the input bit x for that arc, and
the G0, G1 outputs for that input).

[Figure: eight encoder states (000 through 111), each with two outgoing arcs labeled x, G0, G1]

FIGURE 3: CONVOLUTIONAL ENCODER STATE DIAGRAM
It is convenient to view the encoder as a state diagram showing arcs from one encoder state to
another. Each arc is labeled with the corresponding input bit and encoder output bits. Later,
this state diagram is converted to a trellis diagram to represent the state arcs with respect to time.
Note that, except for the encoder outputs, the state representation remains unchanged for any
basic convolutional encoder with the same constraint length, because the shifting pattern of bits
through the encoder remains the same; different polynomials simply generate different outputs
for each arc from one state to another, as the short program below illustrates.
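To make the fixed transition structure concrete, the short C program below prints the next-state and output table for the k = 4, rate-1/2 encoder of Figure 2. It is a minimal sketch written for this note (it is not part of the demonstration project); it follows the convention above that the most recent bit occupies the LSB of the state.

#include <stdio.h>

int main(void)
{
    int s, x;
    for (s = 0; s < 8; s++) {                /* current state (X3 X2 X1)        */
        for (x = 0; x <= 1; x++) {           /* new input bit                   */
            int x1 = s & 1;                  /* most recent delayed bit (LSB)   */
            int x2 = (s >> 1) & 1;
            int x3 = (s >> 2) & 1;           /* oldest delayed bit (MSB)        */
            int g0 = x ^ x1 ^ x3;            /* G0 = 1 + x + x^3                */
            int g1 = x ^ x1 ^ x2 ^ x3;       /* G1 = 1 + x + x^2 + x^3          */
            int next = ((s << 1) | x) & 7;   /* shift the new bit into the LSB  */
            printf("%d%d%d --%d--> %d%d%d   G0=%d G1=%d\n",
                   x3, x2, x1, x,
                   (next >> 2) & 1, (next >> 1) & 1, next & 1, g0, g1);
        }
    }
    return 0;
}

Running the program shows that the arcs leaving each state are the same for any k = 4 encoder; only the G0/G1 labels change with the polynomials.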

3 The Encoding Process
The convolutional encoder described in the previous sections can be implemented either as a
hardware state machine or as a software routine running on a processor. Although the
hardware implementation for a given encoding polynomial is typically quite simple, a software
implementation offers valuable flexibility. The increasing need for adaptive and multi-protocol
communication equipment makes a processor-based solution appropriate in many
circumstances.
Below is a C implementation of the encoder that was shown earlier.

// Sample Convolutional Encoder
// Constraint length 4 and coding rate 1/2
// G0 = 1 + x + x^3 and G1 = 1 + x + x^2 + x^3

#define FrameSize 1024   /* frame length in bits (illustrative value) */

/* input data for Convolutional Encoder */
char IN[FrameSize];
/* output data from Convolutional Encoder */
char G0[FrameSize], G1[FrameSize];

void convolve()
{
    int f;

    for (f = 0; f < FrameSize; f++)
    {
        if (f >= 3)
        {
            // Note that ANSI C XOR "^" operations are "+" in polynomial representation
            G0[f] = IN[f] ^ IN[f-1] ^ IN[f-3];
            G1[f] = IN[f] ^ IN[f-1] ^ IN[f-2] ^ IN[f-3];
        }
        else if (f == 2)   // Assume delay element 3 flushed to zero
        {
            G0[f] = IN[f] ^ IN[f-1];
            G1[f] = IN[f] ^ IN[f-1] ^ IN[f-2];
        }
        else if (f == 1)   // Assume delay elements 2-3 flushed to zero
        {
            G0[f] = IN[f] ^ IN[f-1];
            G1[f] = IN[f] ^ IN[f-1];
        }
        else               // f == 0, initial condition:
        {                  // all delay elements are flushed to zero
            G0[f] = IN[f];
            G1[f] = IN[f];
        }
    }
}
Encoding can be rewritten, as in the pseudo code below, to take advantage of the Xtensa
processor’s funnel shift and XOR instructions.

// Pseudo code for encoder
// G0 = 1 + X + X^3  and  G1 = 1 + X + X^2 + X^3
// N = number of input bits in frame

// Assign encoder input & output streams
int *Input_Ptr = &Input;
int *Output_Ptr_G0 = &Output_G0;
int *Output_Ptr_G1 = &Output_G1;

// Initialize Input32_old to zero
Input32_old = 0;

// Encode 32 input bits per iteration
for (i = 0; i < (N/32); i++)
{
    Input32_new = *Input_Ptr++;
    // delay (Input32_new, Input32_old) by 1 bit using funnel shift
    Input32_delay1 = {Input32_new[30:0], Input32_old[31]};
    // delay (Input32_new, Input32_old) by 2 bits using funnel shift
    Input32_delay2 = {Input32_new[29:0], Input32_old[31:30]};
    // delay (Input32_new, Input32_old) by 3 bits using funnel shift
    Input32_delay3 = {Input32_new[28:0], Input32_old[31:29]};
    // Write out G0 = 1 + X + X^3
    *Output_Ptr_G0++ = Input32_new ^ Input32_delay1 ^ Input32_delay3;
    // Write out G1 = 1 + X + X^2 + X^3
    *Output_Ptr_G1++ = Input32_new ^ Input32_delay1 ^ Input32_delay2 ^ Input32_delay3;
    Input32_old = Input32_new;
}
The previous pseudo code is rewritten below in assembly language for the Xtensa processor to
perform the convolutional encoding of 64 bits within a single iteration.


// inner loop of k = 4, r = 1/2 encoding for the
// G0 = 1 + x + x^3 and G1 = 1 + x + x^2 + x^3 polynomials
//
// computes 64 output pairs per iteration
//
// a2  points to the word containing the next 64 input bits,
//     organized with the oldest bit in the msb of the word
// a14 points to the output buffer for G0
// a15 points to the output buffer for G1
// a8  contains the oldest 32 input bits from the previous iteration

    movi.n  a1, N/64
    loopnez a1, loopend     // use zero-overhead loop; N is the number of bits to encode

    l32i    a3, a2, 0       // a3 contains low 32b of input stream
    l32i    a9, a2, 4       // a9 contains high 32b of input stream
                            // note that a8 contains high 32b of previous iteration
    ssai    1               // funnel shift 64b by one sample time
    src     a4, a8, a3      // a4 contains low delayed by one (x)
    src     a10, a3, a9     // a10 contains high delayed by one (x)
    ssai    2               // funnel shift 64b by two sample times
    src     a5, a8, a3      // a5 contains low delayed by two (x^2)
    src     a11, a3, a9     // a11 contains high delayed by two (x^2)
    ssai    3               // funnel shift 64b by three sample times
    src     a6, a8, a3      // a6 contains low delayed by three (x^3)
    src     a12, a3, a9     // a12 contains high delayed by three (x^3)

    // compute G0 & G1 for all low 32b
    xor     a4, a4, a3      // G0 = 1 + x
    xor     a4, a4, a6      //        + x^3
    xor     a5, a5, a4      // G1 = G0 + x^2

    // compute G0 & G1 for all high 32b
    xor     a10, a10, a9    // G0 = 1 + x
    xor     a10, a10, a12   //        + x^3
    xor     a11, a11, a10   // G1 = G0 + x^2

    s32i    a4, a14, 0      // store G0 low 32b
    s32i    a5, a15, 0      // store G1 low 32b
    s32i    a10, a14, 4     // store G0 high 32b
    s32i    a11, a15, 4     // store G1 high 32b
    addi    a2, a2, 8       // advance input pointer by 64b
    addi    a14, a14, 8     // and output pointers by 64b
    addi    a15, a15, 8
    mov     a8, a9          // save high 32b for use in next iteration
loopend:

The assembly routine listed above is capable of encoding 2.5 bits per cycle. The performance
of this convolutional coding technique can be generalized to 11 + ((k-1)*5) cycles for each 64
input bits, where k is the constraint length; for example, with k = 4 this gives 64/(11 + 3*5), or
about 2.5 bits per cycle. The actual performance depends on the polynomials used. The
convolutional coding performance of a base Xtensa processor is comparable to a 16-bit DSP,
such as members of the Texas Instruments TMS320C54x family. This class of DSPs is capable
of coding 1.5 bits per cycle for a set of polynomials with k=5 (see: Viterbi Decoding Techniques
in the TMS320C54x Family, Henry Hendrix, Texas Instruments Application Note SPRA071,
June 1996). For the same polynomials, performance on an Xtensa processor is about 1.8 bits
per cycle.
4 Viterbi Decoding
The goal of decoding a received bit stream is to find the maximum-likelihood output sequence
given the received sequence, which is the transmitted sequence plus noise. Viterbi
decoding offers an efficient algorithm for finding this output sequence. It is based on a decoder
that uses the received data sequence to estimate the likelihood that the encoder is
in each of its possible states. The graphical modeling of all possible state transitions has come
to be called a trellis diagram. A simple trellis diagram is shown below. The trellis diagram is a
different way of modeling the state diagram that was shown earlier, but with the added
dimension of time. This diagram is used to determine the correct path through the states,
based on a particular transmitted sequence, assuming the encoder started in the idle state
(000). The challenge for the decoder is to predict this path even when some of the incoming
bits (G0, G1) may have been corrupted by noise.
[Figure: the eight states (000 through 111) drawn as a trellis across Time 0 through Time 4; for the received G0,G1 sequence 1,0  1,0  0,1  0,1 the most-likely path through the states is highlighted]

FIGURE 4: TRELLIS DIAGRAM SHOWING MOST-LIKELY PATH THROUGH STATES
5 Details of the Viterbi Algorithm
The Viterbi decode algorithm works in two phases. In the first phase, the update phase, the
incoming data is analyzed in sequence order. The maximum-likelihood decoder works by
maintaining a running estimate of the appropriateness of each possible path through the trellis
for the received data sequence. Starting from a known initial state and for each successive
received input pair (G0,G1), the decoder calculates a distance metric between the received
input pair and the input pair corresponding to each state arc in the diagram. The distance
metric calculation method will be discussed later. The shortest path, the series of arcs with the
smallest total distance metric, is taken to be the most-likely path through the trellis diagram.
Each path implies a unique state sequence in the encoder, and thus a unique input sequence.
This phase is considered the most CPU-intensive task within the Viterbi Algorithm, so the
remainder of this application note focuses on this area.
In the second phase, the trace back phase, the sequence of arc decisions must be traced back
to reconstruct the inferred inputs to the encoder. Recalling that the most recent data shifted
into the delay line is the LSB of the state, the inputs based upon the trellis diagram above are
inferred to be (1,0,0,0). This phase can be easily accomplished by examining the LSB of each
of the states, tracing backward through the most-likely path.
Several popular techniques are used to calculate distance metrics. In general, these methods
are categorized as either hard decision decoding or soft decision decoding. In a soft decision
decoder, the input to the decoder is an integer in the range between +B and -B, so the
strength of the signal can be used as information by the decoder. In a hard decision decoder,
threshold detection is used to quantize input signals into one of two values: +1 or -1. Soft
decision decoding with infinite range provides approximately 2.2 dB better coding gain than hard
decision decoding, at the expense of slightly more complexity in the decoder.
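As a simple illustration of the difference, the helper below shows hard-decision quantization. It is a sketch for this note only (the function name and the zero threshold are assumptions); a soft-decision decoder would instead keep the received sample, limited to the range -B..+B, and feed it directly into the distance metric.

/* Hard-decision quantization sketch: reduce a received sample to +1 or -1
   by threshold detection at zero. */
int hard_decision(int r)
{
    return (r >= 0) ? +1 : -1;
}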
6 Distance Metric Calculation
In the trellis diagram shown in Figure 4, there are arcs leading from states in one trellis column
to states in the next trellis column. Each of these arcs has an associated local distance (branch
metric). Recall that the state diagram shown earlier labels each arc with the encoder outputs
for that transition. The local distance is determined by comparing the actual received data to
the expected encoder outputs for a given arc.
The Hamming Distance technique is one of the more popular techniques used for calculating
distance metrics. For a coding rate of 1/2, we can imagine the actual data, G0 and G1, as
indicating a position in two different dimensions. Each arc in the trellis diagram has a
corresponding expected pair, R0 and R1, which is the expected encoder output for that arc. The
diagram below shows both the actual and expected data represented as points in a Cartesian
plane. The Hamming distance is determined by adding the absolute differences in each
dimension (|G0 - R0| + |G1 - R1|).
[Figure: the actual data (G0, G1) and expected data (R0, R1) plotted as points; the Hamming distance follows the axes while the straight-line distance connects the two points directly]

FIGURE 5: DISTANCE METRIC GRAPH
Another popular distance metric technique is the Euclidean (Square) Distance. Using the
Pythagorean Theorem, the straight-line distance between the actual and expected input pairs of
the previous diagram is calculated as follows:

    sqrt((G0 - R0)^2 + (G1 - R1)^2)

Removing the square root from the straight-line distance calculation gives the Euclidean
(Square) Distance. There is a slight bit error rate (BER) performance penalty for using the
Euclidean (Square) Distance instead of the true straight-line distance, yet this penalty is
negligible when compared with the reduction in complexity. Expanding the Euclidean (Square)
Distance metric results in the following equation:

    G0^2 - 2(R0*G0) + R0^2 + G1^2 - 2(R1*G1) + R1^2
Note that the distance metric for a given arc will be compared against the distance metrics of
other arcs within the same trellis column. Adding a constant or multiplying by a constant does
not affect this comparison, so the distance metric calculation can be simplified by removing
constants and constant multipliers. G0 and G1 are the actual inputs, which range between
+B and -B, yet they are constant throughout a trellis column, so the squares of G0 and G1 can
be eliminated. Since the expected inputs R0 and R1 have possible values of +B or -B, the
squares of R0 and R1 become B^2, which is a constant and can also be eliminated. Thus, the
distance metric can be further simplified by removing these constants:

    -2(R0*G0) - 2(R1*G1)

Removing the constant multiplier 2 in the equation above leaves:

    -(R0*G0) - (R1*G1)

Recalling that R0 and R1 have possible values of +B or -B, the distance metric is simplified as
shown in the following table:

TABLE 1: DISTANCE METRIC VALUES

Expected Data (R0, R1) | Distance Metric | Removing Constant B | Replace with Sum, Diff
-----------------------+-----------------+---------------------+-----------------------
+B, +B                 | -BG0 - BG1      | -G0 - G1            | -Sum
+B, -B                 | -BG0 + BG1      | -(G0 - G1)          | -Diff
-B, -B                 | +BG0 + BG1      | G0 + G1             | Sum
-B, +B                 | +BG0 - BG1      | G0 - G1             | Diff

Note: Sum = G0 + G1; Diff = G0 - G1

The distance metric calculation has thus been greatly simplified to plus or minus the sum or the
difference of the received data. To determine the local distance of a particular arc, determine
the expected data for that arc and substitute the corresponding value from the table above, as
the sketch below shows.
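Applied in C, Table 1 reduces to a small helper like the one below. This is an illustrative sketch (the function name and types are assumptions, not taken from the demonstration project); G0 and G1 are the received soft values for the trellis column, and R0 and R1 are the expected outputs (+B or -B) for the arc in question.

/* Local (branch) distance for one arc, per Table 1 */
short local_distance(short G0, short G1, int R0, int R1)
{
    short Sum  = G0 + G1;
    short Diff = G0 - G1;

    if (R0 > 0 && R1 > 0) return -Sum;    /* expected (+B, +B) */
    if (R0 > 0 && R1 < 0) return -Diff;   /* expected (+B, -B) */
    if (R0 < 0 && R1 < 0) return  Sum;    /* expected (-B, -B) */
    return Diff;                          /* expected (-B, +B) */
}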


7 The Trellis Decode Butterfly
To aid in implementation, it is often helpful to arrange calculations in functional groups; the
procedure for doing the calculations on a single group then becomes a template for other,
similar groups. A butterfly can be visualized as a grouping of 2 source states, 2 destination
states, and the 4 arcs between them. For the trellis diagram shown earlier, with 8 states per
column, a time step from one trellis column to the next can be visualized as 4 butterflies, as
shown below.
[Figure: the eight trellis states grouped into four butterflies; each butterfly joins two source states to two destination states through four arcs]

FIGURE 6: FOUR BUTTERFLIES IN A TRELLIS TIME STEP (K=4)
Let’s take a closer look at a single butterfly calculation. The diagram below shows a butterfly
diagram with corresponding encoder output values for each arc. The encoder outputs are
translated into local distances as per the previous table.
[Figure: butterfly from source states 000 and 100 into destination states 000 and 001; the expected encoder outputs on the arcs (-B,-B and +B,+B) translate into the local distances +Sum and -Sum per Table 1]

FIGURE 7: BUTTERFLY WITH DISTANCE METRIC
The heart of the butterfly calculation is sometimes called the ADD-COMPARE-SELECT operation.
In the ADD stage, the accumulated distance metric is calculated by taking the local distance of
each arc in the butterfly and adding it to the accumulated distance metric of the originating
state. With the accumulated distance metric of the originating state named StateN (where N is
the state number), the diagram below shows each arc's accumulated distance metric after the
ADD stage.
[Figure: the same butterfly after the ADD stage; the four arcs carry the accumulated metrics State0+Sum and State4-Sum (into state 000) and State0-Sum and State4+Sum (into state 001)]

FIGURE 8: ADDING STATE AND BRANCH DISTANCE METRICS
In the COMPARE stage, the distance metrics of the arcs entering a destination state are compared.
In the butterfly diagram, two arcs, and therefore two accumulated distance metrics, lead into
each destination state. Of the two arcs, the arc with the smaller distance metric is considered
the most-likely arc and the other arc is discarded.
In the SELECT stage, the most-likely arc's accumulated distance metric is stored as the new
accumulated distance metric for the state. The diagram below shows the selected arcs and the
updated accumulated distance metrics for State0 and State1, assuming State0+Sum < State4-Sum
and State0-Sum < State4+Sum.

[Figure: with State0+Sum < State4-Sum and State0-Sum < State4+Sum, both surviving arcs originate from state 000; the new accumulated metrics become State0+Sum (into state 000) and State0-Sum (into state 001)]

FIGURE 9: SELECTING SMALLEST ACCUMULATED DISTANCE METRIC
The selected arcs are recorded so this information can be used during the trace-back phase to
reconstruct the most-likely path through the trellis. One way to code the selected arc is to use
the MSB of the originating state. Hence, the most-likely arc into State0 is coded as 0, and the
most-likely arc into State1 is also coded as 0.
The regularity of the butterfly computation suggests a set of special instructions intended to
accelerate the calculation of distance metrics. Variations of the add-compare-select instruction
have been implemented on advanced digital signal processors. In our C-based implementation,
a macro called ACS is used to implement a variation of the ADD-COMPARE-SELECT calculation.
The macro and sample usage are shown below for a single butterfly operation.


/******************************************************************
 ACS is a macro which performs a variation of the ADD-COMPARE-SELECT
 operation for each state in the Trellis. It compares the 2 accumulated
 distance metrics (X, Y) of the 2 arcs leading into the state. The shortest
 arc is selected as the most-likely arc. The shortest accumulated distance
 metric is stored in S(I), and a binary code which designates the most-likely
 arc into the state is stored in Select[j][I], where (I) represents the state
 and (j) represents the trellis column.
*******************************************************************/
#define ACS(S, I, X, Y) if ((s1 = (X)) < (s2 = (Y))) { S[(I)] = s1; \
    Select[j][(I)] = 0; } else { S[(I)] = s2; Select[j][(I)] = 1; }

Diff = G0[j] - G1[j];
Sum  = G0[j] + G1[j];
// Using ACS macro for a single butterfly
ACS(NewState, 0, State0 + Sum, State4 - Sum);
ACS(NewState, 1, State0 - Sum, State4 + Sum);

A butterfly operation consists of two add-compare-select calculations. The code above is used
to perform the butterfly operation shown below.
[Figure: the complete butterfly operation performed by the two ACS calls above: NewState[0] = Min(State0+Sum, State4-Sum) and NewState[1] = Min(State0-Sum, State4+Sum)]

FIGURE 10: BUTTERFLY OPERATION DIAGRAM
A single butterfly operation is performed for every pair of destination states within a trellis
column. The same trellis-column operation is repeated for every subsequent trellis column
until the end of the frame. Once the end of the frame is reached, each state's accumulated
distance metric is compared, and the state with the smallest metric is taken as the ending state.
The trace-back phase begins with this end state. The decoder then extracts the LSB of each
state as the inferred input bit and uses the recorded arc decisions to trace back through all
prior trellis columns until the input at the beginning of the frame has been recovered, as the
sketch below illustrates.
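A minimal C sketch of this trace-back, using the Select[][] decisions and accumulated metrics S[] produced by the ACS macro above, is shown below for the k = 4 example. The function and array names are illustrative assumptions, not taken from the demonstration project.

#define K   4                 /* constraint length of the example encoder */
#define NS  (1 << (K - 1))    /* 8 trellis states                         */

void traceback(short S[NS], unsigned char Select[][NS], int FS,
               unsigned char Decoded[])
{
    int state = 0, s, j;

    /* the state with the smallest accumulated metric is the ending state */
    for (s = 1; s < NS; s++)
        if (S[s] < S[state]) state = s;

    for (j = FS - 1; j >= 0; j--) {
        Decoded[j] = state & 1;   /* the LSB of the state is the inferred input bit */
        /* the recorded bit is the MSB of the originating state; undo the shift */
        state = (state >> 1) | (Select[j][state] << (K - 2));
    }
}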
8 Implementation on Base Xtensa
A demonstration GSM Viterbi decoder and test bench was developed in C and is provided as an
Xplorer workspace file, Viterbi_v2.xws. The decoder is a soft decision decoder using the
Euclidean (Square) Distance metric and the ACS macro described earlier in this application
note. Since GSM uses a constraint length of five, there are 16 states in every trellis column
(instead of the eight states described in the previous sections). Hence, GSM requires eight
butterfly operations to decode a single bit (compared to our previous example, which required
only four butterfly operations). A sketch of one such column update appears below.
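As an illustration of what those eight butterflies look like in C, the sketch below updates one trellis column with the ACS macro from Section 7 (repeated here so the fragment is self-contained). The names OldState, NewState, top_sel, and update_column are illustrative assumptions; the top_sel entries follow the branch-metric selections used for the standard GSM polynomials in the WUR_BMSel initialization of Section 9.

#define FS 1000                              /* bits per frame (test bench)  */
short s1, s2;                                /* temporaries used by ACS      */
unsigned char Select[FS][16];                /* per-column arc decisions     */
short OldState[16], NewState[16];            /* accumulated distance metrics */

/* ACS macro repeated from Section 7 */
#define ACS(S, I, X, Y) if ((s1 = (X)) < (s2 = (Y))) { S[(I)] = s1; \
    Select[j][(I)] = 0; } else { S[(I)] = s2; Select[j][(I)] = 1; }

enum { SUM, NEG_SUM, DIFF, NEG_DIFF };
/* top-most branch metric of each butterfly (GSM polynomials, per Section 9) */
static const int top_sel[8] = { SUM, DIFF, SUM, DIFF,
                                NEG_SUM, NEG_DIFF, NEG_SUM, NEG_DIFF };

void update_column(int j, short g0, short g1)
{
    short Sum  = g0 + g1;
    short Diff = g0 - g1;
    int b;

    for (b = 0; b < 8; b++) {
        short m = (top_sel[b] == SUM)     ?  Sum :
                  (top_sel[b] == NEG_SUM) ? -Sum :
                  (top_sel[b] == DIFF)    ?  Diff : -Diff;
        /* butterfly b: source states b and b+8, destination states 2b and 2b+1 */
        ACS(NewState, 2*b,     OldState[b] + m, OldState[b+8] - m);
        ACS(NewState, 2*b + 1, OldState[b] - m, OldState[b+8] + m);
    }
}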
The Viterbi_v2 project is a test bench that prepares a random frame of 1000 bits and then
encodes it into GSM coded symbols. The symbols are corrupted to simulate white noise.
Finally, the test bench decodes these symbols and compares the output with the original input
bits. The Viterbi decoder is benchmarked for performance.
In this original form, a single bit requires 337 cycles to decode on a base Xtensa processor
when using aggressive compiler optimizations (the -O3 switch in xt-xcc). Given that the
Xtensa processor is as efficient, if not more efficient, than ARM9 and MIPS32 cores in handling
ANSI C code, the performance of other 32-bit RISC cores is estimated to be similar.
9 Full Optimization with TIE
The Tensilica Instruction Extension (TIE) language provides a powerful mechanism to add
instructions to the base Xtensa instruction set and to generate complete support for special-purpose
operations in the hardware and software tools. The decode butterfly adds the local
distance to the accumulated distance metrics of a pair of source states, then compares and
selects the most-likely arc into each of the pair of destination states. The regularity of
this computation suggests a set of special instructions intended to accelerate the butterfly
calculation. Variations of add-compare-select instructions have been implemented on
advanced digital signal processors to accelerate Viterbi decoding. Likewise, variations of the
add-compare-select instruction can be developed for Xtensa using TIE. Such instructions are
invaluable in accelerating Viterbi decoders that support data encoded with arbitrary constraint
lengths and polynomials. Alternatively, TIE can be used to develop instructions that accelerate
the decoding of data generated by one specific encoder; such encoder-specific TIE instructions
can reach computational performance comparable to a pure hardware implementation. The
optimal set of TIE instructions depends on the balance between flexibility and computational
performance required in a given system.
Significant improvement using TIE can be achieved by creating a variation of the add-compare-
select butterfly computation and defining this logic as a TIE function as shown below:


// Viterbi ADD-COMPARE-SELECT Butterfly
function [33:0] VBFLY ([15:0] StateA, [15:0] StateB, [15:0] Metric)
{
    wire [15:0] neg_Metric = ~Metric + 1'b1;
    // Add state and path metric
    wire [15:0] stateA_pathA = StateA + Metric;
    wire [15:0] stateB_pathB = StateB + neg_Metric;
    // Compare accumulated metric
    wire [4:0] compA = TIEcmp(stateA_pathA, stateB_pathB, 1'b1);
    // Selected (least value) path is output
    wire [15:0] new_stateA = (compA[4]) ? stateA_pathA : stateB_pathB;
    wire SelectA = (compA[4]) ? 0 : 1;

    wire [15:0] stateA_pathB = StateA + neg_Metric;
    wire [15:0] stateB_pathA = StateB + Metric;
    wire [4:0] compB = TIEcmp(stateA_pathB, stateB_pathA, 1'b1);
    wire [15:0] new_stateB = (compB[4]) ? stateA_pathB : stateB_pathA;
    wire SelectB = (compB[4]) ? 0 : 1;

    assign VBFLY = {SelectA, SelectB, new_stateA, new_stateB};
}
This TIE function performs the same computation as a pair of ACS macros shown in section 7.
Several additional techniques are used to accelerate the Viterbi decoder:
♦ The VBFLY TIE function can be instantiated several times in an operation so that multiple
Viterbi butterfly computations are performed in parallel.
♦ Internal TIE state (not to be confused with the states in the trellis diagram, referred to as
trellis states) can hold intermediate data, such as accumulated state metrics, eliminating
many memory accesses.
♦ Memory accesses and butterfly computations can be fused into high-performance TIE
operations.
♦ FLIX with a dual load/store interface allows two operations (both performing a load/store)
to be issued in the same instruction word.
Appendix A lists vtb2.tie, the TIE file that describes the TIE operations that accelerate Viterbi
decode. The TIE instructions for the trellis update phase of Viterbi decoding are summarized
below.
♦ VBIN: Viterbi Input
C Intrinsic Syntax: void VBIN(VREG PG0, VREG* p_PG0)
This operation loads 2 GSM coded symbol pairs (4 bytes) at one time by using a 32-bit load into
the 32-bit VREG register file. The load pointer (p_PG0) is also auto-incremented by 4 bytes in
preparation for the next VBIN instruction.
♦ VBOUT: Parallel Viterbi Butterfly Operation and Output
C Intrinsic Syntax: void VBOUT(unsigned short* PSelect, VREG PG0, imm i)
This operation updates all state metrics of a trellis column for a single pair of GSM coded data
(PG0). The add-compare-select operation is performed on all 16 states of the trellis column
using 8 VBFLY TIE functions, supporting the Viterbi butterfly computations for the entire trellis
column.
The operation updates each state's accumulated distance metric within 16-bit TIE states, one
for each of the 16 trellis states, and writes out 16 "select" bits for the most-likely arcs going into
each of the 16 trellis states. The write pointer (PSelect) is auto-incremented in preparation for
the next VBOUT instruction. An immediate operand (i) chooses a symbol pair of GSM
coded data from the 32-bit VREG TIE register file. Since VBIN provides 2 GSM coded symbol
pairs, there are two VBOUT instructions for each VBIN instruction.
♦ WUR_BMSel: Write User Register - Branch Metric Select
The BMSel register is a 32-bit register that sets the distance metric for each path of the Viterbi
butterfly computations used by the VBOUT instruction. Since VBOUT performs 8 butterfly
computations, there are 32 path metrics. However, due to the path symmetry of the butterfly
structure, only the top-most path of each butterfly needs to be defined; the remaining paths are
inferred from it. For example, the top-most path in Figure 8 is +Sum. The bottom-most
path is the same as the top-most path (+Sum), and the diagonal paths are the negative of the
top-most path (-Sum).
The BMSel register is split into eight 4-bit fields, where each bit corresponds to a one-hot
selection of +Sum, -Sum, +Diff, or -Diff. The most significant 4-bit field corresponds to the
top-most path of the butterfly computation that updates states 0 and 1. The following 4-bit field
corresponds to the top-most path of the butterfly computation that updates states 2 and 3, and
so on.
Prior to executing VBOUT instructions, the BMSel register should be initialized with the
appropriate branch metric selection for each butterfly computation. By allowing the branch
metrics to be set, the VBOUT instruction supports the different polynomials used for
Viterbi coding (given a constraint length of k = 5 and a coding rate of 1/2).
In this example, the path metrics for each butterfly computation are taken directly from the
GSM decoder C source code. The initialization for the standard GSM coding polynomials is
shown in the sample code below:

#define dist_sum      8
#define dist_neg_sum  4
#define dist_diff     2
#define dist_neg_diff 1

WUR_BMSel((dist_sum << 28) | (dist_diff << 24) | (dist_sum << 20) |
          (dist_diff << 16) | (dist_neg_sum << 12) | (dist_neg_diff << 8) |
          (dist_neg_sum << 4) | (dist_neg_diff));

The TIE operations described above are defined to be used in dual slot 32-bit FLIX instructions
as shown below. This enables two operations to be issued in the same instruction word.

format vtb_flix 32 {slot_a, slot_b}
slot_opcodes slot_a {VBIN, STORE_OUT}
slot_opcodes slot_b {VBOUT, BACKTRACE, BACKTRACE0}

The Viterbi_v2 code contains #ifdef TIE statements that conditionally replace use of the
Xtensa processor's base ISA with the specialized TIE instructions to accelerate the algorithm.
The main loop for the Viterbi decoder's update phase is shown below:

for (j = 0; j < FS; j = j + 2)
{
    /* Loop unrolled by 2 (2 trellis column updates per iteration) */
    VBIN(PG0, p_PG0);         /* Load 2 GSM coded pairs                */
    VBOUT(PSelect, PG0, 0);   /* Do calculation on 1st GSM coded pair  */
    VBOUT(PSelect, PG0, 1);   /* Do calculation on 2nd GSM coded pair  */
}

After compiling the code, the disassembly of the Viterbi decoder's trellis update loop is as
follows:

loopgtz a10, 60000f10 <main+0x148>
{ vbin vr0, a8;  vbout a9, vr1, 0 }
{ vbin vr1, a8;  vbout a9, vr1, 1 }
{ nop;           vbout a9, vr0, 0 }
{ nop;           vbout a9, vr0, 1 }

This loop demonstrates loop unrolling and software scheduling optimizations to hide load
latency. As a result of the optimization there are no stalls in the loop. Thus one VBOUT
operation completes every clock cycle. This means that the update phase of Viterbi decoding
occurs at a rate of one cycle per bit.
Another way of looking at this performance is to recognize that 8 butterfly computations are
performed in a single cycle. Compare this against the 4 cycles required for each butterfly on
specialized DSPs dedicated to wireless telephony, such as the TI TMS320C54x. This means that
the Xtensa implementation is 32x more cycle efficient than such DSPs when measured on a
"work-per-cycle" basis. Note that the Xtensa-based implementation is written in C, whereas
hand-coded assembly is required to obtain performance numbers for many DSP machines.

The TIE operations for the trace-back phase of Viterbi decoding are summarized below.
♦ BACKTRACE: Viterbi Backtrace
C Intrinsic Syntax: void BACKTRACE(unsigned short* PSelect)
This operation loads the 16 "select" bits (from address PSelect) that were stored during
execution of the VBOUT instructions. From the current minimum state, the select value
(representing the most-likely path) is used to trace backward to the previous trellis stage. The
LSB of the minimum state is considered the most-likely output bit and is saved in a
holding register to be written to memory later by the STORE_OUT operation. The select
pointer (PSelect) is post-decremented by 2 in preparation for the next BACKTRACE operation.
♦ BACKTRACE0: Viterbi Backtrace Initialization
C Intrinsic Syntax: void BACKTRACE0(char MinState)
This instruction is a subset of the BACKTRACE operation and is executed only once, prior to the
subsequent BACKTRACE instructions. It initializes the minimum state after the update phase.
The state number with the minimum accumulated metric is passed as the argument MinState.
♦ STORE_OUT: Store Output Value
C Intrinsic Syntax: void STORE_OUT(unsigned char* POutput)
This instruction performs a byte store of the single-bit output value calculated by prior
executions of the BACKTRACE instruction to the pointer POutput. The POutput pointer is
post-decremented by one in preparation for the next STORE_OUT operation.
The main loop for the Viterbi decoder's trace-back phase is shown below:

for (i = FS-1; i >= 1; i--) {
    BACKTRACE(PSelect);
    STORE_OUT(ptr_output);
}


The disassembly of the Viterbi decoder’s backtrace loop is as follows:

loopgtz a10, 60000fe0 <main+0x218>
{ store_out a9;  backtrace a8 }

The loop consists of a single FLIX instruction that contains both the BACKTRACE and STORE_OUT
operations. These operations are effectively pipelined such that the backtrace is done in one
iteration and the corresponding output bit is written to memory in the next iteration. As a result,
an output bit is written every clock cycle. This means that the trace-back phase of Viterbi
decoding also occurs at a rate of one cycle per bit.
The highly optimized assembly code described in this section was directly compiled from C
source code with the TIE variable set (#define TIE). Upon building this example and simulating
it, the console shows the following:

Processing New Frame

Errors detected = 0, Benchmark = 2.167000 cycles per bit

Viterbi decode performance of 2.17 cycles per bit is more than a 155x improvement over the
standard implementation without TIE acceleration (337 cycles per bit). The TIE area for this
approach is 28.7K gates, in addition to the 47K gates of the base Xtensa LX2 core. This core
can be synthesized at up to 264 MHz (worst case) in a 0.13µ LV process. Therefore, this solution
is capable of decoding a GSM coded bitstream at a peak rate of 130 Mbits per second.
10 Demonstration Instructions
The demonstration requires that you have installed Xplorer CE 2.1.1 with the RB-2008.3 software
tools. The workspace, Viterbi_V2.xws, can be obtained from the Tensilica support website.
Follow these steps to build and simulate the demonstration code.
1. Start Xplorer and import the Viterbi_V2.xws workspace. Select all components
provided in the workspace for installation into your workspace.
2. In the workspace toolbar, select the project (P: Viterbi_v2), configuration (C: Viterbi_v2), and
release target (T: Release).
3. Click Build Active to compile, and then click Run to simulate. The console will display the
decode errors and benchmark results.
To compare performance with the ANSI C implementation (without TIE), comment out
the line (#define TIE) in the main.c file of the Viterbi_V2 project.

11 Summary
Xtensa processors offer significant advantages for complex telephony applications. The Xtensa
architecture combines a powerful general-purpose 32-bit instruction set with a unique
configuration and extension process. These are used together to solve some of the toughest
problems in communication system design, including efficient convolutional coding and Viterbi
decoding. Application-specific processors are quickly designed, simulated, and built in silicon, and
offer significantly better programmability, performance, and power-efficiency than most popular
DSPs. With the benefit of TIE, Xtensa solutions can offer almost 155x improvement in
communication processing efficiency compared to conventional 32-bit RISC cores and over 32x
improvement compared to specialized DSPs.


Appendix A – VTB2.TIE Code
// VTB2.TIE
// TIE Extensions for Viterbi Acceleration

// FLIX
format vtb_flix 32 {slot_a, slot_b}
slot_opcodes slot_a {VBIN, STORE_OUT}
slot_opcodes slot_b {VBOUT, BACKTRACE, BACKTRACE0}


// States used by Viterbi Instructions
state AccumDist0 16 add_read_write
state AccumDist1 16 add_read_write
state AccumDist2 16 add_read_write
state AccumDist3 16 add_read_write
state AccumDist4 16 add_read_write
state AccumDist5 16 add_read_write
state AccumDist6 16 add_read_write
state AccumDist7 16 add_read_write
state AccumDist8 16 add_read_write
state AccumDist9 16 add_read_write
state AccumDistA 16 add_read_write
state AccumDistB 16 add_read_write
state AccumDistC 16 add_read_write
state AccumDistD 16 add_read_write
state AccumDistE 16 add_read_write
state AccumDistF 16 add_read_write
state MinState    4 add_read_write
state BMSel      32 add_read_write
state Output      1 add_read_write

// Immediates
immediate_range imm8 0 7 1

regfile VREG 32 2 vr


// Viterbi ADD-COMPARE-SELECT Butterfly
function [33:0] VBFLY ([15:0] StateA, [15:0] StateB, [15:0] Metric)
{
    wire [15:0] neg_Metric = ~Metric + 1'b1;
    wire [15:0] stateA_pathA = StateA + Metric;
    wire [15:0] stateB_pathB = StateB + neg_Metric;
    wire [4:0] compA = TIEcmp(stateA_pathA, stateB_pathB, 1'b1);
    wire [15:0] new_stateA = (compA[4]) ? stateA_pathA : stateB_pathB;
    wire SelectA = (compA[4]) ? 0 : 1;

    wire [15:0] stateA_pathB = StateA + neg_Metric;
    wire [15:0] stateB_pathA = StateB + Metric;
    wire [4:0] compB = TIEcmp(stateA_pathB, stateB_pathA, 1'b1);
    wire [15:0] new_stateB = (compB[4]) ? stateA_pathB : stateB_pathA;
    wire SelectB = (compB[4]) ? 0 : 1;

    assign VBFLY = {SelectA, SelectB, new_stateA, new_stateB};
}


operation VBIN {out VREG GInput, inout AR *ars} {out VAddr, in MemDataIn32}
{
    assign VAddr  = ars;
    assign GInput = MemDataIn32;
    assign ars    = ars + 4;
}

operation VBOUT {inout AR *ars, in VREG GInput, in imm8 t}
{
    in BMSel,
    inout AccumDist0,
    inout AccumDist1,
    inout AccumDist2,
    inout AccumDist3,
    inout AccumDist4,
    inout AccumDist5,
    inout AccumDist6,
    inout AccumDist7,
    inout AccumDist8,
    inout AccumDist9,
    inout AccumDistA,
    inout AccumDistB,
    inout AccumDistC,
    inout AccumDistD,
    inout AccumDistE,
    inout AccumDistF,
    out VAddr,
    out MemDataOut16
}
{
    // Choose G0 from GInput based upon immediate argument t
    // Written for Big Endian ordering
    wire [7:0] G0 = ((t == 1) ? GInput[15:8] : GInput[31:24]);
    // Choose G1 from GInput based upon immediate argument t
    // Written for Big Endian ordering
    wire [7:0] G1 = ((t == 1) ? GInput[7:0] : GInput[23:16]);

    // Declare temporary variables for AccumDist
    wire [15:0] State0 = AccumDist0;
    wire [15:0] State1 = AccumDist1;
    wire [15:0] State2 = AccumDist2;
    wire [15:0] State3 = AccumDist3;
    wire [15:0] State4 = AccumDist4;
    wire [15:0] State5 = AccumDist5;
    wire [15:0] State6 = AccumDist6;
    wire [15:0] State7 = AccumDist7;
    wire [15:0] State8 = AccumDist8;
    wire [15:0] State9 = AccumDist9;
    wire [15:0] StateA = AccumDistA;
    wire [15:0] StateB = AccumDistB;
    wire [15:0] StateC = AccumDistC;
    wire [15:0] StateD = AccumDistD;
    wire [15:0] StateE = AccumDistE;
    wire [15:0] StateF = AccumDistF;

    // Calculate Sum/Diff for input
    wire [7:0]  Sum_8  = G0 + G1;
    wire [7:0]  Diff_8 = G0 - G1;
    wire [15:0] Sum  = {8{Sum_8[7]}, Sum_8};
    wire [15:0] Diff = {8{Diff_8[7]}, Diff_8};
    wire [15:0] neg_Sum  = ~Sum + 1;
    wire [15:0] neg_Diff = ~Diff + 1;

    // Calculate accumulated path metrics and
    // compare/select the shortest path into each state
    // using 8 parallel VBFLY functions
    wire [15:0] new_AccumDist0, new_AccumDist1, new_AccumDist2, new_AccumDist3,
                new_AccumDist4, new_AccumDist5, new_AccumDist6, new_AccumDist7,
                new_AccumDist8, new_AccumDist9, new_AccumDistA, new_AccumDistB,
                new_AccumDistC, new_AccumDistD, new_AccumDistE, new_AccumDistF;
    wire Select0, Select1, Select2, Select3, Select4, Select5, Select6, Select7,
         Select8, Select9, SelectA, SelectB, SelectC, SelectD, SelectE, SelectF;

    wire [15:0] DistA = TIEsel(BMSel[31], Sum, BMSel[30], neg_Sum, BMSel[29], Diff, BMSel[28], neg_Diff);
    assign {Select0, Select1, new_AccumDist0, new_AccumDist1} = VBFLY(State0, State8, DistA);

    wire [15:0] DistB = TIEsel(BMSel[27], Sum, BMSel[26], neg_Sum, BMSel[25], Diff, BMSel[24], neg_Diff);
    assign {Select2, Select3, new_AccumDist2, new_AccumDist3} = VBFLY(State1, State9, DistB);

    wire [15:0] DistC = TIEsel(BMSel[23], Sum, BMSel[22], neg_Sum, BMSel[21], Diff, BMSel[20], neg_Diff);
    assign {Select4, Select5, new_AccumDist4, new_AccumDist5} = VBFLY(State2, StateA, DistC);

    wire [15:0] DistD = TIEsel(BMSel[19], Sum, BMSel[18], neg_Sum, BMSel[17], Diff, BMSel[16], neg_Diff);
    assign {Select6, Select7, new_AccumDist6, new_AccumDist7} = VBFLY(State3, StateB, DistD);

    wire [15:0] DistE = TIEsel(BMSel[15], Sum, BMSel[14], neg_Sum, BMSel[13], Diff, BMSel[12], neg_Diff);
    assign {Select8, Select9, new_AccumDist8, new_AccumDist9} = VBFLY(State4, StateC, DistE);

    wire [15:0] DistF = TIEsel(BMSel[11], Sum, BMSel[10], neg_Sum, BMSel[9], Diff, BMSel[8], neg_Diff);
    assign {SelectA, SelectB, new_AccumDistA, new_AccumDistB} = VBFLY(State5, StateD, DistF);

    wire [15:0] DistG = TIEsel(BMSel[7], Sum, BMSel[6], neg_Sum, BMSel[5], Diff, BMSel[4], neg_Diff);
    assign {SelectC, SelectD, new_AccumDistC, new_AccumDistD} = VBFLY(State6, StateE, DistG);

    wire [15:0] DistH = TIEsel(BMSel[3], Sum, BMSel[2], neg_Sum, BMSel[1], Diff, BMSel[0], neg_Diff);
    assign {SelectE, SelectF, new_AccumDistE, new_AccumDistF} = VBFLY(State7, StateF, DistH);

    // Store new state metrics
    assign AccumDist0 = new_AccumDist0;
    assign AccumDist1 = new_AccumDist1;
    assign AccumDist2 = new_AccumDist2;
    assign AccumDist3 = new_AccumDist3;
    assign AccumDist4 = new_AccumDist4;
    assign AccumDist5 = new_AccumDist5;
    assign AccumDist6 = new_AccumDist6;
    assign AccumDist7 = new_AccumDist7;
    assign AccumDist8 = new_AccumDist8;
    assign AccumDist9 = new_AccumDist9;
    assign AccumDistA = new_AccumDistA;
    assign AccumDistB = new_AccumDistB;
    assign AccumDistC = new_AccumDistC;
    assign AccumDistD = new_AccumDistD;
    assign AccumDistE = new_AccumDistE;
    assign AccumDistF = new_AccumDistF;

    // Write out the binary encoded paths
    wire [15:0] SelectPaths = {Select0, Select1, Select2, Select3, Select4, Select5, Select6, Select7,
                               Select8, Select9, SelectA, SelectB, SelectC, SelectD, SelectE, SelectF};
    assign VAddr = ars;
    assign MemDataOut16 = SelectPaths;
    // Update the output pointer
    assign ars = ars + 2;
}


// Initialize Backtrace instruction
operation BACKTRACE0 {in AR ars} {out MinState, out Output}
{
    // initialize MinState with the most likely end state
    assign MinState = ars;
    // the LSB is the output bit
    assign Output = ars[0];
}


operation BACKTRACE {inout AR *art}
{inout MinState, out Output, out VAddr, in MemDataIn16}
{
    // Read in paths for trellis column and post-decrement pointer
    assign VAddr = art;
    wire [15:0] Sel = MemDataIn16;
    assign art = art - 2;

    // Select path for trellis state
    wire DataIn8 = TIEmux(MinState[3:0], Sel[15], Sel[14], Sel[13], Sel[12], Sel[11],
        Sel[10], Sel[9], Sel[8], Sel[7], Sel[6], Sel[5], Sel[4], Sel[3], Sel[2], Sel[1], Sel[0]);

    // Trace backward one bit to previous state
    assign MinState = {DataIn8, MinState[3:1]};

    // Save output bit
    assign Output = MinState[1];
}

schedule backtrace_sched {BACKTRACE} {use MinState 2; def MinState 2; def Output 2;}

operation STORE_OUT {inout AR *Addr} {in Output, out VAddr, out MemDataOut8}
{
    assign VAddr = Addr;
    assign MemDataOut8 = {7'b0, Output};
    assign Addr = Addr - 1;
}






