(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, 2010
operation is done with the help of individual Multiply and Add instructions, it will take more than one cycle.
The capability of executing operations in parallel is one of the most desirable features in any DSP processor, as it helps achieve a higher level of optimization and thus largely improves the system's throughput.

A wide range of DSP processors are available in the market with varying features, designed for different kinds of applications; superscalar and VLIW architectures are a popular choice wherever a speedy system is targeted. These systems permit issuance of more than a single instruction to the execution units per cycle. Superscalar machines can dynamically issue multiple instructions per clock cycle, whereas VLIW processors use a long instruction word that contains a usually fixed number of instructions that are fetched, decoded, issued, and executed synchronously. Though these architectures are advantageous, they add to the complexity and cost of the system, which cannot always be afforded. In such a situation an inexpensive, simpler system with the added capability of allowing some instructions to be issued in parallel, like the ADSP Blackfin, provides efficient implementations in terms of both cost and performance.
Loop Unrolling w.r.t Code:
Looping structures present in the code can be unrolled to improve execution speed but, as discussed before, at the cost of increased code memory size. Unrolling the loop makes it possible to take maximum benefit of the core architecture. For example, once the statements are unrolled, independent operations/processes may be carried out in parallel with each other.
Zero Overhead Loops:
Software loop overhead can be significantly reduced by assigning control of the loop to the hardware. A loop counter register holds a value based on the loop-test instructions; this counter is decremented and looping continues until the counter reaches zero. Such a strategy relieves the software of the overhead of testing the loop condition on every iteration, and thus a significant improvement in performance is observed.
Reusing the same Memory for the Trellis structure (Memory Optimization):
Figure 2 shows the decoder's internal design. As shown, once the input buffer is filled with the requisite number of input samples, the data is sent out to the Trellis unit in the form of chunks. The length of the buffer dictates the input latency, but here we assume that the buffer is already filled whenever the data is required, so the input latency is zero. The Trellis unit builds up the trellis, and the buffer keeps feeding it until the length of the trellis reaches the traceback depth. At this point the trellis is traced back by the Trace Back unit to produce the decoded data. Once the trellis is fully traced back, the next call to the traceback unit is initiated immediately after a single buffer fetch instead of waiting for the whole trellis to re-fill. This eliminates the latency of re-filling all the trellis stages, reduces the processing time needed to trace back the whole trellis at once, and thus increases the system's average speed.
Figure 2. Decoder's Internal Configuration
The trellis can be implemented with the help of a circular buffer so that the same memory is re-used, because once the trellis is traced back the data occupying it is no longer useful. This helps reduce the required memory space significantly.

IV.
This section discusses the implementation details of our design on a DSP platform. Our aim is to design an optimized version of Viterbi having a constraint length K=7, i.e. for a total of 64 states, with a code rate of 1/2. However, since the system for a rate 1/3 encoder has quite a similar structure (except for the number of output bits and hence the branch metric size), the optimization techniques discussed for the rate 1/2 system can efficiently be applied there too. So, our work examines performance for both code rates. Moreover, performance estimation for a system with constraint length K=9 has also been considered. It is important to mention here that the size of the input buffer is kept at 10 samples and the traceback depth is set at 100.
An initial design of the system can easily be developed in MATLAB and further analyzed and improved through simulations. The steps include:
The input to the system is a randomly generated bitstream (vector) of length 1000.
This input is then convolutionally encoded.
The resultant encoded data has to be transmitted through the channel, but before transmission it is mapped onto different signal levels. The encoded information is modulated using
After this mapping process has been finished, the signal is transmitted through an AWGN channel. MATLAB has a pre-defined routine for the AWGN channel: by passing the SNR and signal power values as arguments, any signal can be sent through the channel and thus corrupted. The signal-to-noise ratio (SNR) is varied continuously from minimum to maximum so that the system can be tested well over the whole SNR range.
A comparison of the resultant bit error rates of our developed Hard Viterbi code, Soft Viterbi code and