A LOW POWER DESIGN METHODOLOGY FOR TURBO ENCODER AND DECODER

RAJESHWARI. M. BANAKAR

DEPARTMENT OF ELECTRICAL ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY, DELHI
INDIA JULY 2004

Intellectual Property Rights (IPRs) notice Part of this thesis may be protected by one or more of Indian Copyright Registrations (Pending) and/or Indian Patent/Design (Pending) by Dean, Industrial Research & Development (IRD) unit, Indian Institute of Technology Delhi (IITD) New Delhi-110016, India. IITD restricts the use, in any form, of the information, in part or full, contained in this thesis ONLY on written permission of the competent Authority: Dean, IRD, IIT Delhi OR MD, FITT, IIT Delhi.

A LOW POWER DESIGN METHODOLOGY FOR TURBO ENCODER AND DECODER

by RAJESHWARI M. BANAKAR
DEPARTMENT OF ELECTRICAL ENGINEERING

Submitted in fulfillment of the requirements of the degree of Doctor of Philosophy

to the

INDIAN INSTITUTE OF TECHNOLOGY, DELHI INDIA
July 2004

Certificate

This is to certify that the thesis titled “A Low Power Design Methodology for Turbo Encoder and Decoder” being submitted by Rajeshwari M. Banakar in the Department of Electrical Engineering, Indian Institute of Technology, Delhi, for the award of the degree of Doctor of Philosophy, is a record of bona-fide research work carried out by her under our guidance and supervision. In our opinion, the thesis has reached the standards fulfilling the requirements of the regulations relating to the degree. The results contained in this thesis have not been submitted to any other university or institute for the award of any degree or diploma.

Dr. M Balakrishnan
Head
Department of Computer Science & Engineering
Indian Institute of Technology
New Delhi - 110 016

Dr. Ranjan Bose
Associate Professor
Department of Electrical Engineering
Indian Institute of Technology
New Delhi - 110 016


Acknowledgments
I would like to express my profound gratitude to my Ph.D advisors Prof M Balakrishnan and Prof Ranjan Bose for their invaluable guidance and continuous encouragement during the entire period of this research work. Their technical acumen, precise suggestions and timely discussions whenever I faced a problem are wholeheartedly appreciated. I wish to express my indebtedness to the SRC/DRC members Prof Anshul Kumar, Prof Patney and Prof Shankar Prakriya for their helpful suggestions and lively discussions during the research work. I am grateful to Prof Preeti Panda and Prof G.S.Visweswaran for their useful discussions. I would like to thank Prof Surendra Prasad and Prof S.C.Dutta Roy, who were the key persons in the phase of choosing this premier institute for my research activity.

Part of this research work was carried out at the Universität Dortmund, Germany, under the DSD/DAAD project. My sincere thanks to Prof Peter Marwedel for his valuable guidance and pleasant gestures during my stay there. Thanks to the research scholars Lars Wehmeyer and Stefan Steinke, and to Bosik Lee, a graduate student of the embedded systems design group, for their help and support. I am thankful to the Director of Technical Education, Karnataka, and the management of B.V Bhoomaraddi Engineering College, Hubli, Karnataka, for granting leave and giving me this opportunity to do my research.

My thanks are due to my friends B. G. Prasad, S. C. Jain, Manoj, Basant, Bhuvan, Satyakiran, Anup, Viswanath, Vishal, C. P. Joshi, Uma and the other enthusiastic embedded system team members at the Philips VLSI Design Lab for their lively discussions. It was an enjoyable period and a great experience to work as a team member in the energetic embedded systems group at IIT Delhi. I would also like to thank all the other faculty, laboratory and administrative staff of the Department of Computer Science and Engineering and the Department of Electrical Engineering for their help. Ms Vandana Ahluwalia deserves a special mention for her helpful and pleasant attitude throughout the research period.

Thanks to my brothers, sisters, friends and other family members for their support. I wish to accord my special thanks to my husband Dr Chidanand Mansur for his moral and emotional support throughout the deputation period of this research work. With highest and fond remembrance I recollect each day of this research work, which turned out to be enjoyable, comfortable and cherished due to my caring kids Namrata and Vidya. Their helping little hands, lovely painted greeting cards and smiling faces on the home front at IIT Delhi need a special appreciation. This thesis is dedicated to my parents, late Shri Mahalingappa Banakar and late Smt Vinoda Banakar, who were a source of inspiration during their lifetime.

New Delhi

Rajeshwari Banakar


Abstract
The focus of this work is on developing an application specific design methodology for low power solutions. The methodology starts from high level models which can be used for a software solution and proceeds towards high performance hardware solutions. The turbo encoder/decoder, a key component of the emerging 3G mobile communication standards, is used as our case study. The application performance measure, namely the bit-error rate (BER), is used as a design constraint while optimizing for power and/or area. The methodology starts at the algorithmic level, concentrating on functional correctness rather than on the implementation architecture. The effect on performance of variation in parameters like frame length, number of iterations, type of encoding scheme and type of interleaver in the presence of additive white Gaussian noise is studied with a floating point C model. In order to obtain the effect of quantization and word length variation, a fixed point model of the application is also developed.

First, we conducted a motivational study on some benchmarks from the DSP domain to evaluate the benefit of a custom memory architecture like the scratch pad memory (SPM). The results indicate that SPM is an energy efficient solution when compared to a conventional cache as on-chip memory. Motivated by this, we have developed a framework to study the benefits of adding a small SPM to the on-chip cache. To illustrate our methodology we have used the ARM7TDMI as the target processor. In this application, a cache size of 2k or 4k combined with an SPM of size 8k is shown to be optimal. Incorporating SPM results in an energy improvement of 51.3%.

Data access pattern analysis is performed to see whether specific storage modules like FIFO/LIFO can be used. We identify FIFO/LIFO to be an appropriate choice due to the regular access pattern. A SystemC model of the application is developed starting from the C model used for software. The functional correctness of the SystemC model is verified by generating a test bench using the same inputs and intermediate results used in the C model. One of the computation units, namely the backward metric computation unit, is modified to reduce the memory accesses. The HDL design of the 3GPP turbo encoder and decoder is used to perform bit-width optimization. The effect of bit-width optimization on power and area is observed. This is achieved without unduly compromising on BER. The overall area and power reduction achieved in one decoder due to bit-width optimization is 46% and 35.6% respectively.

At the system level, power optimization can be done using power shut-down of unused modules, based on the timing details of the turbo decoder in the VHDL model. To achieve this, a power manager is proposed. The average power saving obtained is around 55.5% for the 3GPP turbo decoder.


Contents

1 Introduction
    1.1 The system design problem
    1.2 Turbo Encoder and Decoder
    1.3 Thesis Organization

2 Turbo coding application
    2.1 Introduction
        2.1.1 Applications of turbo codes - A scenario in one decade (1993-2003)
    2.2 Turbo Encoder
        2.2.1 Four state turbo encoder
        2.2.2 Eight state turbo encoder - 3GPP version
    2.3 Turbo Decoder
        2.3.1 Decoding Algorithm
    2.4 Experimental set up
    2.5 Simulation Results
        2.5.1 Effect of varying frame-length on code performance
        2.5.2 Effect of varying number of iterations on performance
        2.5.3 4 state vs 8 state (3GPP) turbo decoder performance
        2.5.4 Interleaver design considerations
        2.5.5 Effect of Quantization
    2.6 Summary

3 On chip Memory Exploration
    3.1 Introduction
    3.2 Related Work
    3.3 Scratch pad memory as a design alternative to cache memory
        3.3.1 Cache memory
        3.3.2 Scratch pad memory
        3.3.3 Estimation of Energy and Performance
        3.3.4 Overview of the methodology
        3.3.5 Results and discussion
    3.4 Turbo decoder : Exploring on-chip design space
        3.4.1 Overview of the approach
        3.4.2 Experimental setup
        3.4.3 Results
    3.5 Summary

4 System Level Modeling
    4.1 Introduction
    4.2 Related work
    4.3 Work flow description
    4.4 Functionality of turbo codes
    4.5 Design and Analysis using SystemC specification
        4.5.1 Turbo decoder : Data access analysis
        4.5.2 Decoder : Synchronization
    4.6 Choice of on-chip memory for the application
    4.7 On-chip memory evaluation
        4.7.1 Reducing the memory accesses in decoding process
    4.8 Synthesis Observations
    4.9 Summary

5 Architectural optimization
    5.1 Objectives
    5.2 Related work
    5.3 Bit-width precision
    5.4 Design modules
        5.4.1 Selector module
        5.4.2 Top level FSM controller unit
    5.5 Design flow
        5.5.1 Power estimation using XPower
    5.6 Experimental Observations
        5.6.1 Impact of bit optimal design on area estimates
        5.6.2 Power reduction due to bit-width optimization
    5.7 Summary

6 System Level Power Optimization
    6.1 Objectives
    6.2 Related work
    6.3 General design issues - Power Shut-Down Technique
    6.4 Turbo Decoder : System level power optimization
    6.5 Experimental setup
        6.5.1 Experimental results
        6.5.2 Power optimization
        6.5.3 Branch metric simplification
    6.6 Summary

7 Conclusion
    7.1 Our design methodology
    7.2 Results summary
    7.3 Future directions

List of Figures

2.1 (a) The encoder block schematic (b) 4 state encoder details (Encoder 1 and Encoder 2 are identical in nature) (c) State diagram representation (d) Trellis diagram
2.2 The 8 state encoder – 3GPP version
2.3 Decoder schematic diagram
2.4 Impact on performance for different frame-length used, simulated for six iterations varying the SNR
2.5 Impact on performance for varying number of iterations, N = 1024, 8 state turbo code
2.6 Comparing 4 state and the 8 state turbo code (floating point version), N = 1024, RC symmetric interleaver, six iterations
2.7 Random interleaver design - Our work with Bruce's 4 state turbo code [61]
2.8 Illustration of interleaving the data elements
2.9 Comparing our work with Mark Ho's [41] using a symmetric interleaver
2.10 Impact of word length variation on bit error rate of the turbo codes
3.1 Cache Memory organization [75]
3.2 Scratch pad memory array
3.3 Scratch pad memory organization
3.4 Flow diagram for on-chip memory evaluation
3.5 Comparison of cache and scratch pad memory area
3.6 Energy consumed by the memory system
3.7 Framework of the design space exploration
3.8 Energy estimates for the turbo decoder with N = 128
3.9 Energy estimates for the turbo decoder with N = 256
3.10 Performance estimates for Turbo decoder
4.1 The turbo coding system
4.2 Input data with the index value
4.3 Code segment to compute llr
4.4 Synchronization signals illustrating the parallel events
4.5 Data flow and request signals
4.6 (a) Unmodified β unit (b) Modified β unit
4.7 Area distribution pie chart representation
5.1 State diagram representation of FSM used in the turbo decoder design
5.2 Design flow showing the various steps used in the bit-width analysis
5.3 Depicting the area for 32 bit design vs bit optimal design
5.4 Power estimates of turbo decoder design
5.5 Area and Power reduction due to bit-width optimization of the turbo decoder design
6.1 Illustrating the turbo decoder timing slots which help to define the sleep time and the wake time of the on chip memory modules
6.2 Tasks in the decoder vs the time slot (Task graph)
6.3 Block diagram schematic of the power manager unit
7.1 Turbo Decoder : Low power application specific design methodology

List of Tables

3.1 Cache memory interaction model
3.2 Memory access cycles
3.3 Area and Performance gains for bubble-sort example
3.4 Energy per access for various modules
3.5 Energy estimates in nJ for various cache and SPM combinations
3.6 Energy estimates in nJ for various cache and SPM combinations
4.1 Memory accesses with/without the β unit modification
5.1 γ unit synthesis results with the input data bit-width set to 6
5.2 Area reduction of α unit
5.3 β unit area and frequency estimates
5.4 Extrinsic unit synthesis results
5.5 Synthesis observations of LLR unit
5.6 Decoder unit synthesis observations
5.7 Decoder power estimates varying bit-width and varying switching activity of the design
6.1 Active memory modules in the decoder
6.2 Power estimates in mW and area in terms of LUTs
6.3 Illustrating power (energy) saving using sleep and wake time analysis in the memory modules of the decoder
6.4 Power estimates in the individual decoder, active time slot information with the associated power saving due to sleep and wake time analysis
6.5 Database to design the power manager for the 8 state turbo decoder (1 = Active, 0 = Doze)
6.6 Power and area saving due to the branch metric simplification

Chapter 1 Introduction
Application specific design is a process of alternative design evaluations and refined implementations at various abstraction levels to provide efficient and cost effective solutions. The key issues in an application specific design are cost and performance as per the application requirement, with power consumption being the other major concern. In hand held and portable applications like cellular phones, modems, video phones and laptops, batteries contribute a significant fraction of the total volume and weight [1, 2]. These applications demand low power solutions to increase battery life. The growing concern for early time to market, coupled with the complexity of the systems, has led to the research and development of efficient design methodologies. Traditionally, manually prepared RTL design documents handed over to the application developer have been used for implementation. Since the introduction of hardware-software codesign concepts in this decade (1993 to 2003), suitable frameworks have been favored due to their overall operational advantages when details of performance, energy and area costs are captured during the system level design stage. To define an application specific methodology, it is essential to specify the steps needed to arrive at a suitable design while ensuring the application functionality and features. As the design methodology progresses, detailed functional assessment continues to provide valuable input for analyzing initial design concepts and, in later stages, the application constraints and quality of implementation. The design methodology can be used to make significant changes during the early stages, which results in a wide design space exploration. A key concern is the design turnaround time. A common test bench across various levels of design abstraction is one approach which is effective in reducing development time.

1.1 The system design problem
The design space, including the memory design issues, of power efficient architectures is different from the design space of general-purpose processors. In particular, when the application specific design space is considered, the application parameters which impact the quality should be taken into account, and these are quite application specific. The system design problem is to identify the application parameters which will have an impact on the overall architecture. Increasingly, today's system design approaches try to address these issues at a high level of design abstraction while providing a structured methodology for refinement of the design. Low power design of digital integrated circuits has emerged as a very active and rapidly developing field. Power estimation can be done at different stages of system design, as illustrated in [3 - 29]. These estimations can drive the transformations to choose low power solutions. Transformations done at a higher level of abstraction, namely the system level, have a greater impact, whereas estimations are generally more accurate at lower levels of abstraction like the transistor level and the circuit level [30]. Klaus et al. [31] present a high level DSP design methodology, raising the design entry to the algorithmic and behavioral level instead of specifying a detailed architecture at the register transfer level. They concentrate mainly on the issues of functional correctness and verification. No optimization techniques are considered in their work. Another design methodology example is the work of Zervas et al. [32]. They propose a design methodology using communication applications for performance analysis. Power savings through various levels of the design flow are considered and examined for their

efficiency in the digital receiver context. The authors deal mainly with data path optimization issues for low power implementation. In our approach, architecture exploration using SystemC is performed, quantifying the power savings using bit-width optimization and the power shut-down technique for the application using the control flow diagram. The application parameters, like the information length and its performance parameter, namely the bit error rate, are correlated to the power consumption during the bit-width optimization. Memory units represent a substantial portion of the power as well as area and form an important constituent of such applications. Typically, power consumed by memory is distributed across on-chip as well as off-chip memory. A large memory implies slower access time to a given memory element and also increases the power consumption. Memory dominates in terms of the three cost functions, viz. area, performance and power. On-chip caches using static RAM consume power in the range of 25% to 45% of the total chip power [33]. For example, in the StrongARM 110, the cache consumes 43% of the total power [34].

Catthoor et al. [35, 36] describe the custom memory design issues. Their research focuses on the development of methodologies and supporting tools to enable the modeling of future heterogeneous reconfigurable platform architectures and the mapping of applications onto these architectures. They analyse memory dominated media applications for data transfer and storage issues. Although they concentrate on a number of issues related to reducing memory power, a custom memory option like the scratch pad memory is not studied. When considering power efficient architectures, along with performance and cost, on-chip memory issues are also included as an important dimension of the design space to be explored. The application's algorithmic behavior, in terms of its data access pattern, is an additional feature to be included in the design space exploration to take a decision on the type of on-chip memory. One of our research goals is to provide a suitable methodology for on-chip memory

exploration issues in power efficient application specific architectures. In our solution, scratch pad memory is proposed as an attractive design alternative. At the system level, the energy and performance gain obtained by adding a small scratch pad memory to an on-chip cache configuration is evaluated. Further, in a situation of distributed memory, shut-down of unused memory blocks in a systematic manner is proposed.

1.2 Turbo Encoder and Decoder
Turbo codes were presented in 1993 by Berrou et al. [37] and since then these codes have received a lot of interest from the research community, as they offer better performance than any of the other codes at very low signal to noise ratios. Turbo codes achieve near Shannon-limit error correction performance with relatively simple component codes. A BER of $10^{-5}$ is reported for a signal to noise ratio of 0.7 dB [37].

Advances in third generation (3G) systems and beyond will put tremendous pressure on the underlying implementation technologies. The main challenge is to implement these complex algorithms under strict power consumption constraints. Improvements in semiconductor fabrication and the growing popularity of hand held devices are resulting in computation intensive portable devices. Exploration of novel low power application specific architectures and methodologies which address power efficient design capable of supporting such computing requirements is critical. In communication systems, specifically wireless mobile communication applications, power and size are dominant factors while meeting the performance requirements, performance being measured not only by the data rate but also by the corresponding bit error rate (BER). In this one decade (1993 - 2003) turbo codes have been used in a number of applications like long-haul terrestrial wireless systems, 3G systems, WLAN, satellite communication, orthogonal frequency division multiplexing etc.

The algorithmic specifications and the performance characterization of the turbo codes form a pre-requisite study for developing a low power solution. Functional analysis of the turbo decoder with a simplified computation metric to save power, as well as architectural features like bit-width, are closely related to application parameters like the bit-error rate (BER).

1.3 Thesis Organization
The thesis is organized as follows. Chapter 2 deals with the algorithmic details of the specific application under consideration, namely the turbo coding system. The effect of varying parameters like the frame length, the number of iterations, the type of encoder, the type of interleaver and quantization is presented. The main focus of this chapter is the development of the various C models of the application, which is the first step in our application specific design methodology. In Chapter 3, a motivational study for the evaluation of energy efficient on-chip memory is conducted. A design methodology is developed to show that scratch pad memory is an energy efficient on-chip memory option in embedded systems vis-a-vis cache. For the specific turbo decoder, which is memory intensive, we demonstrate that a suitable combination of cache and scratch pad memory proves beneficial in terms of energy consumption, area occupied and performance. In Chapter 4, data access analysis is done to see if any other custom memory is suitable for the specific application. It is concluded that FIFO/LIFO on-chip memory is appropriate. The architecture is explored at the system level using SystemC for functional verification of the design, and the benefit of modifying one of the computational units to reduce the memory accesses is shown. In Chapter 5, an architectural optimization, namely bit-width optimization, is considered as a low power design issue. A VHDL model of the 3G turbo decoder is developed to estimate the power and area occupied by the design. In Chapter 6, a system level power optimization technique is studied. More specifically, the power shut-down method is considered. The sleep time and the wake-up time of the design modules are derived from the timing details available from the HDL model. As a system level power optimization option, the power shutdown technique is applied to the turbo codes to quantify the power savings. The design of a suitable power manager is proposed. In Chapter 7, the application specific design methodology is summarized with a focus on low power. The results of our work on the low power turbo encoder and decoder are also summarized, along with future directions in this area.


Chapter 2 Turbo coding application
2.1 Introduction
An efficient methodology for application specific design reduces the time and effort spent during design space exploration. The turbo code application from the area of wireless communications is chosen as the key application for which an application specific design methodology is developed. The functionality and specific characteristics of the application are needed to carry out the design space exploration. The application characteristics studied are the impact on the performance of the turbo codes of variation in the size of the input message (frame-length), the type of the interleaver and the number of decoding iterations. Turbo coding is a forward error correction (FEC) scheme. Iterative decoding is the key feature of turbo codes [37, 38]. Turbo codes consist of a concatenation of two convolutional codes. Turbo codes give better performance at low SNRs (signal to noise ratios) [39, 40]. Interestingly, the name Turbo was given to these codes because of the cyclic feedback mechanism (as in turbo machines) to the decoders in an iterative manner. The turbo encoder transmits the encoded bits, which form the inputs to the turbo decoder. The turbo decoder decodes the information iteratively. Turbo codes can be concatenated in series, in parallel or in a hybrid manner. Concatenated codes can be classified as parallel concatenated convolutional codes (PCCC) or serial concatenated convolutional codes (SCCC).

In PCCC, two encoders operate on the same information bits. In SCCC, one encoder encodes the output of another encoder. The hybrid concatenation scheme consists of the combination of both parallel and serial concatenated convolutional codes. The turbo decoder has two decoders that perform iterative decoding. Convolutional codes can be viewed in different ways; in particular, they can be represented by state diagrams. The state of the convolutional encoder is determined by the contents of its memory elements.

Two versions of the turbo codes are studied in detail. One is the four state turbo code and the other is the 3GPP (third generation partnership project) version, which has eight states. The BER in case of the 3GPP turbo code shows significant improvement over the four state version, although the computational complexity of the decoder is high when compared to the four state turbo code system. The impact of quantization on the BER at the algorithmic level is studied.

In this chapter, the impact of different frame lengths of the message on the bit error rate (BER) is analyzed. The impact of varying the number of iterations on the BER is studied. The model developed is compared to Mark Ho’s model [41, 42] and the results have been found to closely tally. A comparison of the performance of the random interleaver implementation and the symmetric interleaver is done.

The rest of the chapter is organized as follows. The various application areas of turbo codes are described in subsection 2.1.1. Section 2.2 describes the details of the encoder schematic. Section 2.3 discusses the turbo decoder and its performance issues. Section 2.4 gives the experimental set up. Section 2.5 presents the results of various experiments. The interleaver design considerations are discussed, followed by an explanation of the effect of quantization. Finally, section 2.6 summarizes the chapter.

2.1.1 Applications of turbo codes - A scenario in one decade (1993-2003)
The following paragraphs briefly discuss several application areas where turbo codes are used. Mobile radio :
     

Few environments require more immunity to fading than those found in mobile communications. For applications where the delay vs performance trade-off is crucial, turbo codes offer a wide trade-off space at decoder complexities equal to or better than conventional concatenated code performance. The major benefit is that such systems can work with smaller constraint-length convolutional encoders. The major drawback is decoder latency. Turbo codes with short delay are being heavily researched. Turbo codes generally outperform convolutional and block codes when interleavers exceed 200 bits in length [43].

Digital video : Turbo codes are used as part of the Digital Video Broadcasting (DVB) standards. ComAtlas, a French subsidiary of VLSI Inc, has developed a single-chip turbo code codec that operates at rates up to 40 Mbps. The device integrates a 32x32-bit interleaver, and performance is reportedly at least as good as with conventional concatenated codes using a Reed-Solomon outer code and a convolutional inner code [44].

Long-haul terrestrial wireless : Microwave towers are spread over long distances and communications between them are subject to weather induced fading. Since these links utilize high data rates, turbo codes with large interleavers effectively combat fading, while adding only insignificant delay. Furthermore, power savings could be important for towers in especially remote areas, where turbo codes are used [45].

Satellite communications and Deep space communication :
     

Turbo codes are used for deep-space applications. Low power design is important for many communication satellites that orbit closer to earth, and turbo codes can be used in such applications. Many of these satellites are equipped with programmable FECs using turbo codes [46, 47]. Turbo codes can include serial as well as parallel concatenations, and a NASA group has also proposed hybrid structures. Turbo encoders are used in spacecraft [48].

Military applications : The natural applicability of turbo codes to spread-spectrum systems provides increased opportunity for anti-jam and low probability of intercept (LPI) communications. In particular, very steep BER vs Eb/No curves lead to a sharp demarcation between geographic locations that can receive communication with just sufficient Eb/No and those where the Eb/No is insufficient. Turbo codes are used in a number of military and defense applications.

Terrestrial telephony : Turbo codes are suited for down-to-earth applications such as terrestrial telephony [45]. Satellite technology provides global coverage and service. UMTS is being standardized to ensure an efficient and effective interaction between satellite and terrestrial networks. An important third generation cellular standard is cdma2000, which is standardized by the third generation partnership project (3GPP). As in UMTS (Universal Mobile Telephone Service), cdma2000 systems use turbo codes for forward error correction (FEC). While the turbo codes used by these two systems are very similar, the differences lie in the interleaving algorithm, the range of allowable input sizes and the rate of the constituent RSC encoders [49].
Oceanographic Data-Link :
         

This data-link is based on the Remote Environmental Data-Link (REDL) design. ViaSat is a company offering satellite networking products. ViaSat offers a modem with turbo encoding and decoding. It is reported that turbo coding increases the throughput and decreases the bandwidth requirement [50].

Image processing : Embedded image codes are very sensitive to channel noise because a single bit error can lead to irreversible loss of synchronization between the encoder and the decoder. Turbo codes are used for forward error correction in robust image transmission. Turbo codes are suited to protecting visual signals, since these signals are typically represented by a large amount of data even after compression [51, 52, 53].

WLAN (Wireless LAN) : Turbo codes can increase the performance of a WLAN over traditional convolutional coding techniques. Using turbo codes, an 802.11a system can be configured for high performance. The resulting benefits to the WLAN system are that it requires less power, or it can transmit over a greater coverage area. A turbo code solution is used to reduce power and boost performance in the transmit portion of mobile devices in a wireless local area network (WLAN) [54].

OFDM : The principles of orthogonal frequency division multiplexing (OFDM) modulation have been around for several decades. The FEC block of an OFDM system can be realized by either a block-based coding scheme (Reed-Solomon) or turbo codes [55].

xDSL modems : Turbo codes present a new and very powerful error control technique which allows communication very close to the channel capacity. Turbo codes have outperformed all previously known coding schemes regardless of the targeted channel. Standards based on turbo codes have been defined [56].

These applications indicate that turbo codes can play a vital role in many of the communications systems which are being designed. Low power, memory efficient implementation of turbo codes is thus an important research area.

2.2 Turbo Encoder
The turbo coding system consists of the turbo encoder and the turbo decoder. The four state and the eight state (3GPP) turbo encoders used in our design are described below.

2.2.1 Four state turbo encoder
The general structure of a turbo encoder consists of two Recursive Systematic Convolutional (RSC) encoders, Encoder 1 and Encoder 2. The constituent codes are RSCs because they combine the properties of non-systematic and systematic codes [40, 38]. In the encoder architecture displayed in Figure 2.1 the two RSCs are identical. The N bit data block is first encoded by Encoder 1. The same data block is also interleaved and encoded by Encoder 2. The main purpose of the interleaver is to randomize burst error patterns so that they can be correctly decoded. It also helps to increase the minimum distance of the turbo code [57]. Input data blocks for a turbo encoder consist of the user data and possible extra data appended to the user data before turbo encoding. The encoder consists of a shift register and adders as shown in Fig. 2.1 (b). The structure of the RSC encoder is fixed for the design, because allowing varying encoder structures would significantly increase the complexity of the decoder by requiring the individual decoders to adapt to the new trellis structure and the computation of the different metrics.

The input bits are fed into the left end of the register and for each new input bit two output bits are transmitted over the channel. These bits depend not only on the present input bit, but also on the two previous input bits stored in the shift register. The bit $x^s[k]$ forms the systematic bit stream for the input message of frame-length N, while the parity bit stream $x^{p1}[k]$ depends on the state of the encoder. For a given information vector (input message) $u$, the code word (encoded message) consists of the concatenation of two constituent convolutional code words, each of which is based on a different permutation of the order of the information bits.

Figure 2.1: (a) The encoder block schematic (b) 4 state encoder details (Encoder 1 and Encoder 2 are identical in nature) (c) State diagram representation (d) Trellis diagram

The working of the encoder can be understood using the state diagram given in Fig. 2.1 (c). In an encoder with two memory elements, there are four states $S_0$, $S_1$, $S_2$ and $S_3$. At each incoming bit, the encoder goes from one state to another. For the encoder shown in Fig. 2.1 (b), there are two encoded bits corresponding to input bits '0' and '1'. Another way to represent the state diagram is the trellis diagram [58]. A trellis is a graph whose nodes lie in a rectangular grid, semi-infinite to the right, as shown in Fig. 2.1 (d). The trellis diagram represents the time evolution of the coded sequences, which can be obtained from the state diagram. A trellis diagram is thus an extension of a state diagram that explicitly shows the passage of time [58]. The nodes in the trellis diagram correspond to the states of the encoder. From an initial state ($S_0$) the trellis records the possible transitions to the next states for each possible input pattern. In the trellis diagram of Fig. 2.1 (d), at stage t=1 there are two reachable states, and each state has two transitions corresponding to the input bits '0' and '1'. Hence the number of nodes is fixed in a trellis, decided by the number of memory elements in the encoder. In drawing the trellis diagram, the convention of drawing branches introduced by a '0' input bit as solid lines and branches introduced by a '1' input bit as dashed lines is used.

2.2.2 Eight state turbo encoder - 3GPP version

In order to compare and study the characteristics in terms of performance, both the four state and the eight state encoders are implemented. The difference essentially lies in the number of memory elements each encoder uses. In contrast to the two memory elements of the four state encoder, there are three memory elements in an eight state encoder. This is the encoder specified in the 3GPP standards [38].

Figure 2.2: The 8 state encoder – 3GPP version

The schematic diagram of the 3GPP encoder is shown in Figure 2.2. It consists of two eight state constituent encoders. The data bit stream goes into the first constituent encoder, which produces a parity bit for each input bit. The data bit stream is also scrambled by the interleaver and fed to Encoder 2. For 3GPP, the interleaver can be anywhere from 40 bits to 5114 bits long. The data sent across the channel consists of the original bit stream, the parity bits of Encoder 1 and the parity bits of Encoder 2, so the entire turbo encoder is a rate 1/3 encoder.
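Building on the sketch in Section 2.2.1 (reusing its hypothetical rsc_state type and rsc_step() function), the rate 1/3 parallel concatenation can be sketched as follows; this shows the encoder structure only, not the exact 8 state 3GPP constituent encoders, and perm[] stands for the interleaver permutation.

    /* Sketch of rate-1/3 turbo encoding of one frame of length n:
       systematic bits, parity from Encoder 1, and parity from Encoder 2
       operating on the interleaved data. */
    void turbo_encode(const int *u, const int *perm, int n,
                      int *xs, int *xp1, int *xp2)
    {
        rsc_state e1 = {0, 0}, e2 = {0, 0};
        for (int k = 0; k < n; k++) {
            xs[k]  = u[k];                       /* systematic stream       */
            xp1[k] = rsc_step(&e1, u[k]);        /* parity of Encoder 1     */
            xp2[k] = rsc_step(&e2, u[perm[k]]);  /* parity on interleaved u */
        }
    }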

2.3 Turbo Decoder
In this section the iterative decoding process of the turbo decoder is described. The maximum a posteriori (MAP) algorithm is used in the turbo decoder. There are three variants of the algorithm used in turbo decoders, namely MAP, Max-Log-MAP and Log-MAP. The MAP algorithm is a forward-backward recursion algorithm which minimizes the probability of bit error, but it has high computational complexity and numerical instability. The solution to these problems is to operate in the log domain. One advantage of operating in the log domain is that multiplication becomes addition. Addition, however, is not straightforward: in the log domain, addition becomes a maximization function plus a correction term. The Max-Log-MAP algorithm approximates this addition solely as a maximization. The Max-Log-MAP algorithm is used in our work [38].
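As a small illustration of this point, the following C sketch contrasts exact log domain addition (the Jacobian logarithm used by Log-MAP) with the Max-Log-MAP form that keeps only the maximization; this is a generic sketch, not code from the thesis.

    #include <math.h>

    /* Exact log-domain addition (Log-MAP):
       log(e^a + e^b) = max(a,b) + log(1 + e^(-|a-b|)) */
    double logmap_add(double a, double b)
    {
        double m = a > b ? a : b;
        return m + log1p(exp(-fabs(a - b)));  /* max plus correction term */
    }

    /* Max-Log-MAP approximation: keep only the maximization. */
    double maxlog_add(double a, double b)
    {
        return a > b ? a : b;
    }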

2.3.1 Decoding Algorithm
The block diagram of the turbo decoder is shown in Figure 2.3. There are two decoders, corresponding to the two encoders. The inputs to the first decoder are the observed systematic bits, the parity bit stream from the first encoder and the deinterleaved extrinsic information from the second decoder. The inputs to the second decoder are the interleaved systematic bit stream, the observed parity bit stream from the second RSC and the interleaved extrinsic information from the first decoder. The main task of the iterative decoding procedure in each component decoder is an algorithm that computes the a posteriori probability (APP) of the information symbols, which is the reliability value for each information symbol. The sequence of reliability values generated by a decoder is passed to the other one. In this way, each decoder takes advantage of the suggestions of the other one. To improve the correctness of its decisions, each decoder has to be fed with information that does not originate from itself. The concept of extrinsic information was introduced to identify the component of the general reliability value which depends on the redundant information introduced by the considered constituent code. A natural reliability value, in the binary case, is the log likelihood ratio (llr). Each decoder has a number of compute intensive tasks to be done during decoding. There are five main computations to be performed during each iteration in the decoding stage, as shown in Figure 2.3. Detailed derivations of the algorithm can be found in [59, 60, 38]. The computations are as follows (computations in one decoder per iteration):

Figure 2.3: Decoder schematic diagram

Branch metric computation: γ unit

In the algorithm for turbo decoding the first computational block is the branch metric computation. The branch metric is computed based on the knowledge of the input and output associated with the branch during the transition from one state to another (Fig. 2.1 (d)). There are four states and each state has two branches, which gives a total of eight branch metrics. The computation of the branch metric is done using (2.1). Upon simplifying this equation as shown in [61, 62], it is observed that only two values are sufficient to derive the branch metrics for all the state transitions.

    \gamma_k(s', s) = \exp\left[ \frac{1}{2}\, x^s[k] \left( L_e[k] + L_c\, y^s[k] \right) + \frac{1}{2}\, L_c\, x^p[k]\, y^p[k] \right]        (2.1)

where \gamma_k is the branch metric at time k, x^s[k] are the systematic bits of the information with frame-length N, L_e[k] is the information that is fed back from one decoder to the other decoder, L_c is the channel estimate which corresponds to the maximum signal to distortion ratio, x^p[k] is the encoded parity bits of the encoder, y^p[k] is the observed values of the encoded parity bits and y^s[k] is the observed values of the encoded systematic bits. The γ unit takes the noisy systematic bit stream, the parity bits from encoder 1 and encoder 2 (for decoder 1 and decoder 2 respectively) and the apriori information to compute the branch metrics. The branch metrics computed for all branches in the trellis are stored.

Forward metric computation: α unit

The forward metric α is the next computation in the algorithm; it represents the probability of a state at time k, given the probabilities of the states at the previous time instance. α is calculated using equation (2.2):

    \alpha_k(s) = \sum_{s'} \alpha_{k-1}(s')\, \gamma_k(s', s)        (2.2)

where the summation is over all the state transitions s' to s. α is computed at each node at a time instance k in the forward direction, traversing through the trellis. Observing a section of the trellis diagram in Figure 2.1, four α metrics are to be computed, namely \alpha_{00}, \alpha_{01}, \alpha_{10} and \alpha_{11}, where k ranges over [0, 1, 2, ..., N-1]. This metric is termed the forward metric because the computation order is 0, 1, 2, ..., N-1, and the value for index 0 is initialized. The α unit recursively computes the metric using the γ values computed in the above step. A forward recursion on the trellis is performed by computing α for each node in the trellis: the α of a node is the sum of the previous alpha times the branch metric along each branch from the two previous nodes to the current node. α is computed for states 00, 01, 10 and 11 (refer Fig 2.1 (d)) in a 4 state decoder. In an 8 state decoder (3GPP version), α is computed for states 000, 001, 010, 011, 100, 101, 110 and 111.

Backward metric computation: β unit

The backward state probability of being in each state of the trellis at each time k, given the knowledge of all the future received symbols, is recursively calculated and stored. The backward metric β is computed using equation (2.3) in the backward direction, going from the end to the beginning of the trellis, at time instance k-1, given the probabilities at time instance k:

    \beta_{k-1}(s') = \sum_{s} \beta_k(s)\, \gamma_k(s', s)        (2.3)

where the state transition is from s' to s. Backward metric computation can start only after the completion of the computation of γ by the γ unit. Observing the trellis diagram, there are four β values to be computed for each node in the trellis, namely \beta_{00}, \beta_{01}, \beta_{10} and \beta_{11}, but now in backward order, from N-1 down to 0. A backward recursion on the trellis is performed by computing β for each node, utilizing the γ values and the β values initialized for index k equal to N-1. The computation is the same as for α, but starting at the end of the trellis and going in the reverse direction. β is computed for states 00, 01, 10 and 11 (refer Fig 2.1 (d)) in a 4 state decoder. In an 8 state decoder (3GPP version), β is computed for states 000, 001, 010, 011, 100, 101, 110 and 111.

Log likelihood ratio: llr unit

The log likelihood ratio llr is the output of the turbo decoder. The output llr for each symbol at time k is calculated as

    llr[k] = \log \frac{ \sum_{(s',s):\, u[k]=1} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s) }{ \sum_{(s',s):\, u[k]=0} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s) }        (2.4)

where the numerator is the summation over all the state transitions s' to s with input message bit u[k] = 1, and the denominator is the summation over all the state transitions s' to s with input message bit u[k] = 0. The α values, the γ unit output and the β values obtained from the above steps are used to compute the llr values. The main operations are comparison, addition and subtraction. Finally, these values are de-interleaved at the second decoder output, after the required number of iterations, to make the hard decision in order to retrieve the information that was transmitted. The llr is a convenient measure since it encapsulates both the soft and hard bit information in one number. The sign of the number corresponds to the hard decision while the magnitude gives a reliability estimate.

Extrinsic unit

This unit computes the extrinsic information that is to be fed to the next decoder in the iteration sequence. This is the llr minus the input probability estimate: the extrinsic information is obtained from the log likelihood ratio by subtracting the weighted channel systematic bits and the information fed from the other decoder. The extrinsic computation uses the llr outputs, the systematic bits and the apriori information to compute the extrinsic value. This is the value that is fed back to the other decoder as the apriori information.

This sequence of computations is repeated in each iteration by each of the two decoders. After all iterations are complete, the decoded information bits can be retrieved by simply looking at the sign bit of the llr: if it is positive the bit is a one, if it is negative the bit is a zero. This is because the llr is defined to be the logarithm of the ratio of the probability that the bit is a one to the probability that the bit is a zero. The complete derivations of the equations for the MAP turbo decoding algorithm can be found in a number of references [38, 61, 40, 39].
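To make the recursions above concrete, the following is a minimal C sketch of the forward recursion (2.2) as realized in the Max-Log-MAP (log) domain, where products become sums and the summation becomes a maximization; the backward recursion (2.3) is symmetric. The trellis table and the assumption that the encoder starts in state S0 follow the hypothetical (7,5) RSC sketched in Section 2.2.1, not necessarily the exact design used in this work.

    #include <float.h>

    #define NSTATES 4

    /* prev_state[s][b]: predecessor of state s when the input bit was b,
       derived from the (7,5) RSC trellis of the earlier sketch. */
    static const int prev_state[NSTATES][2] = {
        {0, 1}, {3, 2}, {1, 0}, {2, 3}
    };

    static double max2(double a, double b) { return a > b ? a : b; }

    /* alpha[k][s]: log-domain forward metric at trellis stage k;
       bm[k][s][b]: log-domain branch metric into state s for input bit b. */
    void forward_recursion(double alpha[][NSTATES],
                           double bm[][NSTATES][2], int n)
    {
        for (int s = 0; s < NSTATES; s++)          /* encoder starts in S0 */
            alpha[0][s] = (s == 0) ? 0.0 : -DBL_MAX;

        for (int k = 1; k <= n; k++)
            for (int s = 0; s < NSTATES; s++)
                alpha[k][s] = max2(                /* max replaces the sum */
                    alpha[k-1][prev_state[s][0]] + bm[k-1][s][0],
                    alpha[k-1][prev_state[s][1]] + bm[k-1][s][1]);
    }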
2.4 Experimental set up
To study the effects of varying various parameters on the BER, a C version of the application is developed as the software solution of the application:

- To compare the performance of the four state turbo code and the eight state (3GPP) turbo code, a floating point version of the 3GPP turbo code is developed. It assists in observing the effect of the type of encoder design used in each case.

- A floating point version of the four state turbo encoder and decoder is developed. This helps us to obtain the BER for various signal to noise ratios and to test its functionality. The test bench modeled during the application characterization is reused later during the hardware solution of the design.

- Effects of varying application parameters (frame length and iterations) are assessed using the developed C model. These are used later to correlate BER vs area/power due to architectural optimization.

- To study the effect of quantization, fixed point versions of the four state and the eight state turbo codes are developed. The word length effects of various data elements in the design on BER can be observed using the fixed point model. This helps in fixing the bit-widths of the input parameters of the decoder to be used during the architectural optimization (bit-width analysis).

- To compare the results of different interleaver designs, modules for the random interleaver and the RC symmetric interleaver were developed. These versions were used to compare our results with the work of other researchers in the turbo coding area.

2.5 Simulation Results
Different parameters which affect the performance of turbo codes are considered. Some of these parameters are:

- The size of the input message (frame-length).
- The number of decoding iterations.
- The type of the encoder used (4 state vs 8 state encoder).
- The effect of interleaver design.
- Word-length optimization (quantization effects).

All our results presented in this section consider turbo codes over the additive white Gaussian noise (AWGN) channel.
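For reference, a minimal C sketch of an AWGN channel model as used in such BER simulations follows: the coded bits are BPSK modulated and Gaussian noise is added with a variance set by the requested Eb/No and the code rate (1/3 for the turbo code). The Box-Muller generator below is merely one convenient choice; the thesis does not specify the noise generator of its C model.

    #include <math.h>
    #include <stdlib.h>

    /* Standard normal sample via the Box-Muller transform. */
    static double gauss(void)
    {
        const double PI = 3.14159265358979323846;
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
    }

    /* Map coded bits x[i] in {0,1} to BPSK (+1/-1) and add noise whose
       variance matches the requested Eb/No (in dB) at the given code rate. */
    void awgn_channel(const int *x, double *y, int n,
                      double ebno_db, double rate)
    {
        double ebno  = pow(10.0, ebno_db / 10.0);
        double sigma = sqrt(1.0 / (2.0 * rate * ebno));  /* noise std dev */
        for (int i = 0; i < n; i++)
            y[i] = (2 * x[i] - 1) + sigma * gauss();
    }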

Figure 2.4: Impact on performance for different frame-lengths used, simulated for six iterations varying the SNR. (Plot: BER vs SNR Eb/No; curves for N = 128, 256, 1024, 2048, 4096.)

2.5.1 Effect of varying frame-length on code performance
The increase in the size of the input frame length has an impact on the interleaver size. When the size of the interleaver increases, it adds to the complexity of the design. It increases the decoding latency and the power consumption. Figure 2.4 depicts how the performance of the turbo codes depends on the frame-length N used in the encoder. It can be seen that there is a clear trend in the performance of the turbo codes as the frame-length increases: the codes exhibit better performance for larger frame-lengths. This is in accordance with other published literature. Since our goal is to develop a low power framework for this application specific design, the algorithmic study of the impact of varying N is essential. It assists in the power performance trade-off studies when the design is ported to an ASIC or FPGA solution. Next, the effect of varying the number of iterations in the turbo codes is considered.

2.5.2 Effect of varying number of iterations on performance
The effect of varying the number of iterations during the decoding process is an interesting observation in system level studies. Increasing the number of iterations should give a performance improvement, but at the expense of added complexity. The latency of obtaining the decision output after completion of the decoding will increase. This will also have an impact on the power consumption when the design is considered at the implementation level. As seen from Figure 2.5, going from one iteration, which is the case of no feedback, to three iterations, a substantial gain in SNR for a given BER is obtained. In our study it is inferred that around four to six iterations are sufficient in most cases. The number of iterations has been fixed at six in all our experiments, unless otherwise stated. These observations agree with the results from [63]. Feedback drives the decision progressively further in one particular direction, which is an improvement over the previous iteration. The extent of improvement slows down as the iterations increase.

[Plot: 3GPP, N = 1024, varying the number of iterations; BER versus SNR Eb/No (0 to 2.5 dB) for 1, 5 and 6 iterations.]

Figure 2.5: Impact on performance of varying the number of iterations, N = 1024, 8 state turbo code

2.5.3 4 state vs 8 state (3GPP) turbo decoder performance
Figure 2.6 shows the performance of the turbo decoder using the 4 state version and the 8 state (3GPP) version. For comparison, the 4 state floating point model and the 3GPP floating point model were simulated under the same AWGN channel conditions. The Third Generation Partnership Project (3GPP) has recommended turbo coding as the de facto standard in the 3G standards due to its better performance. In all of the simulated cases, the 8 state (3GPP) code outperforms the 4 state turbo code: the gain of 8 state with 6 iterations over 4 state with 6 iterations is almost one order of magnitude in BER for a 1024 bit interleaver. The better performance of the 8 state turbo code is primarily a result of its superior weight distribution. Furthermore, an 8 state PCCC with i iterations is approximately equivalent to a 4 state SCCC with 3i/2 iterations in terms of computational cost, gate count and power consumption [64].

2.5.4 Interleaver design considerations
The performance of the turbo codes is dependent on different parameters like the frame-length, the number of iterations, the selection of the encoder and the choice of interleaver. Interleaving is basically the rearrangement of the incoming bit stream in some order specified by the interleaver module.

Random interleavers. A random interleaver is based on a random permutation. For large values of N, most random interleavers utilized in turbo codes perform well, and as the interleaver block size increases the performance of a turbo code improves substantially. The method to generate a random interleaver is as follows (a code sketch is given after the list):
1. Generate an interleaver array of size N; the input to this array is the information bit stream of frame-length N.
2. Generate a random number array of size N.
3. Use the random number array as the index to rearrange the positions of the bits in the interleaver array, obtaining the new interleaved array.
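The procedure above amounts to applying a random permutation to the frame. The following minimal C sketch illustrates the idea with a Fisher-Yates shuffle; the identifiers are ours for illustration and do not reproduce the thesis C model.

#include <stdlib.h>

/* Permute a frame of n bits with a random interleaver: perm[] is
   filled with a random permutation of 0..n-1, and out[] receives
   the reordered frame. */
void random_interleave(const int *in, int *out, int *perm, int n)
{
    int i, j, tmp;
    for (i = 0; i < n; i++)
        perm[i] = i;
    for (i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle     */
        j = rand() % (i + 1);
        tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
    for (i = 0; i < n; i++)            /* rearrange the bit stream */
        out[i] = in[perm[i]];
}

The same perm[] array must be retained, since the decoder needs it for deinterleaving.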

[Plot: Comparison of 8 state (3GPP) and 4 state turbo code; BER versus SNR Eb/No (0 to 2.5 dB), 1024 bit interleaver, six iterations.]

Figure 2.6: Comparing the 4 state and the 8 state turbo code (floating point version), N = 1024, RC symmetric interleaver, six iterations.


To verify the turbo code functionality and to compare it with similar work conducted by other researchers, a turbo coding system using a random interleaver was also developed. This model forms the basis for comparison with the simulated performance of Bruce's work [61]. An interleaver size of N = 1024, 4 state turbo coding and six iterations are used in both cases in order to have a fair comparison. Our results tally with their model. The results are shown in Figure 2.7, with the bit error rate (BER) on the y axis and the signal-to-noise ratio in dB on the x axis.

RC symmetric interleaver. A symmetric interleaver is used in the remaining experiments of our work. The interleaver size is N, which is also the frame-length of the information that is handled. The interleaver is arranged in the form of a matrix with specified rows and columns; possible configurations are N = 128 with 4 rows and 32 columns, or N = 32 with 4 rows and 8 columns. The scrambling of the bit positions is done as follows, and is illustrated in Fig. 2.8. Conceptually, bits are written into a two dimensional matrix row-wise. Row interleaving is done first, followed by column interleaving, to obtain the interleaved data: the rows are shuffled according to a bit-reversal rule, and the elements within each row are then permuted by the same bit-reversal procedure. Deinterleaving is done in exactly the same way, since this is a symmetric interleaver. Consider the row interleaving when the number of rows is 4. The row numbers are R[0,1,2,3], with binary equivalents 00, 01, 10 and 11; after bit reversal these become 00, 10, 01 and 11, which is R[0,2,1,3]. The elements in rows zero and three are kept as they are, and the elements in rows one and two are interchanged. Before interleaving can commence it is essential that all the data elements are present. The advantage of using a symmetric interleaver is that the same interleaver can be used for deinterleaving also.

[Plot: Comparison of our work and Bruce's 4 state turbo code; BER versus SNR Eb/No (0 to 2.5 dB), 1024 bit random interleaver, six iterations.]

Figure 2.7: Random interleaver design: our work compared with Bruce's 4 state turbo code [61]

Step 1: Take the input message of 128 bits and arrange it in a matrix of 4 rows (Row 0 to Row 3) and 32 columns (Col 0 to Col 31).
Step 2: Perform the row interchange. The bit-reversal algorithm gives the row interchange pattern; e.g. Row 1 is binary '01', whose bit reversal is '10', i.e. Row 2, so the elements of Rows 1 and 2 are interchanged.
Step 3: Perform the column interleaving. After the row interleaving, the bit-reversal algorithm is applied to the column numbers in the same way.

Figure 2.8: Illustration of interleaving the data elements

A disadvantage of random interleavers is that they require different interleave and deinterleave units, with separate hardware and lookup tables. It is shown in [41] that although random interleavers give good performance, symmetric interleavers in no way degrade the performance of the turbo coding system. Our turbo encoder/decoder simulation is compared with the results published by other independent researchers. Our simulation results after six iterations are shown in Figure 2.9, together with the results simulated by Mark Ho et al. [41]. There is some difference in BER, which can be attributed to the different symmetric interleavers used in the two cases: in [41] an S-symmetric interleaver is used, whereas an RC-symmetric interleaver is used in our simulations.
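Since bit reversal is its own inverse, a single routine can serve for both interleaving and deinterleaving. A minimal C sketch for the N = 128 (4 x 32) configuration follows; the identifiers are ours for illustration.

/* Bit-reverse the lowest 'bits' bits of v. */
static unsigned bitrev(unsigned v, int bits)
{
    unsigned r = 0;
    int i;
    for (i = 0; i < bits; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

/* RC symmetric interleaver for N = 128: 4 rows (2 index bits) by
   32 columns (5 index bits); both indices are bit-reversed. */
void rc_interleave(const int *in, int *out)
{
    int r, c;
    for (r = 0; r < 4; r++)
        for (c = 0; c < 32; c++)
            out[bitrev(r, 2) * 32 + bitrev(c, 5)] = in[r * 32 + c];
}

For 4 rows this reproduces the pattern R[0,2,1,3] described above: rows 0 and 3 stay in place while rows 1 and 2 are interchanged.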

2.5.5 Effect of Quantization
Next, the effect of quantization on the performance of the turbo decoder is studied. Floating point computations are first converted to fixed point computations by the use of quantizers. Experiments are conducted to compare the performance of the floating point version of the turbo code with its fixed point version. Since it is planned to use this application further for architectural analysis, on-chip memory exploration and the FPGA implementation study, this step is a pre-requisite; furthermore, the fixed point implementation is needed to investigate the effect of word length on power consumption. The performance of the turbo code is simulated for a bit width of 32 (integer) and for 6 bit and 4 bit quantization. Numerical precision has an impact on the bit error rate of the turbo codes, as shown in Fig. 2.10. It can be observed that the floating point implementation has better performance than the fixed point implementation; however, floating point increases the complexity of the implementation compared to fixed point, so fixed point precision is the preferred choice. Within the fixed point implementation, it is interesting to observe the word length requirement of the input data to the decoder (the channel estimates).

[Plot: Comparison using symmetric interleaver, N = 1024; BER versus SNR Eb/No (0 to 1.6 dB) for our work and Mark Ho, 4 state, six iterations.]

Figure 2.9: Comparing our work with Mark Ho's [41] using a symmetric interleaver

[Plot: Effects of quantization on BER, N = 4096; BER versus SNR Eb/No (dB) for floating point, fixed point (32 bit), 6 bit and 4 bit quantization.]

Figure 2.10: Impact of word length variation on the bit error rate of the turbo codes

In the same Figure 2.10, a comparison of the various bit widths is depicted. The performance of the turbo codes with a bit width of six is very close to the fixed point (32 bit) performance; hence a bit width of six is an appropriate choice for the input data. Reducing the bit width from six to four, a degradation of the performance of the turbo codes is observed, indicating that a four bit width is not a good choice. These observations tally with other published literature on quantization effects [65].
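A quantizer of the kind used in such word-length studies can be sketched in a few lines of C; the step size and the symmetric saturation limits below are illustrative assumptions, not the exact parameters of our model.

/* Quantize a channel value x to a signed nbits-wide fixed point
   code with step size delta, rounding to nearest and saturating. */
int quantize(double x, int nbits, double delta)
{
    int qmax = (1 << (nbits - 1)) - 1;    /* e.g. +31 for 6 bits */
    int qmin = -(1 << (nbits - 1));       /* e.g. -32 for 6 bits */
    int q = (int)(x / delta + (x >= 0.0 ? 0.5 : -0.5));
    if (q > qmax) q = qmax;
    if (q < qmin) q = qmin;
    return q;
}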

2.6 Summary
In this chapter, some applications of turbo codes are first discussed to give an insight into the different areas of their utility. The applications are characterized and the operational details of the turbo coding application are studied. The algorithmic details of the turbo encoder and decoder are presented next, which enables one to understand the operational flow of the application. BER is the measure of the performance of the turbo codes for different signal to noise ratios, and the impact of the various parameters on the BER is studied. The experimental set-up and results are presented for the effect of the number of iterations and the effect of the frame length. The study of the effect of the type of encoder (4 state vs 8 state (3GPP)) shows that the 3GPP version performs better than the four state turbo decoder, as expected. Our results are compared with those published in the literature, and the effect of quantization on the BER is also studied. The C simulation models constructed are used to test the functional correctness.


Chapter 3 On chip Memory Exploration
3.1 Introduction
Processor-based information processing can be used conveniently in many embedded systems. Embedded systems themselves have been evolving in complexity over the years: from simple 4-bit and 8-bit processors in the 70's and 80's for simple controller applications, the trend is towards 16-bit as well as 32-bit processors for more complex applications involving audio and video processing. The detailed architecture of such systems is very much application driven, to meet power and performance requirements. In particular, this applies to the memory architecture, since memory has a major impact on the overall performance and the power consumption. In this chapter the effect of the different options that exist for the implementation of memory architectures is analyzed, and alternatives to on-chip cache realization are explored. The performance and energy consumption of scratch pad memory (SPM) vis-a-vis cache as the on-chip memory are compared. The motivation is that the memory requirements of application programs have increased, especially in digital signal processing, multimedia applications and wireless mobile applications. Memory also happens to be a major bottleneck, since a large memory implies slower access to a given memory element, higher power dissipation and more silicon area. A memory hierarchy consists of several levels of memory, where the higher levels have larger capacity and hence longer access times [66, 67]. The memory hierarchy must be taken into consideration in system level design to minimize the overall system cost [33]. The memory levels used in most processor architectures comprise the registers and L1 cache as on-chip memory, with L2 cache, main memory and secondary memory organized as off-chip memory. In many embedded applications, two levels of memory exist: on-chip memory and off-chip main memory.

In this chapter, two sets of results are presented, one on the DSPStone benchmarks and the other on the turbo decoder application, in particular on the performance and energy trade-off. The initial study conducted on the DSPStone benchmarks [68], using the methodology developed [69], reveals that SPM is an energy efficient choice when compared to traditional on-chip cache configurations. This work was done at the University of Dortmund, Germany, as part of a collaborative project.

Motivated by these results, the benefit of adding a small SPM to the cache configuration is analyzed using the design space exploration framework for the turbo decoder application. In this framework, suitable address filters separate the cache data and the SPM data based on the respective sizes of the cache and the SPM. The ARM7TDMI processor based design space exploration assists design decisions at the architectural level.

The rest of the chapter is organized as follows. Section 3.2 presents the previous work in this area. Section 3.3 details the motivational study, emphasizing the design methodology used; results for various applications are presented for the impact of cache vs SPM on performance, area and energy. Section 3.4 gives the details of the on-chip memory design space exploration for the turbo decoder application. Section 3.5 summarizes the chapter.

3.2 Related Work
Recently, interest has focused on having on-chip scratch pad memory to reduce power consumption and improve performance. Scratch pad memories are considerably more energy and area efficient than caches; on the other hand, they can replace caches only if they are supported by an effective compiler. The mapping of memory elements of the application benchmarks to the scratch pad memory can be done for data only [70], for instructions only [71], or for both data and instructions [72, 73]. Current embedded processors, particularly in the area of multimedia applications and graphics controllers, have on-chip scratch pad memories. Panda et al. [70] associate the use of scratch pad memory with data organization and demonstrate its effectiveness. Benini et al. [74] focus on memory synthesis for low power embedded systems; they map the frequently accessed locations onto a small application specific memory. Most of these works address only performance issues. Focusing on energy consumption, it is shown here that scratch pad memory can be used as a preferable design alternative to the cache for on-chip memory. To complete the trade-off, an area model for the cache and the scratch pad memory is developed, and it is concluded that scratch pad memory occupies less silicon area.

3.3 Scratch pad memory as a design alternative to cache memory
In this section, experiments are conducted on the DSPstone benchmarks using the methodology developed, to show that SPM is more energy efficient than traditional on-chip cache memory.

[Figure: cache organization showing the address input, decoders, tag array and data array with word lines and bit lines, column multiplexers, sense amplifiers, tag comparators, mux drivers and output drivers producing the valid (hit/miss) output and the data output.]

Figure 3.1: Cache memory organization [75]

3.3.1 Cache memory
The basic organization of the cache is taken from [75] and is shown in Fig. 3.1. The decoder first decodes the address and selects the appropriate row by driving one wordline in the data array and one wordline in the tag array. The information read from the tag array is compared to the tag bits of the address. The results of the comparisons are used to drive a valid (hit/miss) output as well as to drive the output multiplexers, which select the proper data from the data array and drive it out of the cache. The area model used in our work is based on the transistor count of the circuitry; all transistor counts are computed from the designs of the circuits. From the organization shown in Fig. 3.1, the area of the cache A_c is the sum of the areas occupied by the tag array (A_t) and the data array (A_d):

A_c = A_t + A_d    (3.1)

A_t = A_td + A_ta + A_tc + A_tp + A_ts + A_tcom + A_tmd    (3.2)

A_d = A_dd + A_da + A_dc + A_dp + A_ds + A_do    (3.3)

where A_td, A_ta, A_tc, A_tp, A_ts, A_tcom and A_tmd are the areas of the tag decoder unit, tag array, column multiplexer, pre-charge, sense amplifiers, tag comparators and multiplexer driver units respectively, and A_dd, A_da, A_dc, A_dp, A_ds and A_do are the areas of the data decoder unit, data array, column multiplexer, pre-charge, data sense amplifiers and output driver units respectively; each component area is in turn computed from the areas of its sub-circuits. The estimation of power can be done at different levels, from the transistor level to the architectural level [30]; in CACTI [75], transistor level power estimation is done. The energy consumption per access in a cache is the sum of the energy consumptions of all the components identified above. The cache is assumed to be a write through cache. Four cases of cache access are considered in our model; Table 3.1 lists the accesses required in each of the four possible cases, i.e. read hit, read miss, write hit and write miss.

Access type   Cache read   Cache write   Main memory read   Main memory write
Read hit          1             0                0                  0
Read miss         1             L                L                  0
Write hit         0             1                0                  1
Write miss        1             0                0                  1

Table 3.1: Cache memory interaction model (L denotes the cache line size in words)

Cache read hit: When the CPU requires some data, the tag array of the cache is accessed. If there is a cache read hit, the data is read from the cache; no write to the cache is done and main memory is not accessed for read or write (row 2 of Table 3.1).

Cache read miss: A read miss implies that the data is not in the cache and the line has to be brought from main memory into the cache. In this case there is a cache read operation, followed by L words written into the cache, where L is the line size; hence there is a main memory read event of size L and no main memory write (row 3 of Table 3.1).

Cache write hit: A write hit causes a cache write followed by a main memory write, as this is a write through cache (row 4 of Table 3.1).

Cache write miss: A write miss causes a cache tag read (to establish the miss) followed by a main memory write; there is no cache update in this case (row 5 of Table 3.1).

For simplicity, a tag read access is not distinguished from a cache read of tag and data. Using this model, the cache energy is derived as

E_cache = (N_cr + N_cw) × E_cache_access    (3.4)

where E_cache is the energy spent in the cache, N_cr is the number of cache read accesses, N_cw is the number of cache write accesses, and the energy per access E_cache_access is computed from the component energies (cf. equation (3.7)).
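The event counts of Table 3.1 map directly into such an energy computation. The following C sketch is illustrative only (the identifiers and per-access energy arguments are ours); it accumulates the cache and main memory events from the hit/miss statistics and applies equation (3.4).

typedef struct {
    long read_hit, read_miss, write_hit, write_miss;
} CacheStats;

/* Energy from hit/miss counts per Table 3.1; L is the line size in
   words, e_* are per-access energies of cache and main memory. */
double cache_energy(CacheStats s, long L,
                    double e_cache, double e_mm_rd, double e_mm_wr)
{
    long c_rd = s.read_hit + s.read_miss + s.write_miss; /* cache reads   */
    long c_wr = L * s.read_miss + s.write_hit;           /* cache writes  */
    long m_rd = L * s.read_miss;                         /* memory reads  */
    long m_wr = s.write_hit + s.write_miss;              /* memory writes */
    return (c_rd + c_wr) * e_cache                       /* eq. (3.4)     */
         + m_rd * e_mm_rd + m_wr * e_mm_wr;              /* main memory   */
}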

3.3.2 Scratch pad memory
The scratch pad is a memory array with the decoding and the column circuitry logic. This model is designed keeping in view that the memory objects are mapped to the scratch pad in the last stage of the compiler.

[Figure: (a) the scratch pad memory array, (b) a memory cell with its word select line and bit/bit_bar bit-lines, and (c) the six transistor static RAM cell.]

Figure 3.2: Scratch pad memory array

The assumption here is that the scratch pad memory occupies one distinct part of the memory address space, with the rest of the space occupied by main memory. Thus, the availability of data/instructions in the scratch pad need not be checked, which removes the comparator and the miss/hit signal acknowledging circuitry. This contributes to energy as well as area reduction. The scratch pad memory array is shown in Fig. 3.2(a) and the memory cell in Fig. 3.2(b). The six transistor static RAM cell [75, 2] is shown in Fig. 3.2(c); the cell has one R/W port, two bit-lines (bit and bit_bar) and one word-line. The complete scratch pad organization is shown in Fig. 3.3. From this organization, the area A_sp of the scratch pad is the sum of the areas occupied by the decoder, the data array and the column circuit:

A_sp = A_dd + A_da + A_dc + A_dp + A_ds + A_do    (3.5)

where A_dd, A_da, A_dc, A_dp, A_ds and A_do are the areas of the data decoder, data array, column multiplexer, pre-charge, data sense amplifiers and output driver units respectively. The scratch pad memory energy consumption can be estimated from the energy consumption of its components, i.e. the decoder (E_dec) and the memory columns (E_memcol).

[Figure: the decoder unit driving the memory array, followed by the column circuitry (sense amplifiers, column multiplexers, output drivers and pre-charge logic).]

Figure 3.3: Scratch pad memory organization

Energy in the memory array consists of the energy consumed in the sense amplifiers, the column multiplexers, the output driver circuitry, and the memory cells due to the word-line, pre-charge circuit and bit-line circuitry; the major energy consumption is due to the memory array unit. The procedure followed in the CACTI tool to estimate the energy consumption is to first compute the capacitances of each unit, from which power and energy are then estimated. As an example, the energy computation for the memory array is described; a similar analysis is performed for the decoder circuitry, taking into account the switching activity at the inputs of each stage. The scratch pad energy per access is

E_sp_access = E_dec + E_memcol    (3.6)

Let E_memcol be the energy dissipated in the memory columns, which consists of the energy dissipated in the memory cells:

E_memcol = p_t × C_memcol × V_dd^2    (3.7)

where C_memcol in equation (3.7) is the capacitance of the memory array unit and p_t is the bit toggle probability, which depends on the transitions of the data bit values; the probability 0.5 is taken as the average bit toggle value and corresponds to half the bits changing in any cycle [124]. C_memcol is computed from equation (3.8) as the sum of the capacitances due to pre-charge and read access to the scratch pad memory:

C_memcol = N_cols × (C_pre + C_rw)    (3.8)

where C_pre is the effective load capacitance of the bit-lines during pre-charging, C_rw is the effective load capacitance of the cell during read/write, and N_cols is the number of columns in the memory. In preparation for an access, the bit-lines are pre-charged, and during the actual read/write one side of the bit-lines is pulled down; energy is therefore dissipated in the bit-lines due to pre-charging and the read/write access. When the scratch pad memory is accessed, the address decoder first decodes the address bits to find the desired row. The transitions in the address bits cause charging and discharging of capacitances in the decoder path, which dissipates energy. The transition in the last stage, that is the word-line driver stage, triggers the switching of the word-line; regardless of how many address bits change, exactly two word-lines switch, one to logic 0 and the other to logic 1. The equations are derived based on [75].

In the case of a scratch pad, in contrast to a cache, there are no events due to write misses and read misses; the only cases that occur are read and write accesses. The total scratch pad energy therefore follows directly from the per-access energy:

E_sp = N_sp_access × E_sp_access    (3.9)

where E_sp is the total energy spent in the scratch pad memory, N_sp_access is the number of accesses to the scratch pad memory, and E_sp_access is the energy per access obtained from our analytical scratch pad model.
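Equations (3.6) to (3.9) combine into a single per-access computation. The C sketch below is illustrative; all parameter values (capacitances, supply voltage, decoder energy) are placeholders to be filled from the technology data, not measured results.

/* Total scratch pad energy per equations (3.6)-(3.9). */
double spm_energy(long n_access,    /* number of SPM accesses        */
                  double e_dec,     /* decoder energy per access     */
                  double c_pre,     /* bit-line pre-charge load cap. */
                  double c_rw,      /* cell read/write load cap.     */
                  long n_cols,      /* columns in the memory array   */
                  double vdd)       /* supply voltage                */
{
    double p_toggle = 0.5;                              /* average toggle */
    double c_memcol = n_cols * (c_pre + c_rw);          /* eq. (3.8)      */
    double e_memcol = p_toggle * c_memcol * vdd * vdd;  /* eq. (3.7)      */
    double e_access = e_dec + e_memcol;                 /* eq. (3.6)      */
    return n_access * e_access;                         /* eq. (3.9)      */
}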

3.3.3 Estimation of Energy and Performance

As the scratch pad is assumed to occupy part of the total memory address space, each access obtained from the trace analyzer is classified, by its address value, as going to the scratch pad or to main memory, and the appropriate latency is added to the overall program delay. One cycle is assumed for a scratch pad read or write access. For main memory, two access modes, 16 bit and 32 bit, are considered; the modes and latencies are based on the ARM processor with the Thumb instruction set used in our experimentation, since processors designed for energy sensitive applications provide multiple access modes. A main memory 16 bit access takes one cycle plus one wait state, and a main memory 32 bit access takes one cycle plus three wait states (refer to Table 3.2). The total time in number of clock cycles is used to determine the performance. The scratch pad energy consumption is the number of accesses multiplied by the energy per access, as described in equation (3.9).

Access                  Number of cycles
Cache                   Using Table 3.1
Scratch pad             1 cycle
Main memory, 16 bit     1 cycle + 1 wait state
Main memory, 32 bit     1 cycle + 3 wait states

Table 3.2: Memory access cycles
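The latencies of Table 3.2 turn the classified access counts into a cycle estimate. A minimal C sketch follows; the cache cycle count is assumed to have been derived separately from the events of Table 3.1, and the identifiers are ours.

/* Total cycles from the latencies of Table 3.2. */
long total_cycles(long cache_cycles, /* from Table 3.1 events       */
                  long spm_acc,      /* scratch pad accesses        */
                  long mm16_acc,     /* 16 bit main memory accesses */
                  long mm32_acc)     /* 32 bit main memory accesses */
{
    return cache_cycles
         + spm_acc  * 1          /* 1 cycle                 */
         + mm16_acc * (1 + 1)    /* 1 cycle + 1 wait state  */
         + mm32_acc * (1 + 3);   /* 1 cycle + 3 wait states */
}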

The special compiler support required for SPM address space allocation is to identify the data objects to be mapped to the scratch pad memory area. In order to map the data objects to the scratch pad memory, some profiling is performed initially; this helps to identify which of the array variables are suitable candidates for assignment to the scratch pad memory. On-chip memory components are typically parameterized; the SPM parameters include the scratch pad memory size and the feature size.


3.3.4 Overview of the methodology
Our overall methodology is based on estimating the energy, performance (number of cycles) and area requirements of different cache and SPM configurations. As all of these are critically based on the number of accesses to on-chip memory, it is necessary to identify the critical parts of the program and data which will be mapped to the SPM. Fig. 3.4 shows the flow diagram of the methodology used. The encc compiler [69] generates code for the ARM7 core. It is a research compiler used for exploring the design and new optimization techniques; its input is an application benchmark written in C. Constant propagation, copy propagation, dead code elimination, constant folding, jump optimization and common sub-expression elimination are supported as standard optimizations. The compiler is energy aware, supporting the selection of instructions based on their energy consumption; the energy data required for this is built using the instruction level power model of Steinke et al. [76]. As a post pass option, encc uses a special packing algorithm, known as the knapsack algorithm, for assigning code and data blocks to the scratch pad memory (a simplified sketch of this idea is given at the end of this subsection). The result is that blocks of instructions and data which are frequently accessed, and are likely to generate maximum energy savings, are assigned to the scratch pad memory. The output of the compiler is binary ARM code which can be simulated by the ARMulator to produce a trace file. For an on-chip cache configuration, the ARMulator accepts the cache size as a parameter and generates the performance as a number of cycles; the on-chip memory configuration affects the number of wait states for accessing a location, which are added during the post analysis of the trace file. The predicted area and energy for the cache are based on the CACTI [75] model for 0.5 μm technology; the same model, with the hardware pruned, is used to predict the area and energy of the scratch pad memory. The target architecture used in our experiments is the AT91M40400 microcontroller containing an ARM7TDMI core [77]. In general, software power estimation for a pipelined processor has to consider both an instruction level power model and inter-instruction effects [105].

[Figure: the application and the on-chip memory configuration feed encc (which includes the scratch pad option); the resulting binary code is run on the ARMulator to produce a trace file; trace analysis yields the number of accesses and the performance estimates, while CACTI supplies the energy per access and the area estimates, from which the energy computation produces the energy estimates.]

Figure 3.4: Flow diagram for on-chip memory evaluation

More recent work has shown that the variation in power consumption due to inter-instruction effects is very small [127], at least for RISC processors. On the other hand, the contribution of memory to the overall power/energy consumption is universally accepted to be very significant. The emphasis in our work has therefore been on memory power consumption rather than instruction power consumption, although for the work on the ARM7TDMI an energy aware compiler driven by an instruction level power model was used.
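The post pass assignment performed by encc can be viewed abstractly as a 0/1 knapsack: among the profiled blocks, select those that fit in the scratch pad and maximize the estimated energy saving. The following C sketch is our simplified illustration of this idea, not the actual encc implementation; all identifiers are ours.

#include <string.h>

#define MAX_CAP 2048   /* assumed upper bound on SPM capacity, bytes */

/* 0/1 knapsack: size[] in bytes, saving[] in energy units; returns
   the maximum total saving achievable within 'cap' bytes of SPM
   (cap must not exceed MAX_CAP). */
long spm_knapsack(const int *size, const long *saving,
                  int nblocks, int cap)
{
    static long best[MAX_CAP + 1];  /* best saving per capacity */
    int b, c;
    memset(best, 0, sizeof(best));
    for (b = 0; b < nblocks; b++)
        for (c = cap; c >= size[b]; c--)   /* classic 0/1 DP */
            if (best[c - size[b]] + saving[b] > best[c])
                best[c] = best[c - size[b]] + saving[b];
    return best[cap];
}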

3.3.5 Results and discussion
To compare the use of scratch pad memory and caches, a series of experiments for both of these configurations is conducted. The trace analysis for the scratch pad and the cache is done in the design flow after the compilation phase. A 2-way set associative cache configuration is used for comparison. From the trace file it is possible to obtain the number of cache read hits, read misses, write hits and write misses; from this data the number of accesses to the cache is computed based on Table 3.1, where the number of cycles required for each type of access is listed in Table 3.2. The area is represented in terms of the number of transistors, obtained from the cache and scratch pad organizations. Fig. 3.5 shows the comparison of the area of cache and scratch pad memory for varying sizes. It is found, on average, that the area occupied by the scratch pad is 34% less than that of a cache memory of the same size.

Table 3.3 gives the area/performance trade-off for the bubble-sort example. Column 1 is the size of the scratch pad or cache in bytes. Columns 2 and 3 are the cache and scratch pad areas in transistors. Columns 4 and 5 are the number of CPU cycles (in thousands) for the cache based and scratch pad based memory systems respectively. Column 6 gives the area reduction due to replacing a cache by a scratch pad memory, while column 7 corresponds to the reduction in the number of cycles. Column 8 gives the improvement in the area-time product AT (assuming constant clock cycle periods).

[Plot: area in number of transistors versus size of cache/scratch pad in bytes (64 to 2048), for cache based and scratch pad based memory.]

Figure 3.5: Comparison of cache and scratch pad memory area

Size     Area     Area         CPU cycles   CPU cycles    Area       Time       Area-time
(bytes)  cache    scratch pad  cache        scratch pad   reduction  reduction  product
64       6744     4032         481.9        347.5         0.40       0.28       0.44
128      11238    7104         302.4        239.9         0.37       0.21       0.51
256      21586    14306        264.0        237.9         0.34       0.10       0.55
512      38630    26722        242.6        237.9         0.31       0.10       0.61
1024     74680    53444        241.7        192.0         0.28       0.21       0.55
2048     142224   102852       241.5        192.0         0.28       0.20       0.57
Average                                                   0.33       0.18       0.54

Table 3.3: Area and performance gains for the bubble-sort example

The area-time product AT is computed as

AT = Area × Time    (3.10)

The average area, time and AT product reductions are 34%, 18% and 46% respectively for this example. The cycle count considerations in the performance evaluation are based on the static RAM chips found on the ATMEL AT91 board with the AT91M40400 microcontroller [78, 79]. To compare the energy, the energy consumption of the main memory must also be accounted for. The energy required per access by the various devices is listed in Table 3.4. The cache and scratch pad values for a size of 2048 bytes were obtained from the energy models for the cache and SPM, whereas the main memory values were obtained from actual measurements on the ATMEL board [72]. The energy consumed in the main memory along with the energy consumed in the on-chip memory was taken into account for these calculations. Fig. 3.6 shows the energy consumed by the memory hierarchy for the biquad, matrix-mult and quicksort examples, for both cache and scratch pad. In all cases it is observed that the scratch pad consumes less energy than a cache of the same size, except for quicksort with a cache size of 256 bytes. On average, it was found that the energy consumption is reduced by 40% using scratch pad memory.

[Plot: energy in nJ versus on-chip memory size in bytes (0 to 2200), for scratch pad and cache, shown for the biquad, matrix mult and quick sort examples.]

Figure 3.6: Energy consumed by the memory system

Module                               Energy per access
Cache (2 kbytes)                     4.57 nJ
Scratch pad (2 kbytes)               1.53 nJ
Main memory read access, 2 bytes     24.00 nJ
Main memory read access, 4 bytes     49.30 nJ
Main memory write access, 4 bytes    41.10 nJ

Table 3.4: Energy per access for various modules

The estimation of the number of clock cycles is based on the ARMulator trace output for cache or scratch pad memory. This is assumed to directly reflect performance, i.e. the larger the number of clock cycles, the lower the performance. This is under the assumption that a change in the on-chip memory configuration (cache/scratch pad memory and its size) does not change the clock period, which is strictly valid only when the on-chip memory is not in the critical path. Though restrictive, this assumption does not weaken our results, because a cache is always compared with a scratch pad memory of the same size, and the delay of a cache implemented in the same technology will always be higher. In effect, even if this assumption is invalid, the performance gains for SPM will only increase vis-a-vis the cache.

3.4 Turbo decoder : Exploring on-chip design space
The objective here is to explore the benefits of combining cache and SPM to form the on-chip memory for our application specific design, namely the turbo decoder.

3.4.1 Overview of the approach
The overall framework of the approach is depicted in Figure 3.7. The ARM tool suite is a collection of programs that emulates the instruction set and the architecture of the chosen target processor; it provides an environment for the development of ARM targeted software [77] and is instruction accurate. The C algorithm of the turbo decoder is fed as input to the ARM compiler armcc, which generates the executable image. This is given to the ARM debugger armsd to obtain the trace output of the application under consideration. The trace forms the input to the address filter section, wherein the cache and SPM accesses are separated based on the SPM address map information. The cache trace, when fed to the Dinero IV simulator, gives the cache access statistics in terms of cache read hits, cache read misses, cache write hits and cache write misses. The ARM profiler is used to profile the application. A related problem that arises in this context is to identify the data arrays in an application for storage in the on-chip memory, namely the scratch pad memory area. A suitable region of the address space is designated for this purpose, its size being the size of the scratch pad memory under consideration.

[Figure: the application passes through the ARM tool suite to produce a trace; the trace analysis and access simulator, given an on-chip memory configuration (cache + SPM), splits the trace into cache accesses and SPM accesses, which feed the performance, energy and area estimator tool suite (on-chip cache and SPM) to produce the area, performance and energy estimates.]

Figure 3.7: Framework of the design space exploration

For the various configurations of cache and SPM, the trace analysis is done. Using the results of the cache and scratch pad accesses, the performance is predicted. The number of accesses to the scratch pad and cache obtained in this step is also used to calculate the total energy spent. The access statistics are fed to the performance, area and energy estimator tool suite. As already described in section 3.3, it consists of the CACTI tool for per access energy estimates of the cache, with pruned values for the SPM per access energy. The area estimator gives the area of the SPM and the cache in terms of the number of transistors.
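The address filter itself reduces to a range check per trace entry. The following C sketch is illustrative; the SPM base address is an assumption, since the actual address map depends on the platform configuration.

#define SPM_BASE 0x00400000UL   /* assumed start of the SPM region */

/* Classify one trace address as an SPM or a cache-path access. */
void filter_address(unsigned long addr, unsigned long spm_size,
                    long *spm_acc, long *cache_acc)
{
    if (addr >= SPM_BASE && addr < SPM_BASE + spm_size)
        (*spm_acc)++;        /* served by the scratch pad        */
    else
        (*cache_acc)++;      /* forwarded to the cache simulator */
}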

3.4.2 Experimental setup
Design space exploration is done for various combinations of cache and scratch pad memory, to explore the on-chip memory design space.
- The target architecture used is the ARM7TDMI processor.
- The ARMulator is used for simulation and analysis.
- The Dinero IV simulator is used to obtain cache access statistics for the various configurations.
- Various supporting modules are developed as Perl scripts to carry out the experiments: the arm2dinero module converts the trace generated by the ARM simulator into the format required by the Dinero IV cache simulator; the Access Frequency module performs the trace analysis and finds the most frequently accessed data addresses; and the Address Filter module separates the trace into cache memory accesses and scratch pad memory accesses depending on the address.
The experimentation is done to observe the performance, energy and area impact of the various configurations of cache and scratch pad memory. The cache sizes taken are

1 kbyte, 2 kbytes, 4 kbytes, 8 kbytes, 16 kbytes and 32 kbytes. For each cache configuration, scratch pad sizes of 512 bytes, 1 kbyte, 2 kbytes, 4 kbytes, 8 kbytes and 16 kbytes are taken. For example, 1k 512 represents a 1 kbyte cache with a 512 byte scratch pad memory, while 1k only represents a 1 kbyte cache with no scratch pad memory.

3.4.3 Results
The results for various combinations of cache and SPM are presented. Cache misses lead to degradation in performance and energy consumption. The total energy consumed in the cache is the product of the number of cache accesses and the energy per access for a given cache configuration, where the number of cache accesses is the sum of the cache read accesses and cache write accesses of the application under consideration. Up to a certain limit, increasing the cache size is one way to decrease the energy consumed; for this application the 16 kbyte cache configuration without SPM is optimal from the energy point of view (column 2 of Table 3.5). The turbo decoder application is used to explore the benefits of adding SPM to the on-chip cache. The parameter of the turbo decoder application of interest to us is the frame length N, which is the number of information message bits. The 3GPP version of the turbo decoder application is used for the design space exploration, with the frame length N set to 128 for these observations. The baseline configuration consists of only cache as on-chip memory, with no SPM. The cache configuration parameters are 2-way set associative using 0.5 μm technology. The total design space is composed of 42 configurations of the on-chip memory. The benefit of adding 512 bytes of SPM to various cache sizes is studied. The frame size N is 128 in the turbo decoder application, with each data element of four bytes; hence one array can be mapped to 512 bytes of SPM address space. The most frequently accessed data is identified and mapped to the SPM, which proves beneficial in reducing the energy. The observations are shown in Table 3.5.

Config      No SPM     SPM 512    SPM 1K     SPM 2K     SPM 4K     SPM 8K     SPM 16K
Cache 1K    22876489   11133935   9668092    8769464    6893609    3648375    2251843
Cache 2K    17385507   9560997    8251137    7433051    5813424    2029112    1632303
Cache 4K    12147691   7420852    6160668    5435109    3572995    1824218    1567899
Cache 8K    6741607    4159790    3484257    3302291    2540500    1676234    1513698
Cache 16K   5594543    2782712    2619130    2505375    2309164    1699956    1702447
Cache 32K   6115935    2868704    2710687    2610920    2415369    2072486    2082507

Table 3.5: Energy estimates in nJ for various cache and SPM combinations

When the memory accesses and energy estimates are compared with the baseline configuration of only cache as on-chip memory, there is considerable saving in energy when 512 bytes of SPM are added. The total energy comprises the sum of the cache energy, main memory energy and SPM energy. The energy improvement from adding 512 bytes of scratch pad to a 1 kbyte cache is around 51.33% for the chosen application specific design. Every entry of Table 3.5 refers to a specific on-chip configuration, i.e. a specific cache size and a specific SPM size. The energy consumption across columns (and in fact along many rows) starts from a high value and dips down before rising again, implying that an intermediate configuration is optimal; for example, in column 7 the minimal energy configuration is 8 kbytes of SPM with 8 kbytes of cache. This is because, as the cache size is increased (starting from 1 kbyte), the increase in energy per access is at first more than compensated by the reduction in energy due to reduced main memory accesses; beyond a certain cache size the situation changes, and although the number of cache accesses increases and the main memory accesses decrease, this is overshadowed by the increase in energy per access. Consider the following equations.

E_total = E_cache + E_sp + E_mm    (3.11)

where E_total, E_cache, E_sp and E_mm refer to the energy consumed by the total memory system, the cache, the scratch pad and the main memory respectively. Each term is the product of a number of accesses and an energy per access:

E_total = N_cache × e_cache + N_sp × e_sp + N_mm × e_mm    (3.12)

where N_cache, N_sp and N_mm are the number of accesses to the cache, SPM and main memory respectively, and e_cache, e_sp and e_mm are the energies per access for the cache, SPM and main memory respectively. In any column of Table 3.5 the SPM size, and hence e_sp (per access), is fixed. As the cache size increases, e_cache (per access) increases while the number of main memory accesses decreases; the relative variation of these terms results in the non-monotonic behavior.
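In code form, each entry of Table 3.5 is one evaluation of equation (3.12). The sketch below is illustrative; the access counts and per-access energies are inputs obtained from the trace analysis and the energy models.

/* Total memory energy for one cache/SPM configuration, eq. (3.12). */
double config_energy(long n_cache, long n_spm, long n_mm,
                     double e_cache, double e_spm, double e_mm)
{
    return n_cache * e_cache     /* cache energy       */
         + n_spm   * e_spm       /* scratch pad energy */
         + n_mm    * e_mm;       /* main memory energy */
}

Evaluating this over all 42 configurations and keeping the minimum reproduces the kind of search summarized in Table 3.5.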

An increase in the energy estimates is observed when the cache size is increased to 32 kbytes with any combination of SPM. From the on-chip memory requirement of the turbo decoder application and the energy estimates, the optimal choice is an 8 kbyte cache with 16 kbytes of SPM. This framework assists the designer in choosing an appropriate energy efficient on-chip memory combination. To obtain the same decrease in energy without SPM, the cache size would have to be increased four times (i.e. to 4 kbytes). The area trade-off is observed using the area model developed for the cache and the SPM in section 3.3: the area increase from a 1 kbyte to a 4 kbyte cache configuration is 3.7 times, whereas the area increase is only 1.35 times for a 1 kbyte cache with 512 bytes of SPM, with a substantial energy improvement. Figure 3.8 shows the energy estimates of Table 3.5 as a bar graph for the various cache and SPM configurations of the on-chip memory. Using the energy trade-off results, the number of configurations can be narrowed down to a small set from which the optimal solution is picked, considering three factors, viz. energy cost, performance and area penalty; the chosen configurations can then be mapped to a low level implementation for further functional verification. As the frame length increases, the BER of the turbo decoder improves, as shown in the previous chapter. The impact on energy of increasing the frame length of the turbo decoder is studied by extending the experimental set-up from N = 128 to N = 256.
[Bar graph]

Figure 3.8: Energy estimates for the turbo decoder with N = 128

[Bar graph]

Figure 3.9: Energy estimates for the turbo decoder with N = 256

N     BER (SNR 1 dB)    Optimal configuration (cache with SPM)    Memory energy (nJ)
128   0.00854           4K with 8K                                1567899
256   0.00191           4K with 16K                               2636747

Table 3.6: BER, optimal configuration and memory energy of the turbo decoder for N = 128 and N = 256

Since the frame length is now 256, with each array data element being 4 bytes, the minimum SPM size essential to map at least one complete array is 1 kbyte; hence the 512 byte SPM is not used in this case. When the energy consumed for N = 128 and N = 256 is compared, an increase in energy is observed. The ratio is not the same for the different configurations of on-chip memory; it varies from 1.2 to around 2.7, the average increase being 1.4 over all the simulated combinations of cache and SPM. This energy increase can be traded off against the BER performance with varying N documented in Chapter 2. Table 3.6 shows the BER, optimal configurations and memory energy for N = 128 and N = 256. It is observed that as N increases, the BER improves and the memory energy increases by 40.5%. The performance estimates are computed using the scratch pad access model, cache access model and main memory access model described in the previous section. Figure 3.10 shows the performance estimates for the baseline configuration and the various combinations of cache and SPM. Observing the performance estimates, it is seen that beyond the 4 kbyte cache and SPM configurations the curves flatten and there is little variation in the number of cycles. This indicates that there is insignificant benefit in increasing the cache size and the SPM size beyond certain values; in this application a cache size of 2 or 4 kbytes with an SPM of 8 kbytes appears to be quite efficient.

[Bar graph]

Figure 3.10: Performance estimates for the turbo decoder

3.5 Summary
A methodology to evaluate the area, performance and energy consumption of various on-chip memory configurations comprising cache and scratch pad memories is presented. Further, these measures can be related to application specific measures such as the BER in the case of the turbo decoder. Such a comparison helps the designer make effective decisions at the architectural level. The comparison is carried out by evaluating performance through simulation and estimating area and energy using the scratch pad and cache memory models over a range of scratch pad and cache sizes. The results clearly show that scratch pad memory is a promising alternative, or addition, to caches in many embedded system applications. With the methodology developed, the design space for one specific application, namely the turbo decoder, is explored. By adding a very small 512 byte on-chip scratch pad to a 1 kbyte cache, the energy improvement for memory accesses is 51.33%; the area increase is only 1.35 times for this configuration, indicating that with a minimal area penalty a significant energy improvement is achieved without performance degradation. A range of design options is explored, and it is observed that a small cache (2 or 4 kbytes) coupled with 8 kbytes of SPM is the optimal choice. For the turbo decoder application, one can trade off the BER against energy when the frame length N is varied: an improvement in the BER at the cost of a 40.5% increase in memory energy is observed when N is increased from 128 to 256.


Chapter 4 System Level Modeling
4.1 Introduction
System level modeling is the phase of design aimed at architecture exploration before the actual HDL implementation. The availability of system level modeling and implementation techniques helps to ensure functional correctness as well as to explore the design space. In our case the modeling of the turbo coding application is based on the SystemC design environment. The memory access patterns in the application are studied; the data flow analysis as well as the synchronization requirements between the different modules form the basis for identifying parallelism in the algorithm. This drives the hardware solution of the turbo encoder and decoder design at the system level in the design methodology. By identifying the access pattern of the computationally intensive routines, it is proposed to use FIFO/LIFO buffers to store the intermediate results in the four state turbo decoder. One of the computational units, namely the β unit, is modified, leading to a reduction in the number of memory accesses and hence savings in energy consumption. The parallelism in the algorithm is explored and the design is implemented in SystemC, which helps to validate the modularity, functionality and architectural features of the design. To test the implementation correctness, the turbo encoder and one four state turbo decoder are synthesized onto the Xilinx Virtex device V800BG560-6. Amongst the computational units, it is found that the log likelihood ratio (llr) unit occupies a significant share of the individual decoder area.

The rest of the chapter is organized as follows. Section 4.2 presents the related work. Sections 4.3 and 4.4 give a brief introduction to the work flow and to the functionality of the turbo codes. The design implementation issues using SystemC are then explained and the data access pattern analysis is presented; on-chip memory evaluation for the application specific design is considered, followed by a description of the benefit of modifying the backward computation unit to reduce the memory accesses, in section 4.5. Finally, the synthesis results obtained using the Xilinx tools are presented and the chapter is concluded.

4.2 Related work
In [80] a low cost architecture of the turbo decoder is presented; the reduction of the implementation cost centers on the amount of RAM memory allocated. Dielissen et al. [81] present a design based on a layered processing architecture: a power efficient layered turbo decoder processor. To resolve the problem of multiple memory accesses per cycle in an efficient parallel architecture, a two-level hierarchical interleaver architecture is proposed in [82]. A low power design of a turbo decoder based on the log-MAP algorithm is given in [83]; this turbo decoder has two component log-MAP decoders which perform the decoding process alternately. VLSI architectures of the turbo decoder are presented in [84]. Prior studies into turbo decoder memory issues for storing the computed metrics have been reported and experiments conducted [85, 86, 87, 88, 89]. However, the impact of modifying the backward metric unit has not been addressed at the architectural level in any of these architectures. In doing so, the number of memory accesses is decreased, which effectively reduces the energy consumption. Further, the issues of synchronization and data access pattern analysis for the turbo decoder application are considered in our design.

4.3 Work flow description
In this section the approach used for the design of the turbo decoder is explained.
- An application developed in C to test the turbo decoder performance is used to get an insight into the data access pattern required by the algorithm.
- The data access pattern analysis is done to evaluate the type of on-chip memory suitable for this particular application.
- An improved design of the turbo decoding system is proposed, using buffers as the storage units for the intermediate metrics that are generated.
- The application is studied at the behavioral level using SystemC. The objective is to identify the available modularity, exploiting the parallel activities in the algorithm, to explore the decoder architecture design space.
- A SystemC model is developed for the four state turbo decoder with the modified backward metric unit. This modification helps in reducing the memory accesses while preserving the functionality of the encoder and the decoder. The SystemC model assists in evaluating the on-chip memory storage necessary for the application.
- As a final step the application is synthesized to evaluate the area and frequency. Synthesis is done using the Xilinx Virtex V800BG560-6 device.

4.4 Functionality of turbo codes
A C language software version is developed to verify the functional correctness of the design and to observe the performance measure of the decoder, the bit error rate (BER). Chapter 2 describes the application in detail. The turbo coding system is depicted in Figure 4.1. A brief summary of the steps involved in developing the model is presented below.

[Figure: the input message and its interleaved version feed Encoder1 and Encoder2; the concatenated encoded stream passes through the AWGN channel (channel noise added); the demultiplexer separates ys, yp1 and yp2, which feed the iterative decoder (Decoder1 and Decoder2 exchanging a priori information through interleavers) and the decision unit that outputs the retrieved message.]

Figure 4.1: The turbo coding system

1. Generate the message to be transmitted through the channel, a binary data frame of size N; this is done using a random number generator.
2. The two encoders depicted in Figure 4.1 encode the message into a bit stream and transmit it through the channel. The input to encoder1 is the message itself, and the input to encoder2 is the message interleaved by a symmetric interleaver of size N. The encoded data is concatenated and the (1, 0) channel symbols are mapped onto an antipodal baseband signal (+1, -1), producing the transmitted channel symbols.
3. Add AWGN noise to these transmitted channel symbols to obtain the noisy receiver input (a code sketch of steps 2 and 3 follows this list).
4. Separate the received symbols into the noisy systematic data ys, the noisy parity bits yp1 given to decoder1 (Figure 4.1), and the noisy parity bits yp2 which form the input to decoder2.
5. Perform MAP decoding for six iterations to retrieve the transmitted message. The bit error rate is computed to obtain the performance of the turbo decoder for varying signal-to-noise ratios.
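Steps 2 and 3 (symbol mapping and channel noise) can be captured in a few lines of C. The sketch below uses the Box-Muller transform for the Gaussian noise; the identifiers are ours for illustration and do not reproduce the thesis model.

#include <math.h>
#include <stdlib.h>

/* One zero-mean Gaussian sample of standard deviation sigma. */
double awgn(double sigma)
{
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);  /* in (0,1) */
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sigma * sqrt(-2.0 * log(u1)) * cos(6.28318530718 * u2);
}

/* Map encoded bits (1, 0) to antipodal symbols (+1, -1) and add
   channel noise to produce the noisy receiver input. */
void channel(const int *bits, double *rx, int n, double sigma)
{
    int i;
    for (i = 0; i < n; i++)
        rx[i] = (bits[i] ? 1.0 : -1.0) + awgn(sigma);
}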

4.5 Design and Analysis using SystemC specification
In this section the design specification of the turbo coding design is explained using the system modeling language SystemC. SystemC can be used to represent any mix of hardware and software. The potential strengths of SystemC are its high level of abstraction, modularity, reusability and flexibility. It provides the designer a fast and easy mechanism for capturing the system functionality, and the associated simulation environment is useful not only in verifying the captured functionality but also in exploring the architectural design space.

[Figure: per-cycle index values k for the alpha computation (inputs α[k−1] and γ[k], output α[k], for k = 1 ... 127), the beta computation (inputs β[N−k] and γ[N−k], output β[N−k−1]) and the llr computation (inputs α[k], β[k+1] and γ[k+1], output llr[k], for k = 127 ... 1).]

Figure 4.2: Input data with the index values

Apart from establishing functional correctness at the algorithmic level, this approach provides an easy transition from the C language to the SystemC design development environment. The modularity of the design is clearly defined at an early stage of the design phase and offers flexibility, irrespective of the technology in which the design is to be implemented. Specifications in terms of the signal flow between the modules and the definition of the architecture can be viewed in the behavioral level description using SystemC. Another significant advantage is that the same test bench can be shared at the various levels of description with SystemC.

4.5.1 Turbo decoder : Data access analysis
One of the analyses that can be performed at this level relates to the data access pattern. In the case of the turbo decoder, it is shown how LIFO/FIFO buffers can suitably serve the data storage and transfer requirements of the algorithm. The turbo decoder design has a complex data flow between the computational units during the iterative decoding process.

...
t = alpha00[k] + beta10[k+1] + gamma11[k+1];
if (numerator < t)
    numerator = t;           /* numerator = max(numerator, t)     */
b = alpha00[k] + beta00[k+1] - gamma11[k+1];
if (denominator < b)
    denominator = b;         /* denominator = max(denominator, b) */
...
llr[k] = numerator - denominator;

Figure 4.3: Code segment to compute llr

Looking for design attributes, like the data access pattern, and then identifying the type of storage modules suitable for the implementation, may lead to cost effective solutions in terms of latency, area and power. This is similar to examining scratch pad memory as an alternative to cache memory in terms of area and energy consumption [90, 91, 70]. Both cache and scratch pad memories are suitable for supporting data accesses which are random in nature. In the turbo decoding application, on the other hand, the data access has a serial pattern in some pre-defined order; this maps naturally to a FIFO/LIFO data organization.

Figure 4.2 depicts the index values of the major computational tasks in each decoder stage for a frame length N = 128. Consider the β computation index values in column 4. The α[k] computation requires α[k−1] and the branch metric γ[k] as its inputs; the β[N−k−1] computation needs β[N−k] and γ[N−k] as inputs; and llr[k] takes α[k], β[k+1] and γ[k+1] as its input operands. In computing the llr, the α, β and γ values are to be obtained in a specific order; hence the necessity of storing these metrics is seen. The tasks which can run concurrently according to the decoding algorithm are also to be considered when making a decision about the storage units. Basically, this parallelism is exploited during the memory storage optimization.

The parallelism in the turbo decoder is exploited with the intention of reducing the decoding time during the iterative decoding process. If parallelism were not exploited, the β unit computations would commence only after the α unit computations. Exploiting this algorithmic level behavior of the turbo decoder requires no extra resources, since the hardware for the α unit and the β unit are both required in the decoding process anyway; hence this does not result in extra power usage. Also, the γ values are computed on the fly, so no intermediate storage is required for the results of the γ computation module; this reduces the power dissipation. A simplified β unit results in reduced power dissipation, as shown later in Chapter 6.

4.5.2 Decoder : Synchronization
Figure 4.3 shows the code segment to compute the log likelihood ratio llr. This depicts the software computation module of equation 2.4 (section 2.3.1) illustrated in Chapter 2. The natural log computation is translated to a maximization (comparison) operation since the MAX-Log-MAP turbo decoder algorithm is used [38]; hence there are no logarithms. Consider the computation of the llr unit of Figure 4.3 where llr[k] is to be evaluated for k = 124: α[124], β[125] and γ[125] are the input operands, so the llr unit requires β values for index k = 125 while the β operation for index k = 124 is being performed. Both llr and β cannot compute at the same index values because of the data dependency of the llr unit on the β unit. Hence, in order to have the proper index values passed through the β unit, it is essential to synchronize the events. This synchronization determines when the llr unit should be provided the start signal and what specific delays are involved in the β unit itself to provide the correct data values. The issue of synchronization arises when the β values and the llr values are being calculated. In this context it is obvious that synchronization plays a vital role and becomes one of the design issues, which can also be analysed at the system level using SystemC.

(Figure 4.4 plots, against time, the signals SystemC\Clock, SystemC\Gamma_begins, SystemC\Alpha_begins, SystemC\Beta_starts, SystemC\Beta_asks_gamma and SystemC\LLR_starts.)

Figure 4.4: Synchronization signals illustrating the parallel events

The timing diagram of these synchronization signals is illustrated in Fig. 4.4. The first signal is the system clock; the second indicates the start and completion of the γ operations; the third is the α computation, which runs in parallel with the γ unit; the fourth is the start and end of the β operations; the fifth is when the β unit requests the γ values; and the last is when the llr unit starts. During SystemC simulation, each computation step can be observed through the signal graph, which assists in verifying the functionality step by step during the architecture exploration phase. The signal graph also brings out the activities of the design which run in parallel during the computation. Thus, during the architecture definition stage at the behavioral level, it is possible to test the functionality and modularity of the design.

4.6 Choice of on-chip memory for the application
The choice of memory type is based on how the various modules of the application interact. This determines when to send read requests to the memory and when to give the write enable signal to the memory. The read and write patterns of the computation modules, namely the interleaver, alpha, gamma and extrinsic modules, are observed.

 

Considering reading or writing of N address locations, there are basically three possible patterns. If the data writing order and the reading order are the reverse of each other (say 0 to N for writing and N to 0 for reading), then a LIFO buffer is the choice of on-chip memory. If the reading order and the writing order are identical (say both from 0 to N), then a FIFO buffer is the choice. If the data writing and reading orders are not strongly correlated, then a random access memory (like cache or scratch pad memory) is the choice. The data access study of the turbo coding algorithm revealed a deterministic regular pattern throughout the design. For example, the FIFO buffer defined (using SystemC) for storing the intermediate data elements has a single clock port for both data-read and data-write operation; FULL and EMPTY signals on the buffer ensure that the corresponding writing and reading modules do not overrun each other. The type of storage required just before interleaving in the encoder section is a FIFO buffer, because the data reading and writing orders are identical.

The sequence of computation used in MAP decoding is as follows. As soon as the noisy encoded input bit stream is received, the first computation to be performed is the branch metric γ in the decoder1 (Figure 4.5). As the encoded data bit stream enters the decoder section, a FIFO buffer is required to store the demultiplexed systematic bits ys and the parity bits from the second encoder yp2. The α computation can commence immediately after the first value of γ is computed, since this forms the input operand to compute the first value of α. Thus the γ and α units run in parallel. Observing the data access pattern described, the order of writing the α and γ metrics by their respective computation units, and the order of reading these metrics by the β and the llr unit, a LIFO buffer will satisfy the storage requirement.
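A short sketch of the buffer-choice rule just stated (the frame size and data are illustrative): data written in order 0..N−1 and read back in order N−1..0 only needs a LIFO (stack), while data read back in the same order only needs a FIFO (queue).

    // Demonstrates the access-order rule behind the LIFO/FIFO choice.
    #include <stack>
    #include <queue>
    #include <cstdio>

    int main() {
        const int N = 8;
        std::stack<int> lifo;  // e.g. alpha metrics: written forward, read backward
        std::queue<int> fifo;  // e.g. systematic bits before interleaving
        for (int k = 0; k < N; ++k) { lifo.push(k); fifo.push(k); }
        for (int k = 0; k < N; ++k) {
            std::printf("LIFO %d  FIFO %d\n", lifo.top(), fifo.front());
            lifo.pop(); fifo.pop();
        }
        return 0;
    }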

(Figure 4.5 shows the Gamma Unit (inputs z2, yp1 and ys_int; outputs γ10, γ11) feeding the Alpha Unit and the Gamma buffer; the Alpha Unit (outputs α00, α01, α10, α11) feeding the Alpha buffer; and the Beta unit (outputs β00, β01, β10, β11) together with the Alpha and Gamma buffers feeding the LLR unit through the α_req and γ_req request signals.)

Figure 4.5: Data flow and request signals

Prior to interleaving after the decoder1 computations, and prior to deinterleaving after the decoder2 computations, a LIFO buffer is essential, as observed from the data access sequence. The memory design of the turbo encoder and decoder is thus regular, using either FIFO or LIFO buffers for storage. This identification is beneficial because regularity is one of the preferred design features in the chip implementation phase.

4.7 On-chip memory evaluation
In the turbo coding application let the message be of frame size N. The width of each data element to be stored is assessed in order to compute the total storage required for the application. The turbo encoder memory requirement depends on the message frame length N:

M_{enc} = N w_{intl} + N w_{ys} + N w_{yp1}   (4.1)

where M_{enc} is the memory required by the turbo encoder, N is the frame length of the input information, w_{intl} is the word length of the interleaver at the encoder side, w_{ys} is the word length of the systematic data and w_{yp1} is the word length of the parity bits from the encoder1. The widths of the interleaver memory and of the memory required to store the systematic data and the parity data from the encoder1 are one bit each.

The memory required by the decoder can be expressed as

M_{dec} = M_{datasupply} + M_{decoder1} + M_{decoder2}   (4.2)

where M_{dec} is the on-chip memory required for the turbo decoder and M_{datasupply} is the memory required in the data supply unit to access the data elements:

M_{datasupply} = N w_{ys} + N w_{yp1} + N w_{yp2} + N w_{ysint}   (4.3)

and

M_{decoder} = N w_{γ} + N w_{ext} + S N w_{α}   (4.4)

The input data at the decoder is demultiplexed and stored in memory for further usage during the iterative turbo decoding procedure. N is the frame length of the information sequence and w_{ys} is the width of the systematic bits from the encoded data. w_{yp1} and w_{yp2} are the widths of the noisy observed parity data from the encoder1 and the encoder2 respectively, and w_{ysint} is the width of the interleaved data elements of the systematic information. w_{γ} is the width of the data element used for the branch metric computation; the branch metric symmetry [62] is utilized to reduce the branch metric memory requirement. w_{ext} is the width of the apriori information, namely the extrinsic information which is fed back from the decoder1 to the decoder2 during the iterative decoding process. w_{α} is the width of the forward metric computation and S is the number of states used in the turbo coding process. Since the β values are computed on the fly, no storage is required for them.
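For concreteness, the following small sketch evaluates these expressions. The widths and state count are assumed values (chosen to match the bit optimal widths of Chapter 5 and the memory widths of Chapter 6), so this is an illustration of the bookkeeping, not a definitive sizing:

    // Evaluates the memory expressions (4.2)-(4.4) with assumed widths.
    #include <cstdio>

    int main() {
        const int N = 128;         // frame length
        const int S = 4;           // trellis states stored per index (assumed)
        const int w_ys = 6, w_yp1 = 6, w_yp2 = 6, w_ysint = 6;  // input widths
        const int w_gamma = 13, w_ext = 10, w_alpha = 18;       // metric widths

        int m_supply  = N * (w_ys + w_yp1 + w_yp2 + w_ysint);      // eq. (4.3)
        int m_decoder = N * w_gamma + N * w_ext + S * N * w_alpha; // eq. (4.4)
        int m_dec     = m_supply + 2 * m_decoder;                  // eq. (4.2)
        std::printf("data supply: %d bits, per decoder: %d bits, total: %d bits\n",
                    m_supply, m_decoder, m_dec);
        return 0;
    }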

4.7.1 Reducing the memory accesses in decoding process
It is observed that the turbo decoder design has a complex control flow. From the data access pattern analysis described above it is clear that there are a number of memory accesses, depending on the computation to be performed. This section explores how a modification in the beta computation unit reduces the memory accesses. Any attempt to reduce the on-chip memory power is definitely beneficial in the design phase. The total energy consumption is the product of the energy per access and the number of accesses [90, 91]; hence the energy consumption is directly proportional to the number of accesses (read accesses/write accesses) [92].

Consider one constituent decoder stage. Table 4.1 indicates the number of reads and writes to the memories, namely the α and γ buffers, for encoded information of size N = 128 (test case).

Table 4.1: Memory accesses with/without the β unit modification (write, read1 and read2 accesses to the α buffer and write and read accesses to the γ buffer; with the modified β unit the read accesses from the llr unit to the γ buffer are eliminated, reducing the total number of accesses)

One way for the llr unit to obtain the γ values is to issue a separate read request to the γ buffer each time. Should this part be implemented directly, accessing the required γ values would result in an increased number of accesses. Considering the number of accesses with the β unit modified, the total number of accesses is reduced, as illustrated in Table 4.1. The following analysis quantifies the reduction in energy consumption due to the reduction in memory accesses; the energy consumption here refers only to the energy for memory accesses.

Let E_{dec} be the energy consumed by one decoder and E_{comp} the sum of the energy consumed by the α, β, γ and llr computation units. Let e_{α} and e_{γ} be the energy per access of the α LIFO buffer and the γ LIFO buffer respectively, and let n_{α} and n_{γ} be the corresponding numbers of accesses from Table 4.1. Then

E_{dec} = E_{comp} + E_{mem}   (4.5)

E_{mem} = n_{α} e_{α} + n_{γ} e_{γ}   (4.6)

Equation (4.6) gives the energy consumed for memory accesses with the unmodified β unit.

(Figure 4.6 shows the signal assignment schematics: (a) the unmodified β unit, with inputs clk, γ10in, γ11in and ygamframeavail and outputs β00, β01, β10 and β11; (b) the modified β unit, with the same ports plus the additional outputs γ10out and γ11out.)

Figure 4.6: (a) Unmodified β unit (b) Modified β unit

Since with the modified β unit there are no read requests to the γ buffer from the LLR unit, there are no read accesses from the LLR unit to the γ buffer, and the energy consumed for memory accesses with the modified β unit is the same expression evaluated with the reduced access counts:

E'_{mem} = n'_{α} e_{α} + n'_{γ} e_{γ}   (4.7)

The corresponding total decoder energy with the modified unit is given by (4.8). Assume in (4.5) that e_{α} is one unit and e_{γ} is one unit. For a frame length of N = 128, substituting the access counts of Table 4.1 into (4.6) and (4.7) yields the access energy of the unmodified case (4.9) and of the modified case (4.10).
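A small sketch of this bookkeeping follows. The access counts below are placeholders (the thesis takes them from Table 4.1, whose exact counts are not reproduced here), with the per-access energies normalized to one unit as in the text:

    // Access-energy comparison in the style of equations (4.6)-(4.7).
    #include <cstdio>

    int main() {
        const double e_alpha = 1.0, e_gamma = 1.0;  // energy per access (one unit)
        // Hypothetical per-frame access counts for N = 128: the modified beta
        // unit removes the llr unit's read accesses to the gamma buffer.
        double n_alpha = 512, n_gamma_unmod = 768, n_gamma_mod = 384;
        double e_unmod = n_alpha * e_alpha + n_gamma_unmod * e_gamma; // eq. (4.6)
        double e_mod   = n_alpha * e_alpha + n_gamma_mod   * e_gamma; // eq. (4.7)
        std::printf("reduction: %.1f%%\n", 100.0 * (e_unmod - e_mod) / e_unmod);
        return 0;
    }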

Figure 4.7: Area distribution pie chart representation

Thus, from (4.9) (unmodified β unit) and (4.10) (modified β unit), it is observed that the energy consumption for memory accesses is reduced by 33.24% in the modified β unit case for N = 128. The savings will increase as the frame size is increased. This reduction is accomplished at the higher abstraction level, even before the actual chip implementation. To describe the modification of the β computation unit, the signal assignment schematics are shown in Figures 4.6(a) and 4.6(b). The key modification is that the γ data values are passed through the β computation unit with the required synchronization.

4.8 Synthesis Observations
In order to find the area distribution, a synthesizable model of an encoder and decoder was developed and synthesized. The test case refers to a frame length of 128, a bit width of 16 for the design and a four state encoder. An important observation related to design flow is that, since the complete model was available in SystemC, the transition to VHDL was faster. Design details such as the number of modules, the input and output port descriptions and the modularity of the tasks to be synthesized were also available from SystemC. It was interesting to see that the block diagrams defined in SystemC matched the results obtained after synthesis. The design was synthesized with a XILINX Virtex V800BG560-6 as the target device. One decoder utilized a total of 4507 LUTs (18.17 MHz), of which the complex llr unit occupied the largest share of the total LUTs, followed by the β, α and γ units. These were the important computation units that constituted the major part of the total area, and this distribution is shown in the pie chart of Figure 4.7.

4.9 Summary
An improved design of the turbo decoder using buffers as the intermediate storage devices is presented. By modifying the backward computation unit, the number of memory accesses could be reduced, lowering the energy consumption. The usefulness of architecture definition using SystemC as the CAD tool is depicted; this tool enabled us to create models which could exploit the parallelism of the algorithm at the system level design stage. The study of the turbo decoder data access patterns for the compute intensive blocks reveals practical solutions for this application specific design. To get a feel for the relative area occupied by the different modules of our design, both the encoder and a decoder were synthesized. It is observed that the llr (log likelihood ratio) unit occupies the largest share of one decoder area.


Chapter 5 Architectural optimization
5.1 Objectives
In application specific designs, the impact of architectural parameters on the power consumption and area is significant [93, 94]. With the increasing demand for low power battery driven electronic systems, power efficient design is presently an active research area. Batteries contribute a significant fraction of the total volume and weight of a portable system. Applications like digital cellular phones, modems and video compression are compute and memory intensive. The basic components in these are the input-output units, on-chip memory, off-chip memory and the central processing unit. Power dissipation is highly dependent on the switching activity, load capacitance, frequency of operation and supply voltage. Further, design parameters like memory size and bit width influence the power consumption significantly [95, 96]. The power dissipation of the application design needs to be lowered using suitable architectural optimizations. The goal of this work is mainly to quantify the saving in area and the reduction in power consumption obtained using bit width optimization, while ensuring that there is no performance degradation of the turbo decoder in terms of bit error rate. With this motivation, the 3G turbo decoder design is implemented and the benefits of bit width variation are studied for three different bit width architecture designs.

The rest of the chapter is organized as follows. Section 5.2 describes the related work. Section 5.3 gives a brief introduction to the numerical precision analysis. In section 5.4 the HW solution of the turbo decoder design is described, leading to a bit optimal, area efficient and power efficient design. In section 5.5 the experimental flow is described. Section 5.6 gives the results of the bit width optimization. A summary is presented in section 5.7.

5.2 Related work
Earlier research has mainly focused on evaluating the performance of turbo codes, in terms of BER, considering the bit widths of some data modules. There is no reference on the impact of bit width optimization on the power consumption and area estimates of the turbo decoder design. Yufei et al. [65] quantify the effect of quantization and fixed point arithmetic on the performance of the turbo decoder; only the input bit-width effect on BER is dealt with in this work. In [63] the determination of the internal data width is studied using a C model. They provide an estimation of the minimum internal data width required for a few data modules in a turbo decoder; however, their work is restricted to the performance analysis of the turbo codes and does not address the power issues. In [84] the authors present a survey of VLSI architectures and deal with width precision and bounds on the data metrics in the turbo decoder. They also state that the number of bits used to code the state metrics determines both the hardware complexity and the speed of the hardware, and they emphasize the need for techniques which can minimize the number of bits required without modifying the behavior of the algorithm. They examine the issues of re-scaling the state metrics in the turbo decoder, come up with a procedure to obtain the upper bound for the bit-width of the data metrics, and propose an analytical result on the bit-width study. Although sufficient study has been done on bit-width analysis, either for some of the data metrics or for one single data metric, none of these references quantify the impact on power and area for all the data modules in the turbo decoder.

The problem addressed here is how to arrive at an optimal bit-width design without degrading the performance of the turbo decoder, and its impact on the power and area of the design. It is established that a suitable bit-width can result in a significant reduction in the power consumption and area.

5.3 Bit-width precision
To limit the storage and computation requirements, it is desirable to use the optimum number of bits for the representation of the data modules. For the hardware realization of the 3GPP turbo decoder, finite precision analysis is performed on the various data elements. It is important in the design of the turbo decoder to see whether the bit-width reduction of the metrics leads to performance degradation in terms of BER, and then to determine the upper bounds of the data metrics of the design. The various data elements in the turbo decoder are the branch metric γ, the forward metric α, the backward metric β, the log likelihood ratio llr and the extrinsic information. The α and β data metrics are computed recursively, adding the γ metrics as the index value traverses from 0 to N−1, where N is the frame length of the message. The value of α grows as the index moves from 0 to N−1, and the β metric increases as the index value moves backward from N−1 to 0. The largest value observed in the simulation (VHDL) results is recorded and used to fix the bit-width of each data element. Another observation is that as the number of iterations increases, the α and β values keep growing. Six iterations of the turbo decoding process are used to retrieve the message sent from the encoder side. A similar analysis is done for the log likelihood ratio, the extrinsic values and the γ metric to arrive at precise bit-widths. One could reduce the bit-widths below the values defined here by using normalization techniques, but obtaining the normalized values of each data element needs extra operations, additional circuitry and added delay [97]. The goal of this work is mainly to quantify the saving in area and the reduction in power consumption using bit-width optimization. The bit-widths of all the data metrics in the turbo decoder are optimized and the impact on a 32 bit turbo decoder architecture, a 24 bit turbo decoder architecture and the bit optimal architecture is studied.
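A sketch of this width-fixing rule follows. The peak magnitudes below are assumed values, chosen only to illustrate how widths such as 18 and 13 bits can arise; they are not the thesis measurements:

    // Width fixing from observed peak magnitudes: sign bit plus the bits
    // needed to represent the largest simulated value of each metric.
    #include <cstdio>
    #include <cmath>

    int bits_for(double max_abs) {
        return 1 + (int)std::ceil(std::log2(max_abs + 1));  // sign + magnitude
    }

    int main() {
        double max_alpha = 98304;  // assumed peak |alpha| over six iterations
        double max_gamma = 3969;   // assumed peak |gamma|
        std::printf("alpha needs %d bits, gamma needs %d bits\n",
                    bits_for(max_alpha), bits_for(max_gamma));
        return 0;
    }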

5.4 Design modules
In this section the VHDL model of the turbo decoder design that is used for simulation and synthesis is described. The HDL design can be divided into three distinct module sets as follows.

Data supply unit : As soon as the decoding commences, the encoded data information is demultiplexed and separated into the systematic received data (ys), the parity data elements from the encoder1 (yp1) and the parity data values from the encoder2 (yp2). The systematic information is to be interleaved, since this interleaved data is one of the inputs to the decoder2 section. Essentially, in the data supply unit modules the data inputs required by the decoder1 and the decoder2 are made ready for usage.

Decoder1 : The main blocks in the decoder1 are the γ, α, β, llr and extrinsic units, the intermediate α storage units, the intermediate γ storage units and the associated feedback units, namely the extrinsic interleaver and its storage units. The storage units are modeled as FIFO/LIFO following the data access pattern analysis presented in the previous chapter.

Decoder2 : The main blocks in the decoder2 are the same as in the decoder1, except for the different inputs at the various computational blocks. Another addition is the decision unit, which gives the final estimates of the retrieved message. The output of the decoder2 is stored to be fed back to the decoder1 in the next iteration.

5.4.1 Selector module
The typical algorithmic behavior of the turbo decoder requires a specific selector module to continue from the second to the sixth iteration. When the encoded data is received at the input of the decoder, it is first demultiplexed, and the respective systematic, parity1, parity2 and interleaved systematic data are stored during the first iteration. From the second iteration till the sixth iteration only the decoder1 and the decoder2 operate iteratively. Hence, during the first iteration the selector module (multiplexer) takes the input signal from the data supply unit and gives a start signal to the first module of the decoder1 (the γ unit). During the subsequent iterations, the selector module receives the input signal from the last computational unit of the decoder2 (the extrinsic interleaver unit) and enables the γ unit.

5.4.2 Top level FSM controller unit
The turbo decoder operates on blocks of data. The top level controller, whose purpose is to activate the modules in the proper order, manages the turbo decoding process. The turbo decoder, which consists of the two decoders, is iterative in nature, and the simulations have to be performed for six iterations; hence a top level finite state machine (FSM) controller unit is essential. An FSM consists of a set of states, a start state and a transition signal to move from the current state to the next state (in the turbo decoder, from one iteration to the next). The FSM appears explicitly in the turbo decoder design for the control and the sequencing of the iterations. When, at the data flow level, the iteration control and the data computation of the decoder system are separated, the description of the FSM has a close correspondence to the typical behavior of the algorithm. Figure 5.1 depicts the state diagram of the FSM used in the turbo decoder design to control the number of iterations. The state diagram has state S0, which corresponds to the first iteration; after six iterations, the next block of data is received and processed. If the frame start signal is equal to zero, the decoding process does not commence. If the frame start signal is equal to one and the decoder module gets the enable signal, the decoding starts. The last module in the decoder2 gives a valid bit signal to the FSM controller unit, and the FSM iteration controller unit sends an enable signal to the decoder1 so that the next iteration starts. In this way the number of iterations is controlled using the FSM unit.
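A minimal sketch of this iteration-control FSM (not the thesis VHDL; the struct and step rule are illustrative) is shown below. States S0..S5 correspond to the six decoding iterations; valid_bit from the last decoder2 module advances the state, and select_bit re-enables the decoder1, following the signal names of Figure 5.1:

    enum State { S0, S1, S2, S3, S4, S5 };

    struct IterationFsm {
        State state = S0;
        bool select_bit = false;

        void step(bool frame_start, bool valid_bit) {
            if (!frame_start) return;        // decoding has not commenced
            select_bit = valid_bit;          // re-enable decoder1
            if (valid_bit)                   // one iteration per valid_bit
                state = (state == S5) ? S0 : static_cast<State>(state + 1);
        }
    };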


(Figure 5.1 shows states S0 to S5 with transitions labeled frame_start = 0, frame_start = 1 and valid_bit = 1 / select_bit = valid_bit. In the legend: frame start is the input bit to the FSM controller unit; valid bit is the input bit signal from the last module of the decoder2 to the controller unit; select bit is the output bit signal from the FSM controller unit to the first module of the decoder1.)

Figure 5.1: State diagram representation of the FSM used in the turbo decoder design

Memory modules The turbo decoder requires a number of memory modules in its design. The address of the memory location from which the data has to be read, and to which it is written, is specified in the module, along with the appropriate conditions. During synthesis, the internal Block RAM memory of the XILINX VirtexII is used for realizing these modules. Test bench model The complete turbo decoder application was modeled in C and tested using test data generated by the encoder and decoder models. Where individual modules in the decoder needed to be tested, test benches for these were created, with the test data obtained from the intermediate results of the C model. This was possible because both the C and VHDL models matched in terms of module boundaries, enabling the transfer of execution results from the abstract algorithmic model (C) to test data for the lower level behavioral model (VHDL).

(Figure 5.2 shows the flow: the VHDL design is simulated with ModelSim and synthesized with Leonardo for the XCV800 device, package bg560; the synthesis produces the area estimates in LUTs, the frequency of operation and the .ncd file with the physical constraints; these, together with the input capacitance, the simulated data and the activity factor, drive the XPower tool, which produces the power estimates.)

Figure 5.2: Design flow showing the various steps used in the bit-width analysis

5.5 Design flow
The bit-width analysis is performed using the methodology and tools shown in Figure 5.2. The VHDL design is simulated with ModelSim to test the functional correctness of the design. ModelSim is a HDL simulator from Mentor Graphics and supports both ASIC and FPGA target technologies. The test bench vectors being the same as those used for the C model, the intermediate data testing becomes easy. The turbo decoder is simulated for six iterations to obtain the complete run statistics of the decoding process. The first step in the Leonardo synthesis is to choose the technology library. Though it is clear that FPGA technology is not suitable for power efficient implementations, it was decided to map onto it for ease of implementation and to get a feel of the relative savings. In the pre-synthesis step, simple optimizations such as constant propagation and unused logic elimination are done; these steps are carried out by the synthesis tool to remove or reduce redundant logic. Area reports and timing reports are generated for the design. Since it is necessary to place and route the design, the edif file containing the netlist information is generated and used to place and route the design. The .ncd file, which is the FPGA design file providing the design topology and physical resource usage, is also required; this file, along with the constraint file, forms the input to the Xilinx XPower tool. In the turbo decoder design there are many on-chip memory data storage units. With Leonardo Spectrum, the process of component instantiation of RAMs is eliminated: RAMs are automatically inferred from a simple RTL RAM model, so the extra effort of separately instantiating the Block RAMs is avoided, which reduces the synthesis time. Leonardo Spectrum does not require separate instantiation of the RAMs in the VHDL programming environment; the VHDL RTL behavior provides the tool with the read and write signals from which it infers the RAM usage.

5.5.1 Power estimation using XPower
Power estimation tools at various abstraction levels, from transistor level to system level, are different [98, 99, 100, 101, 102, 103, 104]. Methods used in instruction level power analysis are reported in [105, 106, 107, 108]. Power estimation for high level synthesis is described in [109, 110, 3, 4]. Since the FPGA design environment is used, the XPower tool [111, 112, 113, 114] is utilized for the power estimation of the design modules. The design is first modeled in VHDL and then synthesized using the Leonardo tools. XPower brings an extra level of design assurance to low-power device analysis. Accurate power estimation during programmable design is done with XPower: it reads in either pre-routed or post-routed design data and provides device power estimation either by net or for the overall device, with results provided in report or graphic format. The VHDL designs were simulated and only synthesized. The estimates are for 0.18 µm technology, and the voltage used in our estimation is 2.5V. There is a facility to set the input capacitance and the frequency of operation for which the design power estimation is to be done, and a provision to give a specific switching activity for the estimation; the estimates for 20% and 50% activity factors are found. XPower calculates the power as a summation of the power consumed by each component in the design. The power consumed is the product of the capacitance, the square of the voltage, the activity factor and the frequency of operation, and is reported in mW.
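As a worked instance of the per-component relation just stated (P = C V^2 a f), with illustrative numbers rather than tool output:

    // Dynamic power of one component: capacitance * voltage^2 * activity * frequency.
    #include <cstdio>

    int main() {
        double C = 10e-12;  // load capacitance in farads (assumed)
        double V = 2.5;     // supply voltage in volts
        double a = 0.5;     // switching activity factor
        double f = 20e6;    // clock frequency in hertz
        double P = C * V * V * a * f;           // dynamic power in watts
        std::printf("P = %.3f mW\n", P * 1e3);  // 0.625 mW for this component
        return 0;
    }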

5.6 Experimental Observations
In this section the synthesis results and the power estimates of the various individual modules of the turbo decoder are presented. To get an insight into the area, power and timing values of the individual modules, each of the design modules is first simulated, then synthesized, and XPower is then used for the power estimation of the design. The synthesis results of the data supply unit and one complete decoder are presented. The results are given for three bit-width architectures, namely the 32 bit wide design, the 24 bit wide design and the bit optimal design, and the area and power savings are evaluated. The main goal here is to quantify the power consumption and the benefits of the bit-width variation without compromising the performance (in terms of BER).

5.6.1 Impact of bit optimal design on area estimates
The designs were synthesized using Xilinx XCV800 device, package bg560 and speed -6. The power estimates are with Xilinx XPower version 2.37 with 3.3V supply. The area in terms of LUTs and the frequency of operation is obtained from the Leonardo synthesis tool suite. Although individual modules operated at a very high frequency range, the power estimates are for a frequency of 20 MHz, so that there can be a common comparison for the power estimates of all the design modules. First the synthesis results of the modules are presented.
 

Table 5.1 illustrates the area and frequency of operation for various bit-widths of the γ design module. Since the designs generated large combinational logic, very little variation is observed in the frequency of operation with the bit-width variation. In all the synthesis experiments, the bit-widths of the channel input data elements (ys, yp1, yp2) are kept at 6 bits (after studying the effect of quantization on BER in the first chapter).

Table 5.1: γ unit synthesis results with the input data bit-width set to 6

Architecture        Inputs to γ  Bit-widths  Output  Area (LUTs)  Critical frequency (MHz)  Area reduction (%)
32 bit design       z1 ys yp1    32 6 6      γ (32)  132          108.2                     Reference
24 bit design       z1 ys yp1    24 6 6      γ (24)  108          114.4                     –
Bit optimal design  z1 ys yp1    10 6 6      γ (13)  73           124.2                     44.6

Column 1 represents the type of architecture. Column 2 lists the inputs to the γ computational module. Column 3 shows the respective bit-widths of the inputs of column 2. Column 4 indicates the output metric of the design (with its bit-width). Column 5 shows the area occupied by the unit. Column 6 gives the critical frequency in MHz. Column 7 gives the area reduction in % for the bit optimal design with respect to the 32 bit-width design. The area reduction comparing the 32 bit and the bit optimal design for the γ module is 44.6%, the frequency of operation being 124.2 MHz for the bit optimal design.

The results of the forward metric unit (α unit) are shown in Table 5.2 for the 32 bit design, the 24 bit design and the bit optimal design. The forward metric unit recursively calculates the α metric values using the branch metric γ values. The area reduction for the α module is 38.7%.

Table 5.2: Area reduction of the α unit

Architecture        Input to α  Bit-width  Output  Area (LUTs)  Critical frequency (MHz)  Area reduction (%)
32 bit design       γ           32         α (32)  687          66.3                      Reference
24 bit design       γ           24         α (24)  553          67.1                      –
Bit optimal design  γ           13         α (18)  421          68.1                      38.7

The backward metric unit synthesis results are illustrated in Table 5.3. The input to the β unit is the branch metric value γ. The area estimates for the β unit are 1356 LUTs, 1088 LUTs and 814 LUTs for the 32 bit, 24 bit and bit optimal designs respectively; the area reduction for the β module is 39.9%.

Table 5.3: β unit area and frequency estimates

Architecture        Input to β  Bit-width  Output  Area (LUTs)  Critical frequency (MHz)  Area reduction (%)
32 bit design       γ           32         β (32)  1356         55.0                      Reference
24 bit design       γ           24         β (24)  1088         56.3                      –
Bit optimal design  γ           13         β (18)  814          57.8                      39.9

The extrinsic information unit, which computes the apriori information for the other decoder, takes the systematic data values, the apriori information from the other decoder and the log likelihood ratio as its inputs during the decoder1 operation. Its area varies from 145 LUTs to 101 LUTs, moving from the 32 bit design to the bit optimal design, an area reduction of 30.3%. For the llr unit the area estimates and the area reduction are shown in Table 5.5; the area reduction for the llr unit is 39.0%. The synthesis observations of one decoder with the data supply unit were also obtained. Decoder2 is identical in nature to decoder1 in the turbo decoder structure, except for a few differences in the input data metrics to the computational units. The complete iterative decoder with its constituent decoder1 and decoder2 units is simulated using ModelSim for functional verification of the turbo decoder system over a total of six iterations.
 

 

Table 5.4: Extrinsic unit synthesis results

Architecture        Inputs to extrinsic  Bit-widths  Output     Area (LUTs)  Critical frequency (MHz)  Area reduction (%)
32 bit design       z1 ys llr            32 6 32     extr (32)  145          87.2                      Reference
24 bit design       z1 ys llr            24 6 24     extr (24)  129          87.5                      –
Bit optimal design  z1 ys llr            10 6 10     extr (10)  101          88.1                      30.3

Table 5.5: Synthesis observations of the LLR unit

Architecture        Inputs to LLR  Bit-widths  Output    Area (LUTs)  Critical frequency (MHz)  Area reduction (%)
32 bit design       α β γ          32 32 32    llr (32)  2392         18.8                      Reference
24 bit design       α β γ          24 24 24    llr (24)  1927         19.4                      –
Bit optimal design  α β γ          18 18 13    llr (10)  1459         20.3                      39.0

Table 5.6: Decoder unit synthesis observations

Architecture        Data metrics (α β γ llr extr; ys,yp1,yp2)  Bit-widths          Area (LUTs)  Critical frequency (MHz)  Area reduction (%)
32 bit design       α β γ llr extr; ys,yp1,yp2                 32 32 32 32 32; 6   5216         18.7                      Reference
24 bit design       α β γ llr extr; ys,yp1,yp2                 24 24 24 24 24; 6   3936         19.3                      –
Bit optimal design  α β γ llr extr; ys,yp1,yp2                 18 18 13 10 10; 6   2816         20.2                      46

The verification of the intermediate data metrics is done as the iterations progress. The synthesis of the data supply unit and one decoder structure is done to illustrate the area estimates of the design. Table 5.6 shows the results of the synthesis of the decoder unit: the 32 bit design occupies 5216 LUTs and the bit optimal design occupies 2816 LUTs, which demonstrates an area reduction of 46%. In Figure 5.3, the area estimates of the decoder design in terms of LUTs for the 32 bit design and the bit optimal design are illustrated. The last two bars in the graph give the area estimates of one decoder plus the data supply unit of the complete turbo decoder system; there is a significant reduction in area for the bit optimal case. The major computational blocks in an individual decoder are the γ, α, β, llr and extrinsic units. From the graph it is observed that the llr unit occupies the largest area.
 

5.6.2 Power reduction due to bit-width optimization

Here the power estimates of the turbo decoder design modules using the Xilinx XPower tool, version 2.37, are presented. Each design module's physical constraint file and netlist is generated using the place and route option in the Leonardo synthesis tool suite. The power estimates are for a frequency of operation of 20 MHz. The power consumption is recorded for switching activities of 0.2 and 0.5, with the voltage set to 3.3V. Table 5.7 shows the power estimates of the designs. From the data it is observed that the bit optimal design is significantly power conserving: a power saving in the range of 33.9% to 43.1% is observed. Amongst the individual computational units of one decoder (γ, α, β, llr and extr), it is observed that llr is the largest power dissipating unit, with a total of 42.4% of the power. The power reduction % shown in the last column of Table 5.7 compares a 32 bit design with 50% switching activity against a bit optimal design with 50% activity factor. In Figure 5.4, the power estimates (in mW) of the design modules are indicated; from the graph it is observed that the llr unit dissipates the largest power when one decoder is considered individually. In Figure 5.5 the area and power saving in the turbo decoder structure due to the bit-width optimization is illustrated: the area reduction without performance degradation is as high as 46%, and a significant power reduction ranging from 33.9% to 43.1% is observed.

In Tables 5.1 to 5.7, the 24 bit design is included to show the variation in power and area when compared to the 32 bit design and the bit optimal design. The 24 bit design dissipates less power than the 32 bit design and more power than the bit optimal design; performance wise it is very close to the 32-bit design.

Design                    Power (mW) at 0.2  Power (mW) at 0.5  Power reduction (%)
32 bit γ unit             111                115                Reference
24 bit γ unit             97                 100                –
Bit optimal γ unit        75                 76                 33.9
32 bit extr unit          123                126                Reference
24 bit extr unit          103                106                –
Bit optimal extr unit     73                 75                 40.5
32 bit α unit             223                241                Reference
24 bit α unit             178                191                –
Bit optimal α unit        140                150                37.8
32 bit β unit             303                325                Reference
24 bit β unit             240                257                –
Bit optimal β unit        172                185                43.1
32 bit one decoder unit   490                906                Reference
24 bit one decoder unit   404                728                –
Bit optimal decoder unit  330                583                35.6

Table 5.7: Decoder power estimates varying the bit-width and the switching activity of the design

5.7 Summary
In this chapter the low cost and low power design issues of the turbo decoder are addressed. A bit optimal design, based on a study of the numerical behavior of the various data elements in the decoder, is presented. Three different bit-widths, namely the 32 bit, 24 bit and bit optimal designs, were compared. The power savings in the various modules of the decoder, and in the complete data supply unit plus one decoder unit, indicate a significant (33.9% to 43.1%) power saving. The area reduction ranges from 30.3% to 46%.

Figure 5.3: Depicting the area for the 32 bit design vs the bit optimal design

Figure 5.4: Power estimates of the turbo decoder design

Figure 5.5: Area and power reduction due to bit-width optimization of the turbo decoder design

Chapter 6 System Level Power Optimization
6.1 Objectives
Power optimization at different levels involves tuning different circuit parameters, ranging from transistor size to voltage scaling. In this chapter, system level power optimization options are considered. At the system level one can take advantage of power saving techniques, notably the power shut-down mechanism. The objective of this work at the system level is to develop a strategy for avoiding wasted power during the iterative decoding process. Application awareness is a pre-requisite for selectively powering down the turbo decoder design modules; this is achieved by analysing which design modules need to be in active mode and which can be in passive mode. Branch metric computation is one of the tasks in the decoding process. The number of branch metrics to be computed in an eight state turbo decoding mechanism is reduced from eight to four due to symmetry (Chapter 2, section 2.3). Although it is intuitive that there is a saving in power due to this simplification, here the power saving, which forms a low power design technique for the application, is quantified. Other researchers have proposed the simplification, but no study has been conducted on its quantitative impact on power. The rest of the chapter is organized as follows. Section 6.2 describes the related work.

Section 6.3 discusses the general design issues of power shut-down techniques. In section 6.4 the system level power optimization in turbo decoder is described. Experimental set up and results are given in section 6.5. The chapter ends with the presentation of the summary in section 6.6.

6.2 Related work
The power shut-down technique is presented for low power microprocessors by D. Soudris et al. [32]. They also report a power conscious methodology for the implementation of digital receivers using the guarded evaluation technique; guarded evaluation [115] is a shut-down technique at the RT level that does not require additional logic for implementation. The goal of the work in [116] is to introduce power management into the scheduling algorithms used in behavioral level synthesis. Behavioral synthesis comprises the sequence of steps by means of which an algorithmic specification is translated into hardware. These methods are called data-dependent power down methods, since the shutting down of logic is decided on a per clock cycle basis given an input vector. Their work is centered around the observation that scheduling has a significant impact on the potential for power savings via power management. In real time systems the power shutdown technique has the overheads of storing the processor state, turning off the power and a finite wake up time; implementing accurate sleep-state transitions is essential in carrying out the power shutdown mechanism [117]. These well known results of [32] are used here in proposing a power shut-down technique that deals with the issues of the power manager unit using the timing analysis of the HDL model of the decoder design. Such analysis is not possible at the system level, since cycle-accurate timing is not available from the C domain model. The branch metric simplification is presented in [61, 84, 62]; their work reveals that the branch metric simplification can be conveniently done without any performance degradation. Our focus is to show the benefit in power savings from this simplification.

6.3 General design issues - Power Shut-Down Technique
Shutting down power supply reduces power dissipation to zero [118, 119]. This is the most effective way to save power in idle modules. Several conditions have to be fulfilled to employ this methodology.

1. The power switch will have to be designed to withstand the power on and power off transients, which cause noise and voltage drops; these effects must be adequately shielded to avoid functional failures [120]. It takes a delay of DT (Delay Time) before the supply voltage stabilizes in the switched-back module, which makes the methodology applicable only to components with an idle time greater than DT. As a result, the timing constraints must be considered [121].

2. If the design contains any storage units, their values would be lost during power down [120]. For circuits that cannot be allowed to forget, cleverly designed latches should be used for the retention of stored values during the sleep mode [122, 123, 124]. Floating nodes may incur short circuit current, hence they should be avoided to reduce the static power dissipation [83].

3. Power management is employed to reduce the total power consumption by putting certain blocks in the standby mode when other parts are still operational. This has become necessary in high performance processors to limit the power density of the chip for associated cooling and reliability issues [125].

4. The wake up latency is the time for a block to become usable after being in an inactive state. A faster wake up time is usually preferable to a faster transition time because it reduces any performance penalty. To exploit a power shut-down technique, the microarchitecture must be designed so that blocks preferably give early notice of when they are to be reawakened [126].
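A minimal sketch of the shut-down criterion implied by items 1 and 4 (the function and cycle units are illustrative assumptions, not a specific implementation): a block is powered down only when its predicted idle interval exceeds the supply settling delay DT plus the wake-up penalty.

    // Returns true when gating the supply is worthwhile for this idle window.
    bool worth_shutting_down(long idle_cycles, long dt_cycles, long wakeup_cycles) {
        // The idle window must cover both the settling delay and the wake-up
        // penalty; otherwise shutting down costs more than it saves.
        return idle_cycles > dt_cycles + wakeup_cycles;
    }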

6.4 Turbo Decoder : System level power optimization
Keeping in mind the general design issues of the power shut-down technique, the sleep and wake time analysis is done for the turbo decoder application. When the power shut-down technique is to be used by a power manager, it is desirable that all the details of the timing analysis are taken into account so that the technique can be applied effectively. Figure 6.1 depicts the control flow diagram of the module interaction, concentrating on the on-chip memory read and write control signals. Each node in the diagram illustrates a specific task. The boundaries of the time slots are t1 to t9 for one iteration of the decoder. The arcs from one node to another indicate the control signals: a start signal if the target represents a computation node, a pop signal for a memory read operation from a computation node, or a push signal for a memory write operation from a computation node. The activities performed in each time slot are listed below.

t1 : During time slot t1 the encoded data observed through the AWGN channel is demultiplexed into the systematic bits, the parity bits from encoder 1 and the parity bits from encoder 2. The separated data values are stored in the ys, yp1 and yp2 memories of widths w_{ys}, w_{yp1} and w_{yp2} respectively.

t2 : Time slot t2 corresponds to the computation of the branch metric γ, the forward metric α and the interleaving of the systematic data (decoder1 units).

t3 : Time slot t3 is mapped to the backward metric unit β and the log likelihood ratio llr and extrinsic computation units (decoder1 units). Time slots t2 and t3 together constitute the decoder1 operations.

t4 and t5 : During t4 and t5 the interleaving operation and the storing of the interleaved data to the on-chip memory are performed; their job is to provide the apriori information to the decoder2.

t6 and t7 : The time slots t6 and t7 correspond to the basic computation units of the decoder2. Time slot t6 corresponds to the computation of the branch metric γ, the forward metric α and the interleaving of the systematic data, while time slot t7 is mapped to the backward metric unit β and the log likelihood ratio llr and extrinsic computation units of the decoder2.
(Figure 6.1 shows the control flow diagram: the demux node writing the ys, yp1 and yp2 memories in t1; the γ, α, β, llr and extr nodes of the decoder1 with their start, push (write) and pop (read) signals to the γ, α, extr and interleaver memories in t2 and t3; the interleaving in t4 and t5; the corresponding decoder2 activity in t6 and t7; the deinterleaving in t8 and t9; and, in bold broken lines, the feedback from the decoder2 back to the decoder1. The numbers on the right give the count of active memory modules in each time slot.)

Figure 6.1: Illustrating the turbo decoder timing slots, which helps to define the sleep time and the wake time of the on-chip memory modules (t1 to t9 = time slots)

t8 and t9 : Time slots t8 and t9 correspond to the deinterleaving operation, which is basically responsible for giving the apriori information back to the decoder1. The signals in bold broken lines indicate that the decoder2 feeds back information to the decoder1 modules during the turbo decoding process. The decision bits (the retrieved message) are obtained during this time slot.

Sleep time is the duration for which the modules are not in active mode. Only the on-chip memory modules, in particular their memory read and memory write operations, are considered; thus sleep time in this context is when the memory read or memory write control signal is zero. Wake time is the active period of the memory modules: the memory read or memory write signal is logic one for this duration. The total number of active memory modules during each time slot is obtained from the control flow timing diagram; the numbers on the right side of the timing diagram indicate the active memory modules during that time slot. Here, for the specific case of the 3G turbo decoder, these numbers are assessed. Thus the wake time and sleep time analysis assists in the on and off timing decisions of the power sensitive budgeting using the power shut-down technique. The task graph for the 3G turbo decoder is shown in figure 6.2; it indicates the tasks vs time slot details of the decoder during the iterative decoding and is drawn using the information from the timing diagram. The task graph helps in computing the power budget and the power savings for the shut-down mechanism.
 

6.5 Experimental setup
To estimate the on-chip memory power of the various sizes of memory units in the decoder, a memory module is designed using VHDL, synthesized using Leonardo, and the power estimates are computed using the Xilinx XPower tool.


(Figure 6.2 shows the tasks, namely data supply, decoder 1, interleaver, decoder 2 and deinterleaver, plotted against the time slots t1 to t9, with the first iteration distinguished from the second to sixth iterations of the iterative turbo decoder.)

Figure 6.2: Tasks in the decoder vs the time slot (task graph)

The various base configurations of the memory units have bit widths 6, 10, 13 and 18 (depth 128). The Xilinx FPGA Virtex device XCV800bg560 is used, with speed grade 6, during the synthesis steps. The design is placed and routed to give the physical constraint file, which is used to estimate the power with the XPower tool for switching activities of 0.2 and 0.5. The output of XPower is a report file that details the power consumption of the design. This power data is then used to determine the total power budget in the power sensitive analysis step. The modules are modeled in VHDL and simulated in isolation using ModelSim.

6.5.1 Experimental results
The computation of the active memory is straightforward: it is the product of the number of active memory modules, the bit width of the memory and the memory depth in the time slot of interest. Examining the cycle accurate HDL model of the design, the time duration of each time slot is obtained in number of cycles. The representative results are for the 3G turbo decoder with frame length N equal to 128. The details of the active memory and the active time are shown in Table 6.1. Column 1 gives the slot names, while column 2 presents the number of active memory modules. The bit width of the active memory is indicated in column 3, with the memory depth in column 4. The total active memory bits, computed as the product of column 2 (active modules), column 3 (bit width) and column 4 (depth), are presented in column 5. Column 6 represents the time of each slot in number of cycles for a test case with frame length N equal to 128; the number of cycles is obtained from the VHDL simulation results by observing the resulting waveforms at the end of each time slot. The last row in the table indicates the total number of memory modules in the design. The memory modules in the decoder 1 are free during the decoder 2 operation; hence the α and γ memory units of the two decoders can be the same. The decoder design consists of the data supply unit, the decoder 1 unit and the decoder 2 unit. The data supply unit has 4 memory units; decoder 1 has 12 memory units and decoder 2 has thirteen memory units, of which 10 are the same as those of decoder 1.
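A small sketch of this bookkeeping follows, using the slot t2 entries of Table 6.1 and the base configuration powers of Table 6.2 (values taken from those tables; the struct and loop are illustrative):

    // Active bits = modules * width * depth; slot power follows equation (6.1)
    // as a sum over the active base memory configurations.
    #include <cstdio>

    struct MemGroup { int modules, width, depth; double base_power_mw; };

    int main() {
        MemGroup t2[] = { {3, 6, 128, 61}, {2, 13, 128, 82},
                          {1, 10, 128, 71}, {8, 18, 128, 94} };
        int bits = 0; double power = 0;
        for (const MemGroup& g : t2) {
            bits  += g.modules * g.width * g.depth;  // column 5 of Table 6.1
            power += g.modules * g.base_power_mw;    // equation (6.1)
        }
        std::printf("t2: %d active bits, %.0f mW\n", bits, power);
        return 0;
    }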

 

 

. The output

Time slot Memory Active Bit width Depth Total bits Cycles t1 3 6 128 2304 128 t2 3 6 128 2304 128 2 13 128 3328 1 10 128 1280 8 18 128 18432 t3 2 13 128 3328 128 1 10 128 1280 8 18 128 18432 t4 and 2 10 128 2560 256 t5 t6 2 6 128 1536 128 2 13 128 3328 1 10 128 1280 8 18 128 18432 t7 2 13 128 3328 128 1 10 128 1280 8 18 128 18432 t8 and 2 10 128 2560 256 t9 Total Memory modules 19 Cycles 1152 Table 6.1: Active memory modules in the decoder total number of memory units is 19 which is indicated in the last row. The total number of cycles for one decoding operation is 1152 cycles as illustrated in the last row. Reducing amount of active circuitry in the design is a means to reduce power consumption. Power depends on the internal logic that switches at any given time slot. Reducing the amount of memory logic switched in the design reduces the current in the design. Table 6.2 gives the results of the memory designs for different bit-widths. Using these base configuration power estimates, the total power dissipation by on-chip memory in each time slot is computed using equation 6.1. The time slot t2 is taken to show the representative calculation.

109

£

  ¥¢

¡ ¢

where

is the power estimate in the time slot t2.

is the power estimate of

¢ ¢ £§ ¥ ¤¥
£

¢ £¦
 

¢ £§ ¥

¢ ¥ £§ ¥ £ ¥
£

¢ £¦
 

¢ ¥ £§ ¥ ¨ ¥
£

¢
¦

 

¢ ¥ £§ ¥
£

  ¢ ¡£¦

 

  §

(6.1)

¡ ¢
§

Base Config Depth 128 Width 18 Depth 128 Width 13 Depth 128 Width 10 Depth 128 Width 6

Area (LUTs) 144 104 80 48

Power mw Activity 94 82 71 61
 

Table 6.2: Power estimates in mW and area in terms of LUTs

base memory with width 6 and depth 128. The term 3 preceding this is the number of active modules with the width 6. Similar discussion holds good for the second, third and the fourth term in the right hand side of the equation.
  ¡     

In table 6.3 it is observed that for

of the total time, as soon as the decoding
 

commences upon arrival of the encoded frame, there is a significant saving of 88 . In this time duration the other memory modules can be in idle duration. For the next 22.2
 

of the time (t2, t3 slot) during the operation of the decoder1 in the first iteration a and 34.9
   

saving of 22.8

is observed.
 

During the interleaving period (time slot t4 and t5) for 22.2 of 91.7
 

in the memory modules is observed.

Time slot t6 and t7 which correspond to the decoder2 essentially the same analysis holds good as that explained for the decoder1 in time slot t2 and t3. Savings in time slot t8 and t9 are similar to that of t4 and t5 as described above. The average active power is 44.5% with the average power saving of 55.5%. Earlier works [83, 92] suggests that the decoder 2 should be shut-down when decoder 1 is operational. Here it is proposed that in the decoder 1 itself, appropriate computation modules and the memory modules can be shut-down to achieve additional power benefit. 110

¡ ¦

 

of the total time a saving

Time slot   Memory config   Cycles   Active     Power   Active power   Power (energy)
                                     period %    mW     (energy) %     saving %
t1          (3,6,128)        128      11.1       183       12.0           88.0
t2          (3,6,128)        128      11.1       183       77.2           22.8
            (2,13,128)                           164
            (1,10,128)                            71
            (8,18,128)                           752
t3          (2,13,128)       128      11.1       164       65.1           34.9
            (1,10,128)                            71
            (8,18,128)                           752
t4 and t5   (2,10,128)       256      22.2       142        9.3           91.7
t6          (2,6,128)        128      11.1       122       73.2           26.8
            (2,13,128)                           164
            (1,10,128)                            71
            (8,18,128)                           752
t7          (2,13,128)       128      11.1       164       65.1           34.9
            (1,10,128)                            71
            (8,18,128)                           752
t8 and t9   (2,10,128)       256      22.2       142        9.3           91.7
Total memory power                              1515
Average                                                    44.5           55.5

Table 6.3: Illustrating power (energy) saving using sleep and wake time analysis in the memory modules of the decoder
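The bottom row of table 6.3 can be reproduced from the active power column, with 1515 mW taken as the power when all 19 memory modules are on. A minimal C sketch, with illustrative names, which recovers the 44.5% and 55.5% figures:

#include <stdio.h>

int main(void) {
    /* per-slot active power as a fraction of the 1515 mW total,
       taken from the "Active power (energy)" column of table 6.3 */
    const double active_pct[] = { 12.0, 77.2, 65.1, 9.3, 73.2, 65.1, 9.3 };
    const int nslots = 7;  /* t1, t2, t3, t4-t5, t6, t7, t8-t9 */

    double sum = 0.0;
    for (int i = 0; i < nslots; i++)
        sum += active_pct[i];

    printf("average active power: %.1f%%\n", sum / nslots);          /* 44.5 */
    printf("average power saving: %.1f%%\n", 100.0 - sum / nslots);  /* 55.5 */
    return 0;
}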

Computational module   Power mW   Power %   Time slot   Power saving %
γ unit                    76         8.9       t2            73.5
α unit                   150        17.6       t2
β unit                   185        21.7       t3            26.5
llr unit                 366        42.0       t3
extr unit                 75         8.8       t3

Table 6.4: Power estimates in the individual decoder, active time slot information and the associated power saving due to sleep and wake time analysis

6.5.2 Power optimization
Let us take the case of only one decoder operation, since the other is identical except for the different inputs received at the decoder modules. From the control flow diagram illustrated in figure 6.1, the time slots for the first decoder are t2 and t3.
The major computational blocks in the decoder are the branch metric γ unit and the forward metric α unit, which are active in time slot t2, and the backward metric β unit, the log likelihood ratio llr unit and the extrinsic computation unit, which are active in time slot t3. Table 6.4 indicates the power estimates for the individual bit optimal computational units. The observations are for a switching activity of 50% at a voltage of 3.3 V, using the Xilinx XPower tool version 2.37, and are recorded from the power estimates performed during the bit width analysis. Column 1 indicates the name of the computational module, with the power estimates for the design units in column 2. Column 3 represents the percentage power of each module when one decoder unit is considered. The data in columns 4 and 5 come from the sleep time and wake time study performed using the control flow graph. Consider the analysis of one decoder in a single iteration. From the data presented in the table it is clear that the β unit, the llr unit and the extrinsic unit are idle during time slot t2, while the γ and the α computational modules are idle during time slots t3 and t4. Applying the guidelines of the power shut-down mechanism, a notable saving of 73.5% in power is obtained during time slot t2 and a 26.5% power saving during time slot t3 of the decoding operation.
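The two saving figures follow directly from the table 6.4 estimates: whatever fraction of the 852 mW one-decoder total is idle in a slot can be shut down. A small sketch of the arithmetic (the identifiers are illustrative, not thesis code):

#include <stdio.h>

int main(void) {
    /* per-unit power estimates in mW, from table 6.4 */
    const double gamma_u = 76, alpha_u = 150, beta_u = 185,
                 llr_u = 366, extr_u = 75;
    const double total = gamma_u + alpha_u + beta_u + llr_u + extr_u; /* 852 */

    /* t2: only gamma and alpha run, the rest can be put to doze */
    const double save_t2 = (beta_u + llr_u + extr_u) / total;  /* 73.5% */
    /* t3: beta, llr and extr run, gamma and alpha doze */
    const double save_t3 = (gamma_u + alpha_u) / total;        /* 26.5% */

    printf("saving in t2 = %.1f%%, in t3 = %.1f%%\n",
           100 * save_t2, 100 * save_t3);
    return 0;
}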
In table 6.5 the details of the design units to be put into doze mode, and of the decoder modules to be in active mode in the same time slot, are presented. D1 and D2 stand for decoder 1 and decoder 2 respectively. The table refers to the 8 state turbo decoder described earlier in figure 2.2 of Chapter 2. The decision unit needs to be active only in the sixth iteration. The llrinterleave2 module is shown in doze mode for all the time slots t1-t9 because the data presented are for the first 5 iterations and not for the last iteration, in which this module will be in active mode. From table 6.5 it is observed that llrinterleave2 has the largest doze time when the complete decoding process of six iterations is considered. The memory module read/write pattern in the turbo decoder determines the reusability of the memory modules, and the mutual exclusiveness of the computational units in the decoding process is what makes the power shut-down technique useful. The γ and α metric memory modules have short doze times, since the same memory modules are reused during the decoder 2 stage. This is acceptable because the data written in the decoder 1 phase are consumed within the decoder 1 phase itself; this intermediate data is no longer required during decoder 2 operation. The turbo decoder presents a deterministic pattern in the usage of most of the memory modules. Observe the set of γ memory modules and the set of α memory modules and the associated active control signals in table 6.5: a memory read/write time slot is immediately followed by the corresponding memory read time slot. This indicates that the data retention problem does not occur, and hence no data retention control circuitry is needed for these memory modules. In figure 6.3 the block schematic of the turbo decoder power manager is depicted. The numbers in the small blocks are those of the design modules listed in table 6.5. The power manager block on the left has an awakening time slot alarm unit, which is responsible for signalling when a module should be put into active mode, ahead of its active start time.

 

(Time slots: t1 data supply; t2, t3 decoder D1; t4-t5 interleave units; t6, t7 decoder D2; t8-t9 deinterleave unit.)

No.  Module             t1   t2   t3  t4-t5  t6   t7  t8-t9
1    data supply mem     1    1    0    0     0    0    0
2    data supply mem     1    1    0    0     0    0    0
3    data supply mem     1    0    0    0     1    0    0
4    data supply mem     0    1    0    0     1    0    0
5    z2mem               0    1    0    0     0    0    1
6    γmem1               0    1    1    0     1    1    0
7    γmem2               0    1    1    0     1    1    0
8    αmem000             0    1    1    0     1    1    0
9    αmem001             0    1    1    0     1    1    0
10   αmem010             0    1    1    0     1    1    0
11   αmem011             0    1    1    0     1    1    0
12   αmem100             0    1    1    0     1    1    0
13   αmem101             0    1    1    0     1    1    0
14   αmem110             0    1    1    0     1    1    0
15   αmem111             0    1    1    0     1    1    0
16   extr1mem            0    0    1    1     0    0    0
17   z1mem               0    0    0    1     1    0    0
18   llrmem              0    0    0    0     0    0    0
19   extr2mem            0    0    0    0     0    1    1
20   γ unit1             0    1    0    0     0    0    0
21   α unit1             0    1    0    0     0    0    0
22   β unit1             0    0    1    0     0    0    0
23   LLRunit1            0    0    1    0     0    0    0
24   extrunit1           0    0    1    0     0    0    0
25   extrinterleave1     0    0    0    1     0    0    0
26   ysinterleave        0    1    0    0     0    0    0
27   llrinterleave2      0    0    0    0     0    0    0
28   γ unit2             0    0    0    0     1    0    0
29   α unit2             0    0    0    0     1    0    0
30   β unit2             0    0    0    0     0    1    0
31   llrunit2            0    0    0    0     0    1    0
32   extrunit2           0    0    0    0     0    1    0
33   extrinterleave2     0    0    0    0     0    0    1

Table 6.5: Database to design the power manager for the 8 state turbo decoder (1 = Active, 0 = Doze)

[Figure: design of the turbo decoder power manager unit. The power manager, consisting of a time slot controller, an idle mode setter unit, an awakening time slot alarm unit and a power on/off controller, drives control signals to the 33 turbo decoder design modules numbered as in table 6.5. The shaded boxes indicate the power on memory and computational modules that are in active mode during time slot t2 (table 6.5).]

Figure 6.3: Block diagram schematic of the power manager unit


                without            with               savings
                simplification     simplification
memory power    656 mW             164 mW             4 times
area            624 LUTs           208 LUTs           3 times

Table 6.6: Power and area saving due to the branch metric simplification

The idle mode setter unit gives the time slot information to the turbo decoder time slot controller, based on the passive mode start time of each module. The power on/off controller receives the control signals from the time slot controller and delivers them to the modules of the turbo decoder processor unit. The shaded blocks in the diagram indicate the blocks which are in active mode in time slot t2 of the decoding process. The general design guidelines for a power shut-down mechanism are thus applied in the turbo decoder power manager.
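The control flow of the power manager can be sketched in software as a table-driven loop: one activity mask per time slot (the table 6.5 database), compared at every slot boundary to derive the wake and doze signals. The C fragment below is a minimal model reduced to the five computational units of table 6.4; all identifiers are assumptions, and the thesis realizes this unit in hardware.

#include <stdint.h>
#include <stdio.h>

enum { GAMMA = 1 << 0, ALPHA = 1 << 1, BETA = 1 << 2,
       LLR = 1 << 3, EXTR = 1 << 4 };

/* module activity per slot, a two-slot excerpt of table 6.5 */
static const uint32_t activity[] = {
    GAMMA | ALPHA,      /* t2: forward pass units on  */
    BETA | LLR | EXTR,  /* t3: backward pass units on */
};

int main(void) {
    uint32_t enables = 0;  /* all modules initially dozing */
    for (int s = 0; s < 2; s++) {
        uint32_t wake = activity[s] & ~enables;  /* awakening alarm unit */
        uint32_t doze = enables & ~activity[s];  /* idle mode setter     */
        enables = activity[s];                   /* on/off controller    */
        printf("slot t%d: wake=0x%02x doze=0x%02x\n", s + 2, wake, doze);
    }
    return 0;
}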

6.5.3 Branch metric simplification
 

The algorithmic simplification of the branch metric γ operation is illustrated in [62, 61]. Using this simplification, instead of eight γ data elements only two are sufficient, because of the symmetric property of the branch metric. Hence the number of γ memory storage units also reduces. The power and area estimates with and without the branch metric simplification are given in table 6.6. The power and area estimates of one γ memory unit of width 13 and depth 128 are indicated in table 6.2 as 82 mW and 104 LUTs respectively.
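A sketch of the symmetry being exploited, assuming the usual log-MAP branch metric of a rate 1/2 RSC constituent code, gamma(u,p) = (u (Lc ys + La) + p Lc yp)/2 with u, p in {-1,+1}; since gamma(-u,-p) = -gamma(u,p), only the two values computed below need to be stored per trellis step. The exact formulation in [62, 61] may differ in scaling, and the code is illustrative only:

#include <stdio.h>

/* gamma(u,p) = (u*(Lc*ys + La) + p*Lc*yp)/2, u,p in {-1,+1} */
static void gamma_pair(double Lc, double ys, double yp, double La,
                       double *g0, double *g1) {
    double s = Lc * ys + La;  /* systematic plus a-priori term */
    double p = Lc * yp;       /* parity term                   */
    *g0 = 0.5 * (s + p);      /* branch (u=+1, p=+1)           */
    *g1 = 0.5 * (s - p);      /* branch (u=+1, p=-1)           */
}

int main(void) {
    double g0, g1;
    gamma_pair(2.0, 0.8, -0.3, 0.1, &g0, &g1);
    /* the remaining branch labels follow by sign symmetry:
       gamma(-1,-1) = -g0 and gamma(-1,+1) = -g1 */
    printf("g0=%.2f g1=%.2f -g0=%.2f -g1=%.2f\n", g0, g1, -g0, -g1);
    return 0;
}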

6.6 Summary
A detailed analysis of the application of the power shut-down technique to the turbo decoder module has been given. VHDL simulation of the individual computational and on-chip memory units is used to compute the power with the XPower tool. The contribution is an analysis of the turbo decoder exploiting the mutual exclusiveness of the individual decoder modules, which reveals a substantial power saving. The power savings in the various time slots of the decoding process range from 9% to 91%. The average active power in time slots t1 to t9 is 44.5%, with an average power saving of 55.5%. Although the branch metric simplification was presented by earlier researchers, no study of the resulting savings in power and area was available; the power and area savings are quantified here. The next chapter gives an overview of the low power application specific design methodology based on the work reported in this thesis.


Chapter 7 Conclusion
During the system design phase for a given application, it is necessary to characterize the application specifications and correlate their impact on the power consumption and area of the design. In this chapter the application specific design methodology is described and a summary of the results reported in this thesis is presented. Future directions are given at the end of the chapter.

7.1 Our design methodology
A framework for embedded system design is an important enabler for taking applications from the algorithmic to the silicon stage. It is observed that the design space in an application specific methodology has an additional dimension: the application performance parameters have an impact on the power and area of the design. The various parameters of the application have to be analyzed and their impact studied during architectural and system level power optimization. In our approach, memory intensive and data dominated applications are of interest. The methodology is illustrated for the turbo decoder application from the wireless communication domain, but it can be used for other applications as well. Our design methodology allows for a smooth transition between the algorithmic and the synthesis levels. With the introduction of architecture exploration using SystemC,

the gap between the system level architecture modeling and the behavioral level architecture definition is bridged. This results in a smooth transition of the design from the system level to the HDL level. One specific concern in application specific design is to have a common test bench that can be used across the different abstraction levels. Since memory forms a major bottleneck in data dominated applications, more so in the context of low power solutions, custom memory options should be studied. A quick system level exploration following the data access pattern analysis assists in developing the HDL model of the design. The application specific design methodology is depicted in figure 7.1. The low power application specific methodology for the turbo decoder requires the BER (the application constraint) to be within desired limits. The two distinct implementations targeted in this methodology are:

Software (SW) Solution
- Algorithmic level functional analysis.
- System level on-chip memory study.
- Examining data exchange between the computational modules (for HW optimization).

Hardware (HW) Solution
- Architecture definition and verification at system level using SystemC.
- Modeling with VHDL.
- Bit width optimization to reduce power and area.
- Power shut-down technique to achieve power savings.

The methodology is based on fast characterization of the application and on functionality verification using test benches valid across the different abstraction levels. The performance is measured as BER for varying parameters like frame length, number of iterations, interleavers and quantization.

[Figure: design flow from the application constraint (BER) and design objective (power), through software modeling (data access pattern analysis, on-chip memory type, custom memory options, cache + SPM study, application simulator) and hardware modeling (SystemC and VHDL models, simulation and verification, bit-width analysis, branch metric γ simplification, system level power optimization, power manager), yielding data relating bit-error rate to frame length and quantization, area and power reduction, and the estimated power saving.]

Figure 7.1: Turbo Decoder : Low power application specific design methodology

The goal is to represent the design at the algorithmic level and test the functionality of the application. During application SW modeling, floating point and fixed point C models are developed to study the performance issues. The BER data for varying frame length, number of iterations and quantization are obtained using the application simulator. The test bench developed for the C domain SW simulations is also used at all the other abstraction levels. The test bench modeling task is to define and generate the test inputs, which represent the message input data sets for the system level architecture and HW design exploration stages for functional verification. The intermediate results obtained from the C simulation form a ready reference data set for functionality verification during the design phases. During software modeling, the energy benefit of customizing memory is studied. A motivational study using the encc framework on the DSPstone benchmarks is carried out, and the design space comprising a cache with the addition of a small sized SPM is explored for the turbo decoder. A methodology based on the sensitivity of the number of on-chip memory accesses is illustrated for selecting suitable cache with SPM configurations. The ARM7TDMI is used as the target architecture. The software model forms the base to study the application data access pattern; this assists in defining a suitable on-chip memory for the application hardware modeling. By observing the regular data access pattern, it is possible to indicate the suitability of regular structures like FIFO/LIFO for the on-chip memory. The same test bench is used for functional verification during SystemC modeling. Design refinements are performed at this abstraction level to model the internal modules, reducing the energy consumption in the memory units. The specification of the application model is then ready to be used to develop the VHDL model. The VHDL model is used during the architectural optimization, where the application performance in terms of BER is monitored for varying frame lengths. The data relating BER, frame length and quantization derived from SW modeling forms the reference for the

architectural evaluator. The area and power savings are estimated using bit-width analysis. The design is synthesized using the Leonardo synthesis tool set, and Xilinx XPower is used to observe the power consumption. The system level power optimization using the power shut-down mechanism is based on the node activity in the various time slots of the decoding process; this also forms the basis for designing the power manager unit for the turbo decoder. The design methodology is illustrated for the turbo code algorithm from wireless communication. Turbo codes have an outstanding error correcting capability; a frame length of 40 to 5114 is defined in the 3GPP turbo decoder standard.

7.2 Results summary
The main objective of the investigation reported in this thesis was to develop an efficient application specific design methodology concentrating on the memory issues, coupled with low power design techniques. The turbo code system, which is set to be the de facto standard in 3G communication systems, has been selected as the specific application in this thesis. The following are the key results of our investigations.

From the application software model the application is characterized. The choice of the type of encoders has been reviewed. Since the decoding unit has a number of interleaving and deinterleaving tasks, a properly designed, less complex interleaver is an attractive option, because the choice of interleaver affects the performance of the turbo codes. As an experimental result, it is demonstrated that symmetric interleavers are less complex with no performance degradation, and the same interleaver can be used for deinterleaving as well. The performance improves in the first five to six iterations; subsequent iterations give diminishing returns, so the number of iterations is fixed at six. Increased frame length also improves the performance. The 3GPP turbo decoder outperforms the four state decoder by almost one order of magnitude for a 1024 bit interleaver.

The area and energy models of the SPM have been developed and can be used for different configuration sizes. The results indicate that the energy efficient SPM is an alternative to traditional on-chip caches. For the DSPstone benchmarks it is observed that the memory energy consumption can be reduced on an average by 40% using scratch pad memory. The impact of systems including both a cache and an SPM is also investigated in detail for the turbo decoder. As the frame length N increases from N = 128 to N = 256 the BER improves and the on-chip memory energy consumption increases by 40.5%. By adding a small SPM (512 bytes) to a 1 Kbyte cache, the memory energy consumption reduces by 51.33% with a minimum area penalty of 1.35 times. The exhaustive design space exploration over 42 configurations of cache with SPM indicates that a cache of 2 Kbytes or 4 Kbytes with an 8 Kbyte SPM is the optimal choice.

Turbo codes demonstrate a regular interaction pattern between the computational units during the iterative decoding, so turbo decoder designs can use FIFO/LIFO based on-chip memory. A design modification is proposed, and its functional correctness checked during SystemC modeling, to reduce the number of accesses, giving a 33.24% memory energy reduction. To study the area impact of these architectural changes, a synthesizable VHDL model of the turbo decoder was created. The synthesis revealed that the log likelihood computation unit occupies the largest area among the computational units.
The 32 bit design is compared with the bit optimal design during architectural optimization. The power saving for one decoder unit is 35.6%, with an area saving of 46%.

The benefit of the power shut-down technique in obtaining significant power savings during the iterative decoding is illustrated. The timing diagram is derived from the cycle accurate HDL model and presented for the storage memory of the bit width optimal architecture. The power saving using this technique in the decoder's memory modules is 22.8% and 34.9% (time slots t2 and t3). An ideal implementation of power shut-down will result in an average power saving of 55.5%. The design of a power manager unit for the decoder is proposed.

7.3 Future directions
A low power design methodology for the turbo decoder has been presented. Some of the issues which merit further exploration are:

An integrated on-chip memory exploration of cache coupled with static or dynamic scratch pad memory would be interesting. Apart from access energy and area models for DRAMs, this would also have to incorporate critical path delay estimates of the design, because the assumption that varying the cache or SRAM size does not impact the cycle time is less valid for today's DRAMs.

This work can be further enhanced by developing the concept of turbo equalization and integrating it with the turbo decoding structure, to study the impact on performance and implementation parameters.

A number of standards are in use worldwide for the implementation of turbo code structures. It would be advantageous to collect the functional and architectural parameter variations across these standards and come out with a relevant comparison of the different implementations.


Bibliography
[1] A. P. Chandrakasan, S. Shang and R. Broderson, “Low Power CMOS Digital Design,” IEEE Journal of Solid State Circuits, vol. 27, no. 4, April 1992, pp. 478-483. [2] Jan Rabaey, Digital Integrated Circuits - A design perspective, Eastern Economy Edition, Prentice Hall India, 1997. [3] T. Sato, M. Nagamatsu and H. Tago, “Power and Performance Simulator: ESP and its Applications for 100 MIPS/W Class RISC Design,” IEEE Symposium on Low Power Electronics, San Diego, CA, USA, October 1994, pp. 46-47 [4] C. Piguet et al., “Low Power Design of 8-b Embedded CoolRISC Microcontroller Cores,” IEEE Journal of Solid-State Circuits, vol. 32, no. 7, July 1997, pp. 10671078. [5] Naehjuck Chang, Kuranho Kim and H. G. Lee, “Cycle accurate Energy consumption measurement and analysis:Case study ARM7 TDM1,” International Symposium on Low Power Electronics and Design, Italy, July 2000, pp. 185-190. [6] H. Mehta, R. M. Ovens and Mary Jane Irwin, “Energy Characterization based on Clustering,” 33rd Design Automation Conference, Las Vegas, Nevada, June 1996, pp.702-707. [7] Kanad Ghosh, “Reducing energy requirement for Instruction Issue and dispatch in superscalar Microprocessors,” International Symposium on Low Power Electronics and Design, July 2000, pp.231-234. 127

[8] R. Gonzalzez and M. Horowitz, “Energy dissipation in general purpose microprocessors,” IEEE Journal of Solid-State Circuits, vol.31, no.9, Sep. 1996, pp.1277-1284. [9] D. Somashekhar and K. Roy, “Differential current switch logic: A low power DCVS logic family,” IEEE Journal of Solid-State Circuits, vol.31, no.7, July 1996, pp.981991. [10] T. Indermaur and M. Horowitz, “Evaluation of charge recovery circuits and adiabatic switching for low power CMOS design,” Proceedings of the 1994 Symposium on Low-Power Electronics: Digest of Technical Papers, Oct 1994, pp. 102-103. [11] H. Mehta, R.M. Owens, M. J. Irvin, R. Chen and D. Ghosh, “Techniques for low energy software,” International Symposium on Low Power Electronics and Design, Aug. 1997, pp. 72-75. [12] H. Kojima, D.Gomy, K. Nitta, and K. Sasaki, “Power Analysis of a Programmable DSP for Architecture/Program Optimization,” In IEEE Symposium on Low Power Electronics, Digest of Tech. Papers, Oct. 1995, pp. 26-27. [13] M. Kamble and K. Ghose, “Analytical Energy Dissipation Models for Low Power Caches,” Proceedings of the International symposium on Low Power Electronics and Design, Aug. 1997, pp.143-148. [14] C. Y. Tsui, J. Monterio, M. Pedram, S. Devadas and A. M. Despai, “Power estimation in Sequential logic circuits,” IEEE Transaction on Low Power Systems, vol. 3, no. 3, Sep. 1995, pp.404-416. [15] D. L. Liu and C. Svensson, “Power Consumption Estimation in CMOS VLSI Chips,” IEEE Journal of Solid State Circuits, vol. 29, no. 6, June 1994, pp.663-670. [16] Q. Wu, C. Ding, C. Hsieh and M. Pedram, “Statistical Design of Macro-models For RT-Level Power Evaluation,” Proceedings of the Asia and South Pacific Design Automation Conference, Jan. 1997, pp. 523-528. 128

[17] P. E. Landman and J. M. Rabaey, “Activity-sensitive architectural power analysis,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, June 1996, pp. 571-587. [18] M. Nemani and F. Najm, “Towards a High Level Power Estimation capability,” IEEE Transaction on CAD, vol. 15, no. 6, June 1996, pp. 588-598. [19] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, “Optimizing Power using Transformations,” IEEE Transaction on CAD, vol. 14, no. 1, Jan. 1995, pp.12-31. [20] A. Raghunathan and N. K. Jha, “Behavioral synthesis for Low Power,” International Conference on Computer Design, Cambridge MA, Oct. 1994, pp. 318-322 [21] N. Kumar, S. Katkoori, L. Rader and R. Vemuri, “Profile-Driven Behavioral synthesis for Low Power VLSI systems,” IEEE Design and Test of Computers, vol. 12, no. 3, Sep. 1995, pp.70-84. [22] M. R. Stan and Wayne P. Burleson, “Bus-invert coding for Low-Power I/O,” IEEE Transactions on VLSI Systems, vol. 3, no. 1, March 1995, pp.49-58. [23] Yanbing Li and Jorg Henkel, “A Framework for Estimation and Minimizing Energy Dissipation of Embedded HW/SW Systems,” 35th Conference on Design Automation, Moscone center, San Francico, California, USA, June 1998, pp.188-193. [24] Lisa Guerra, Miodrag Potkonjak and Jan M. Rabaey, “A Methodology for Guided Behavioral-Level Optimization,” 35th Conference on Design Automation, Moscone center, San Francico, California, USA, June 1998, pp.309-314. [25] Pieter van der Wolf and Wido Kruijtzer, “System level design of Embedded media systems,” Proceedings of 15th International Workshop on VLSI Design 2000, Tutorial, New Delhi, Aug 2000. 129

[26] Tony Givargis and Frank Vahid, “Interface Exploration for Reduced Power in CoreBased Systems,” Proceedings of International Symposium on System Synthesis Dec. 1998, pp.117-124.

[27] L. Benini, A. Bogliolo, G.A. Paleologo, and G. De Micheli, “Policy Optimization for Dynamic Power management,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 6, June 1999, pp.813-833.

[28] R Mehra, J Rabaey, “Behavioral Level power estimation and Exploration,” Proceedings of the International Workshop on Low-Power Design, Apr. 1994, pp.197-202.

[29] D. Lidsky, J. Rabaey, “Early Power exploration:A world wide web application,” Proceedings of Design Automation Conference 1996, Las Vegas, June 1996, pp. 27-32.

[30] Rajeshwari M Banakar, Ranjan Bose and M Balakrishnan, “Low power design - Abstraction levels and RTL design techniques,” VLSI test and design Workshop, VDAT 2001 Bangalore, August 2001, pp.70-74.

[31] Klaus-Jurgen Koch, Bernd Reinkemeier and Stefan Thiel, “A High level design methodology for digital application specific hardware DSP software HLL tools,” International Conference on Signal Processing Applications and Technology, ICSPAT ’ 96, September 1996.

[32] N. Zervas, D. Soudris, C.E. Goutis and A. Thanailakis, “Low Power Methodology for Transformations of Wireless Communications Algorithms,” A Report : LPDG/WP2/DUTH/D2.2R1, EP25256/DUTH/D2.2R1, Sep. 2000, pp. 1-51.

[33] Milind B. Kamble and Kanad Ghose, “Analytical energy dissipation models for Low Power Caches,” International Symposium on Low Power Electronics and Design, ISPLED-97, Aug. 1997, pp. 143-148. 130

[34] Vasily G Moshnyaga and H.Tsuj, “Reducing Cache Energy Dissipation through DualVoltage Supply,” IEICE Transactions on Fundamentals in Electronics, Communications and Computer Sciences, vol.E84-A, no.11, Jan. 2001, pp.2762-2768. [35] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele and A. Vandecappelle, Custom Memory Management Methodology – Exploration of Memory Organization for Embedded Multimedia System Design, Kluwer Academic Publishers., Boston, 1998. [36] F. Catthoor, S. Wuytack et al., Custom Memory Management Methodology, Kluwer Academic Publishers, Boston, 1998. [37] C. Berrou, A. Galvieux and P. Thitimajshima, “Near Shannon Limit Error-Correcting Coding and Decoding: Turbo Codes,” Proceedings ICC 93, Geneva Switzerland, May 1993, pp. 1064-1070. [38] M. Valenti, “Iterative Detection on Decoding for Wireless Communications,” PhD dissertation, Virginia Polytechnic Institute and State University, July 1999. [39] D. Divsalar and R. J. McEleice, “Effective free distance of turbo codes,” Electronics Letters, vol. 32, no. 5, Feb. 1996. [40] C. Berrou and A. Glavieux, “Near Optimum Error Correcting Coding and Decoding: Turbo Codes,” IEEE Transactions on Communications, vol. 44, no. 10, Oct. 1996, pp.1261-1271. [41] Mark S. C. Ho and Steven S. Pietrobon, “A variance mismatch study for serial concatenated turbo codes,” IEEE International Symposium on Turbo Codes and Related Topics, Sep. 2000, pp. 483-485. [42] Mark S. C. Ho, Steven S. Pietrobon, Tim Giles, “Interleavers for punctured turbo codes,” Proceedings of APCC/ICCS Conference, Singapore, November 1998. [43] www.ee.vt.edu/ee/valenti/turbo.html 131

[44] www.e-insite.net/esec/Article82203.htm [45] www.chipcenter.com/dsp/DSP000510F1.html [46] www331.jpl.nasa.gov/public/JPLtcodes.html [47] D. Divsalar and F. Pollara, “Turbo codes for deep-space communications,” TDA Progress Report 42-126, Jet propulsion Laboratory, California, Feb. 1995, pp. 2939. [48] K. Andrews, V. Stanton, S. Dolinar, V. Chen, J. Berner, and F. Pollara, “TurboDecoder Implementation for the Deep Space Network,” IPN Progress Report 42-14, Feb. 2002. [49] www.csee.wvu.edu/ reynolds/ee562/TurboPreRelease.pdf [50] expo.jspargo.com/milcom02/21st.asp [51] B. A. Banister, B. Belzer, and T. R. Fischer, “Robust Image Transmission using JPEG2000 and Turbo Codes,” IEEE Signal Processing Letters, vol. 9, April 2002, pp.117-119. [52] O. Aitsab, R. Pyndiah and B. Solaiman, “Iterative source/channel decoding using block turbo codes and VQ for image transmission,” Proceedings of IEEE International Symposium on turbo codes and related topics, Brest, Sep. 1997, pp. 170-173. [53] Peng, Yih-Fang Huang and Daniel J. Costello, Jr., “Turbo Codes for Image Transmission-A Joint Channel and Source Decoding Approach,” IEEE Journal of Selected Topics in Communications, vol. 18, no. 6, June 2000, pp.868-879. [54] http://www.bbwexchange.com/news/2002/mar/vocal032502.htm [55] http://www.chipcenter.com/pld/pldf092.htm [56] Juan Alberto Torres, “Turbo codes for xDSL modems,” xDSL Workshop, (Vocal Technologies Ltd.) Vienna, September 2000. 132

[57] Perez L. C., Seghers J. and Costello D. J. Jr, “A Distance Spectrum Interpretation of Turbo Codes,” IEEE Trans Information Theory, vol.42, no.6, Nov. 1996, pp. 16981709. [58] S. Hamidreza Jamali and Tho Le-Ngoc, Coded modulation techniques for fading channels, Kluwer Academic publishers, Boston, pp. 74-95, April 1994. [59] Benedetto S. Montorsi G., Divsalar D. and Pollara F., “Soft-Output Decoding Algorithms in Iterative Decoding of Turbo Codes,” TDA Progress Report 42-124, Jet Propulsion Laboratory, Pasadena, CA, Feb. 1996, pp. 63-87. [60] Divsalar D. and Pollara F., “Hybrid Concatenated Codes and Iterative Decoding,” TDA Progress Report 42-130, JPL, Pasadena, CA, Aug. 1997, pp. 1-23. [61] W. Bruce Puckett, “Implementation and Performance of an improved Turbo Decoder on a Configurable Computing Machine,” MS Thesis, Virginia Polytechnic Institute and State University, July 2000. [62] Jelena Nikolic-Popovic, “Implementing a MAP Decoder for cdma2000 Turbo codes on a TMS320C62x DSP device,” Application Report, Texas Instruments, SPRA629, May 2000. [63] Yufei Wu and Brian D. Woerner, “Internal data width in SISO decoding module with modular renormalization,” Proceedings of IEEE Vehicular Technology Conference, Tokyo, Japan, May 2000. [64] Hughes Network Systems, “Decoder complexity of 8-state turbo decoders vs. 4-state serial concatenated codes,” TSG(99)068, Yokohama, Feb. 1999. [65] Yufei Wu and Brian D. Woerner, “The influence of quantization and fixed point arithmetic upon BER performance of turbo codes,” Proceedings of IEEE Vehicular Technology Conference Houston, TX, May 1999. 133

[66] Lode Nachtergacle, Francky Cathoor and Chidamber Kulkarni, “Random Access Data Storage Components in Customized Architectures,” IEEE Design and Test of Computers, May-June 2001, pp. 40-54. [67] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quantitative Approach, Morgan Kaufman, San Francisco, CA, 1994. [68] V. Zivojnovic, J. Velarde, and C. Schlager, “DSPStone : A DSP-oriented benchmarking methodology,” Proceedings of the 5th International Conference on Signal Processing Applications and Technology, October 1994. [69] ls12-www.cs.uni-dortmund.de/research/encc [70] Preeti Ranjan Panda, Nikhil Dutt and Alexandru Nicolau, Memory issues in embedded systems on-chip - Optimization’s and exploration, Kluwer Academic Publishers, 1999. [71] T. Ishihara and H. Yasuura, “A power reduction technique with object code merging for application specific embedded processors,” Proceedings of Design Automation and Testing, Europe Conference (DATE 2000), March 2000. [72] Rajeshwari Banakar, S. Steinke, B. S. Lee, M. Balakrishnan and P. Marwedel, “Comparison of cache and scratch pad based memory system with respect to performance, area and energy consumption,” Technical Report 762, University of Dortmund, Sep. 2001. [73] Stefan Steinke, Christoph Zobiegala, Lars Wehmeyer and Peter Marwedel, “Moving program objects to scratch-pad memory for energy reduction,” Technical report number 756, University of Dortmund, Apr. 2001. [74] Luca Benini, Alberto Macii, Enrico Macii and Massino Poncino, “Synthesis of application specific memories for power optimization in embedded systems,” Design Automation Conference DAC-2000, Los Angeles, California, June 2000, pp. 300-303. 134

[75] S. J. E. Wilton and N. P. Jouppi, “CACTI : An enhanced access and cycle time model,” IEEE Journal of Solid-State Circuits, vol. 31 no. 5, May 1996, pp. 677-688. [76] S. Steinke, M. Knauer, L. Wehmeyer and P. Marwedel, “An Accurate and Fine Grain Instruction-Level Energy Model Supporting Software Optimizations,” Proceedings of 11th International Workshop of Power And Timing Modeling, Optimization and Simulation (PATMOS), Sep. 2001. [77] Steve Furber, ARM - System-on.chip architecture, Second edition, Addison-Wesley Publications, 2000. [78] AT91M40400 processor, www.atmel.com, ATMEL Corporation. [79] ARM Processors, www.arm.com, Advanced RISC Machines Ltd. [80] G. Masera, G. Piccinini, M. Ruo Roch, and M. Zamboni, “VLSI architectures for turbo codes,” IEEE Transactions on VLSI Systems, vol. 7, no.3, Sep. 1999, pp. 369378. [81] J. Dielissen, J. Van Meerbergen, Marco Bekooiji, Francoise Harmsze, Sergej Sawitzki, Jos Huisen and Albert van der Werf, “Power-efficient layered Turbo Decoder processor,” In the proceedings of Design Automation and Test Europe Conference DATE 2001, Munich, Germany, March 2001. [82] Z. Wang, H. Suzuki, and K. K. Parhi, “VLSI implementation issues of turbo decoder design for wireless applications,” Proceedings of 1999 IEEE Workshop on Signal Processing Systems: Design and Implementation, Taipei, Oct. 1999, pp. 503-512. [83] Jia Fei, “On a turbo decoder design for low power design,” MS Dissertation, Virginia Polytechnic Institute and State University, July 2000. [84] Emmanuel Boutillon, Warren J. Gross and Glenn Gulak, “VLSI Architectures for the MAP Algorithm,” IEEE Transactions on Communications, vol. 51, no. 2, Feb. 2003, pp. 275-185. 135

[85] M. C. Valenti and J. Sun, “The UMTS Turbo Code and an Efficient Decoder Implementation Suitable for Software-Defined Radios,” International Journal of Wireless Information Networks, vol. 8, no. 4, Oct. 2001, pp.203-215. [86] David Garrett, Bing Xu and Chris Nicol, “Energy Efficient Turbo Decoding for 3G Mobile,” Proceedings of the International Conference on Low Power Electronics and Design, ISPLED, 2001, pp.328-333. [87] J. Vogt, J. Ertel, and A. Finger, “Reducing bit width of extrinsic memory in turbo decoder realizations,” Electronic Letters, vol.36, no. 20, Sep. 2000, pp.1714-1716. [88] Engling Yeo, Payam Pakzad, Borivoje Nikolic and Venkat Ananthraman, “VLSI Architecture for Iterative Decoders in Magnetic Recording Channels,” IEEE Transactions on Magnetics, vol.37, no. 2, March 2001, pp.748-755. [89] Curt Schurgers, Francky Cathoor and Marc Engels, “Memory optimization of MAP Turbo Decoder Algorithms,” IEEE Transactions on VLSI Systems, vol.9, no.2, Apr. 2001, pp. 305-312. [90] Stefan Steinke, Nils Grunwald, Lars Wehmeyer, Rajeshwari Banakar, M Balakrishnan and Peter Marwedel, “Reducing Energy Consumption by Dynamic Copying of Instructions onto On-chip Memory,” Proceedings of ISSS 2002, Kyoto Japan, Oct 2002, pp 213-218. [91] Rajeshwari Banakar, Stefan Steinke, M. Balakrishnan and Peter Marwedel, “Scratch Pad Memory - A Design Alternative for Cache On-chip memory in Embedded Systems,” In the Proceedings of Tenth International symposium on Hardware/Software Codesign CODES 2002, Colorado, USA, May 2002,pp.73-78. [92] A. Worm, H. Michel and N. Wehn, “Power minimization by optimizing data transfers in Turbo-decoders,” Kleinheubacher Tagung Program, Montag 27, September 1999. 136

[93] A. P. Chandrakasan, S. Shang and R. Broderson, “Low Power CMOS digital design,” IEEE Solid State Circuits, vol. 27, no. 4, Apr. 1992, pp. 478-483. [94] M. Horowitz, T. Indermaur and R. Gonzalez, “Low Power digital design,” Proceedings of the IEEE Symposium on Low Power electronics, San Diego, Oct. 1994, pp.811. [95] Farid N. Najm, “A survey of Power estimation techniques in VLSI circuits,” IEEE Transactions on very large scale integration system, vol. 2, no. 4, Dec. 1994, pp. 447-455. [96] Kin J., Chunho Lee, Mangonie Smith and W. H. Potkonjak, “Power efficient mediaprocessors - Design Space exploration,” Proceedings of the 36th Design Automation Conference 1999, June 1999, pp. 321-326. [97] James G. Harrison, “Implementation of a 3GPP Turbo Decoder on a Programmable DSP Core,” Communications Design Conference, San Jose, California, Oct. 2001. [98] www.epic.com/products.html [99] www.anagraminc.com [100] www.apic.com/AMPS [101] simplex.com/ng/products/tnl [102] www.mentorg.com [103] powereda.com/wwinfo.html [104] synopsis.com/products/power [105] V. Tiwari, Sharad Malik and A. Wolfe and M. Lee, “Instruction Level Power Analysis and Optimization of Software,” Journal of VLSI Signal Processing Systems, Aug. 1996, pp. 1-18. 137

[106] Vishal Dalal, “Software Optimization in an Embedded System,” MTech Thesis, VDTT, IIT Delhi, Dec. 1999. [107] Vivek Tiwari, Sharad Malik and Andrew Wolfe, “Power analysis of embedded software power minimisation,” Proceedings of the International Conference on computer aided design, Nov. 1994, pp. 384-390. [108] Akshay Sama, M. Balakrishnan and J. F. M. Theeuwen, “Speeding up of Power estimation of Embedded Software,” International Symposium on Low power Electronics and design, Italy, July 2000. [109] Paul E. Landman and Jan Rabaey, “Power Estimation for High level synthesis,” Proceedings of EDAC-EUCO ASIC’93, Paris, France, Feb. 1993, pp. 361-366. [110] Paul E. Landman and Jan Rabaey, “Black Box Capacitance Model for Architectural Power Analysis,” Proceedings of 1994 International workshop on Low Power design, Napa, CA, Apr. 1994, pp.165-170. [111] Virtex Power Estimator, http://support.xilinx.com/cgi-bin/powerweb.pl. [112] V. K. Prasanna Kumar and Y. Tsai, “On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication,” Proceedings of IEEE Transactions on Computers, vol.40, no.6, June 1991, pp.770-774. [113] S. Choi, J. Jang, S. Mohanty and V. K. Prassanna, “Domain-Specific Modeling for Rapid System-Wide Energy Estimation of Reconfigurable Architectures,” Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), 2002. [114] G. Stitt, B. Grattan, J. Villarreal and F. Vahid, “Using On-Chip Configurable Logic to Reduce Embedded System Software Energy,” Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, Napa Valley, Apr. 2002, pp.143151. 138

[115] Vivek Tiwari, S. Malik and P. Ashar, “Guarded Evaluation : Pushing Power management in logic synthesis design,” International symposium on Low power design,Apr. 1995, pp.221-225. [116] J. Montefro, S. Devadas, P. Ashar and A. Mauskar, “Scheduling techniques to enable power management,” DAC-33 : ACM/IEEE Design Automation conference, June 1996, pp. 349-352. [117] Amit Sinha and Anantha Chandrakasan, “Dynamic Power Management in Wireless Sensor Networks,” Special issue on Wireless Power Management of IEEE design and test of Computers, March-April 2001. [118] Pentium Processor Power distribution, http://www.intel.com/design/pentiumiii/ designgd/24533505.pdf [119] Rudenko A., Reiber P., Popek G. J. and Kuenning G. H., “Saving portable computer battery power through remote process execution,” ACM Mobile computing and communication Review (MC2R), vol.2, no. 1, 1998 [120] http://www.lowpower.de/charter/designguide3.php [121] Anand Raghunathan, Niraj K. Jha and Sujit Dey, High level Power analysis and optimization, Kluwer Academic Publishers, Boston, pp.100-110, 1998. [122] Burcin Bay and Borris Murmann, “Circuit techniques for reduced subthreshold leakage currents in digital CMOS circuits,” Mid term Report, Dept of Electrical Engg. and Computer Science, University of California, Berkeley, CA, 2001. [123] T. McPherson et al., “760 MHz G6S/390 Microprocessor exploiting multiple Vt and Copper Interconnects,” Proceedings of 2000 IEEE International Solid State Circuits Conference, Digest of Technical Papers, 2000, pp. 96-97. [124] N. Weste. and K. Eshragian, Principles of CMOS VLSI Design, Second edition, Reading, Mass Addison Wesley, 1993. 139

[125] K. Fujii et al., “A sub 1V triple threshold CMOS/SIMOX circuit for active power reduction,” Proceedings of IEEE International Solid State Circuits Conference, Digest of Technical Papers, June 1998, pp.190-191. [126] Seongmoo Heo, Kenneth Barr, Mark Hampton, and Krste Asanovic, “Dynamic FineGrain Leakage Reduction Using Leakage-Biased Bitlines ,” Proceedings of IEEE International Solid State Circuits Conference, ISCA-2002, Anchorage, Alaska, May 2002. [127] Amit Sinha and A. P. Chandrakasan, “JouleTrack - A Web Based Tool for Software Energy Profiling,” Proceedings of Design Automation Conference, DAC 2001, June 2001.


Technical Biography of Author
Rajeshwari M Banakar received the B.E degree in Electronics and Communication from Karnataka University (India) in 1984, and the M.Tech degree in Digital Electronics and Advanced Communication from the Regional Engineering College, Surathkal (India) in 1993. She worked at the Indian Space Research Organisation, Sriharikota, India, in 1987 and 1988. She is an Assistant Professor in Electronics and Communication at B.V. Bhoomaraddi College of Engineering and Technology, Hubli, Karnataka, and is working towards the Ph.D degree in Electrical Engineering at the Indian Institute of Technology Delhi. Her interests include Embedded Systems and Low Power Application Specific Design.

