High Performance Reconfigurable Balanced Shared Memory Architecture For Embedded DSP

Published by: ijcsis on Aug 13, 2010
Copyright:Attribution Non-commercial

 
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010, http://sites.google.com/site/ijcsis/, ISSN 1947-5500

J. L. Mazher Iqbal
Assistant Professor, ECE Department, Rajalakshmi Engineering College, Chennai-602 105, India
mazheriq@gmail.com

Dr. S. Varadarajan
Associate Professor, ECE Department, Sri Venkateswara College of Engineering, Sri Venkateswara University, Tirupati-517 502, India
varadasouri@gmail.com

Abstract: Reconfigurable computing greatly accelerates a wide variety of applications and has therefore become the subject of a great deal of research. It can perform computations in hardware to increase performance while keeping much of the flexibility of a software solution. In addition, reconfigurable computers contain functional resources that may be easily modified after field deployment in response to changing operational parameters and datasets. To date, the core processing element of most reconfigurable computers has been the field programmable gate array (FPGA) [3]. This paper presents a reconfigurable FPGA-based hardware accelerator for embedded DSP. Reconfigurable FPGAs have significant logic, memory and multiplier resources, which can be used in parallel to implement very high performance DSP processing. The advantages of DSP design using FPGAs are a high number of instructions per clock, a high number of multipliers, high-bandwidth flexible I/O, and memory connectivity. The proposed processor is a reconfigurable architecture that consists of processing elements (PEs), memories, an interconnection network and control elements. A processing element based on bit-serial arithmetic (multiplication and addition) is also given. In this paper, it is established that a specific universal balanced architecture implemented in an FPGA is a universal solution suited to a wide range of DSP algorithms. First the principle of the modified shared-memory based processor is shown, and then the specific universal balanced architecture is proposed. An example processor for the TVDFT transformation on the given accelerator is also given. With the proposed architecture, we could reduce the cost, area and hence power of the best-known designs in Xilinx FPGA technology.

Keywords: Reconfigurable architectures; FPGA; Pipeline; Processing Element; Hardware Accelerator

I. INTRODUCTION

Now that design rules have stopped shrinking for ASICs, ASSPs and the like, they seem likely to be replaced by FPGAs. With their design rules coming into the 40 nm generation, FPGAs will soon be level with ASICs and ASSPs in terms of circuit size and performance. The circuit configuration of FPGAs can be freely revised by equipment manufacturers on the spot, which means that they have no need to pay development costs, including mask sets. Even better, FPGAs do not require any circuit fabrication after design, which means faster equipment development.

Nowadays, consumer appliances have become more advanced than ever. They are required to be more functional and portable. Moreover, product lifetimes have become shorter. There are two important issues in developing LSIs: development period and cost. Developing a new LSI demands a huge investment and carries a big risk. Programmable devices such as CPUs, DSPs and FPGAs have become key to resolving these issues, and hardware reconfigurability has attracted attention because of its high performance [3]. An FPGA is highly flexible and well suited to implementing control circuits, but it suffers from low area efficiency when implementing data-dominated circuits. When implementing industrial application systems, the area of an FPGA implementation is far larger than that of an ASIC implementation because of the high reconfigurability. A reconfigurable architecture can configure the connections between programmable logic elements, registers and memory in order to construct a highly parallel implementation of the processing kernel at run time. This feature makes such architectures attractive, since a specific high-speed circuit for a given instance of an application can be generated at compile time or even at run time. Since the appearance of the first reconfigurable computing systems, DSP applications have served as important test cases in reconfigurable architecture and software development. In the area of special-purpose architectures for digital signal processing, systolic arrays are recognized as a standard for high performance. Systolic designs represent an attractive architectural paradigm for efficient hardware implementation of computation-intensive DSP applications, being supported by features such as simplicity, regularity and modularity of structure. In addition, they have significant potential to yield a high throughput rate by exploiting a high level of concurrency using pipelining, parallel processing or both [1]. Today’s objective is to tailor system performance to a given task at minimal cost in terms of chip area and power consumption. Finding a universal solution suited to a wide range of DSP algorithms is a permanently relevant task. To reach the required real-time performance, it must be a multiprocessor architecture. At the architectural level, the main interest is the overall organization of the system, composed of processing elements (PEs), memories, communication channels and control elements. One possible approach is the so-called shared-memory architecture. Our architecture can obtain high area efficiency and high performance for implementing industrial applications.

II. PRINCIPLE OF SHARED MEMORY BASED PROCESSOR

In this section, we review the shared-memory approach for DSP applications [13]. The shared-memory architecture is shown in Figure 1. The idea is very simple. In order to provide the PEs with input data simultaneously, we need to partition the shared memory into blocks. Processing elements (PEs) usually perform a simple memoryless mapping of the input values to a single output value. Using a rotating access scheme, each processor gets access to the memories once per N cycles (N is the number of PEs). During this time the processor either writes data to memory or reads data from it. All processors have time slots of the same duration for accessing the memories, so access conflicts are completely avoided. The disadvantage of the shared-memory architecture is the memory bandwidth bottleneck. In order to avoid the bandwidth bottleneck and simultaneously provide the processors with several (K) input data, the shared memory is partitioned into K memories (Figure 3).
Fig. 1. Shared-memory architecture
In this paper, a special instance of that architecture is presented. The main target is to find a balance between the complexity of the interconnection network, the computation model of the PEs (serial vs. parallel), the number of PEs and the memory size. The chosen compromise should satisfy the following factors: required performance, minimal power consumption and cost in terms of chip area. Another important requirement is to create a flexible, easily reconfigurable architecture suited to a wide range of DSP algorithms.
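The rotating access scheme described in this section can be illustrated with a small time-slot model. The Python sketch below is only an illustration: the function names and the rule that the PE with index `cycle mod N` owns the memory port in a given cycle are assumptions made for the example, not details taken from the paper.

```python
def rotating_schedule(num_pes, num_cycles):
    """Return, for each clock cycle, the index of the PE that owns the memories.

    Assumption for this sketch: slots rotate in fixed order, so the owner
    in a given cycle is simply cycle mod N.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    return [cycle % num_pes for cycle in range(num_cycles)]


def slots_per_pe(schedule, num_pes):
    """Count how many access slots each PE received in the schedule."""
    counts = [0] * num_pes
    for owner in schedule:
        counts[owner] += 1
    return counts


schedule = rotating_schedule(num_pes=4, num_cycles=8)
print(schedule)                   # [0, 1, 2, 3, 0, 1, 2, 3]
print(slots_per_pe(schedule, 4))  # [2, 2, 2, 2]
```

Because exactly one PE owns each cycle, two PEs can never access the memories at the same time, and every PE gets one slot per N cycles. The same model also shows the bandwidth limit that motivates partitioning the memory into K blocks.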
A. Processing Elements (PEs)

Processing elements usually perform a simple memoryless mapping of the input values to a single output value. The PEs can operate in parallel or serial fashion. The parallel form requires a parallel data bus and careful design because of delays and carry propagation. It completes an arithmetic operation in one clock cycle, but compared with the serial form it consumes more chip area. Serial PEs receive their inputs bit-serially, and their results are also produced bit-serially. Hence, only a single wire is required for each signal. The design process can be simpler and more robust. The cost in terms of chip area and power consumption is therefore low. However, to achieve the required performance, bit-serial communication leads to high clock frequencies.
 
Fig. 2. Shared-memory architecture
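To make the serial versus parallel trade-off concrete, the following Python sketch models a bit-serial adder: one full adder plus a one-bit carry register, processing one bit per clock cycle with the least significant bit first. The helper names and the LSB-first ordering are assumptions for illustration; the paper does not specify the bit order.

```python
def to_bits_lsb_first(value, width):
    """Split an unsigned integer into a list of bits, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits):
    """Reassemble an unsigned integer from LSB-first bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit streams, one bit per clock cycle.

    Returns the sum bits and the final carry. Only one full adder and a
    single carry register are needed, whatever the word width.
    """
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):          # one loop iteration = one clock
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


width = 8
s, carry_out = bit_serial_add(to_bits_lsb_first(100, width),
                              to_bits_lsb_first(55, width))
print(from_bits_lsb_first(s), carry_out)      # 155 0
```

An 8-bit addition takes 8 clock cycles here, where a parallel adder would take one. This is why a serial PE needs a proportionally faster clock to match the throughput of a parallel PE, while using far less logic and a single wire per signal.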
B. Memory elements

Memory elements are slow compared with PEs. It is desirable to make a trade-off between additional registers and RAM to achieve read and write performance appropriate to that of the PEs. With a bit-parallel PE, a high-speed register plays the role of a trivial (one-word) cache memory. With a bit-serial PE, there must be a shift register. Data can be shifted into and out of the register at high speed, and the complete word can then be written into the RAM. The RAM addressing requires only cyclic operation. Data is read from the RAM bit-parallel and stored in the shift register. The number of RAM words should be sufficient to store all the variables required by the implemented algorithm.
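The role of the shift register as a serial/parallel converter in front of a cyclically addressed RAM can be sketched as follows. The class names `ShiftRegister` and `CyclicRam`, the 4-bit width and the LSB-first bit order are hypothetical choices for this example and do not come from the paper.

```python
class ShiftRegister:
    """Shift register usable as serial-in/parallel-out and parallel-in/serial-out."""

    def __init__(self, width):
        self.width = width
        self.bits = [0] * width          # bits[0] is the least significant bit

    def shift_in(self, bit):
        """Shift one bit in at the MSB end; return the bit leaving at the LSB end."""
        out = self.bits[0]
        self.bits = self.bits[1:] + [bit & 1]
        return out

    def load(self, word):
        """Parallel load of a whole word (for example, a word read from RAM)."""
        self.bits = [(word >> i) & 1 for i in range(self.width)]

    def read(self):
        """Parallel read of the whole word (for example, for a RAM write)."""
        return sum(bit << i for i, bit in enumerate(self.bits))


class CyclicRam:
    """RAM whose write address simply advances cyclically."""

    def __init__(self, depth):
        self.words = [0] * depth
        self.address = 0

    def write_next(self, word):
        self.words[self.address] = word
        self.address = (self.address + 1) % len(self.words)


reg = ShiftRegister(width=4)
for bit in [1, 1, 0, 1]:                 # serial stream of 0b1011, LSB first
    reg.shift_in(bit)
ram = CyclicRam(depth=2)
ram.write_next(reg.read())               # one parallel write per word
print(ram.words)                         # [11, 0]

reg.load(ram.words[0])                   # parallel read back into the register
print([reg.shift_in(0) for _ in range(4)])   # [1, 1, 0, 1]
```

In this model the fast bit-level traffic touches only the register, and the RAM sees one parallel access per word, which is the trade-off between registers and RAM that the paragraph above describes.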
C. Interconnection network (ICN)

The interconnection network provides the communication channel needed to supply the PEs with the proper data and parameters, and to store results in the proper memories. The data movement should be kept simple, regular and uniform. Major design issues involve the topology of the communication network and its bandwidth.

III. BALANCED MODIFIED SHARED MEMORY ARCHITECTURE

For a single basic arithmetic operation such as addition or multiplication, a bit-parallel PE obviously has several times higher performance than a bit-serial one. However, taking into account the whole module with PEs, input and output registers, memory and interconnection network, the advantage of the parallel form is not so clear. When power consumption and chip area are included, the serial form could be more convenient. Generally, a smaller chip area and a lower clock frequency lead to lower power consumption. The requirement on the PE is that it completes its operation within the specified time limit. The chip area of a single serial PE is self-evidently much smaller than that of a parallel PE, but it needs a faster clock to reach the same performance. Parallel PEs lead to more connection lines and consequently more area and power. A parallel PE appears to have several times higher computational throughput than a serial PE at the same clock; however, with 2-3 PEs in a shared-memory architecture it could be impossible to use a high-speed clock because of noise in signal propagation on long parallel buses. On the other hand, the control part of the whole system in the serial-PE version may be microprogrammed, so that the implemented algorithm is changed by changing the contents of the control memory. With parallel PEs, the control part must be changed significantly whenever the type of computational task changes. These brief considerations show that a shared-memory architecture with serial PEs can be easier to implement and is better suited to a wide range of DSP algorithms.

The proposed balanced shared-memory module, based on the approach in [13], is shown in Figure 3. The processing elements are capable of performing three computing functions: bit-serial full addition (including carry), bit-serial multiplication and negation, with two inputs and one output. Other arithmetic operations are performed as sequences of additions. Because of the two serial inputs and one serial output, each PE is equipped with four shift registers (two inputs and one output for two independent memories). These registers are used as single-word cache memory and as serial/parallel and parallel/serial converters on the communication path to the RAM. Hence the interconnection network is very simple. This leads to a small chip area and the possibility of using a high-speed clock. The number of PEs should be 2, with consequently 1 RAM block (one RAM per multiplied output of PE). The multiplier output of PE1 is shipped into the RAM block using signals s1=1 and s2=0, and passed to PE2 via a shift register. PE2 accumulates the multiplier output and writes the result back to the output buffer. Our shared-memory architecture offers a good balance in terms of chip area, power consumption, computational throughput and flexibility. If the required performance is not reached, the proposed module can be “multiplied”, i.e. connected as shown in Figure 5 and Figure 6.
Using parallel (Figure 5) or cascade (Figure 6) connection of modules, it is easy to create a processor suited to a wide range of algorithms. Almost any required performance can be achieved as well. In the next part of this article, an example realization of a multiplier based on serial arithmetic and of the TVDFT transformation based on the proposed architecture is presented. The heart of the proposed architecture is the PE (Figure 4). This element performs all bit-serial arithmetic calculations such as multiplication and addition. In addition, the processing element can negate the “b” input when the control signal “not_b” is high. Control signals s1 and s2 enable multiplication and addition respectively.
Fig. 3. Instance of universal specific balance architecture (block labels: IN1, IN2, R1.1, R1.2, PE1, R2.1, R2.2, PE2, R3.1, R3.2, DEMUX, RAM, S1, S2, OUTPUT)
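The control signals of the PE and the two-PE multiply-accumulate flow described above can be sketched with a word-level behavioural model in Python. This is a sketch under stated assumptions, not the authors' circuit: it works on whole words rather than bit streams, it treats "negation" as two's-complement negation of the b input, it requires exactly one of s1 and s2 to be high, and the 16-bit word width and the function names are hypothetical.

```python
def pe(a, b, s1, s2, not_b=0, width=16):
    """Word-level model of one PE: s1 selects multiply, s2 selects add.

    Assumptions of this sketch: not_b applies two's-complement negation
    to b, exactly one of s1/s2 is high, and results wrap to `width` bits.
    """
    mask = (1 << width) - 1
    if not_b:
        b = -b
    if s1 and not s2:
        result = a * b
    elif s2 and not s1:
        result = a + b
    else:
        raise ValueError("exactly one of s1 (multiply) or s2 (add) must be high")
    return result & mask


def multiply_accumulate(xs, ys, width=16):
    """Two-PE module: PE1 multiplies each pair, PE2 accumulates the products."""
    acc = 0
    for x, y in zip(xs, ys):
        product = pe(x, y, s1=1, s2=0, width=width)       # PE1: multiply
        acc = pe(acc, product, s1=0, s2=1, width=width)   # PE2: accumulate
    return acc


print(multiply_accumulate([1, 2, 3], [4, 5, 6]))   # 32
print(pe(10, 3, s1=0, s2=1, not_b=1))              # 7  (10 + (-3))
```

A sum of products of this kind is the basic operation of filters and transforms, so chaining a multiplying PE into an accumulating PE covers a large share of typical DSP kernels.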