
Low Power Techniques in Graphics Processing Units

Deepak Verma
dverma@syr.edu
Department of Computer Engineering, Syracuse University, Syracuse, NY 13210

Abstract
Power can be minimized at the system, architecture, algorithm, microarchitecture, gate, or circuit level. This paper discusses the levels at which power minimization has historically been carried out and surveys present research aimed at reducing power at each of these levels.

Introduction
Most of the power consumption in a computer occurs in the graphics processing unit, so this paper surveys several areas in which research is being done in this field. Section 1 describes past work aimed at reducing chip size and cost, while the remaining sections cover present research. Improvements at the algorithmic level are discussed in Section 2, covering a low-power video processor, high-quality motion estimation, and real-time DVB-S2 Low-Density Parity-Check (LDPC) decoding for GPUs. Development at the architecture level is explained with the example of low-power interconnects for SIMD computers in Section 3. The last level of minimization, the hardware level, is covered in Section 4 with hardware-efficient belief propagation and an area-optimized low-power 1-bit full adder.

History
1. Implementation of a Low Power One-Chip MUSE Video Processor
In the past, the emphasis was on logic minimization to reduce chip power and thereby lower chip cost. The low-power design allowed the chip to be mounted in inexpensive plastic packages; chips consuming more than 1.5 W had to be mounted in expensive ceramic packages. The main technique here is circuit reduction. In previous chip sets, the 160-word x 234-bit RAM consisted of two parts, since each area of RAM existed on another chip. Implementing everything on one chip reduces the required memory to 480 words x 65 bits. This memory uses the same 6-transistor memory cell as the previous single-port SRAM devices, but the individual cells are accessed through shift registers, as described below.

Power is given by P = C·V²·f, where C is the capacitance (proportional to chip area), V is the operating voltage, and f is the frequency. Power was reduced by lowering the operating voltage from 3.7 V to 3.3 V and by reducing the chip area through circuit reduction. We estimated the size of the memory and its peripherals: a single-port RAM is larger than a sequential-access memory. In the dedicated memory block, the access speed of a single-port RAM is not high enough to be accessed at 16 MHz, so a 1-to-3 serial-to-parallel converter and a 3-to-1 parallel-to-serial converter are used, requiring three 160-word x 65-bit RAM blocks. By changing from address decoders with an address buffer and address-transient detector to a shift register, the memory circuit size was reduced. Data output from the previous single-port SRAM blocks had to propagate through (1) the address counter, (2) the address buffer, (3) the pre-decoder, (4) the main decoder, (5) the memory cell, and (6) the amplifier; in the new design, data propagates only through (1) the shift register, (2) the memory cell, and (3) the amplifier. Thus the access speed increases: memory speed rose from approximately 5 MHz to 16 MHz.
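As a rough check on the voltage term of the equation above (the capacitance reduction from the smaller circuit area gives additional savings on top of this):

\[
\frac{P_{3.3\,\mathrm{V}}}{P_{3.7\,\mathrm{V}}} = \left(\frac{3.3}{3.7}\right)^{2} \approx 0.80,
\]

i.e. roughly a 20% reduction in dynamic power from the supply-voltage reduction alone.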

Present
In this section we discuss the various levels at which research is being done to lower power.

2. Algorithmic level
2.1. Low Power Video Processor
Multiple power-saving methods were applied to a video processor for color digital video and still cameras. The designer is often constrained by system-level specifications that cannot be changed, which also rules out a low-power redesign at that level. What remains is for the designer to use best judgment at the architectural levels (RTL and behavioral) and at the algorithmic level.

Power-Saving Methods That Were Rejected
Two power-reduction methods were investigated and rejected.
Asynchronous design: three factors are typically expected to reduce power: the clock network is eliminated, each module receives inputs only when it needs to compute, and dynamic voltage scaling may be employed when the processor can be slowed down. This method was shown to save up to 80% of total power during periods of low activity. The "bundled data" methodology with delay lines and a full-handshake interconnect was employed, but the extra power required by the delay lines and the handshake circuits far exceeded the power saved by eliminating the clock. This was due to the very low frequency of the clock (13.5 MHz, the video input/output rate).
Bus switching reduction: this selects between sending a value or its complement, compared to the previous value that is dynamically stored on the bus. Hamming-distance logic on the sender side determines which of the value or its complement incurs less switching, and the logic exploits it. Analysis shows that the bus load must exceed 1 pF before this method shows any benefit; for the average conditions of this small processor it is therefore inapplicable.

The Winner: Algorithmic Transformation
We have taken advantage of the facts that video pixels are often spatially correlated and that most of the processing algorithm is linear. We therefore resorted to computing the difference of every two successive pixels and converting the linear section of the algorithm to work on those differences. The differences are mostly zero or 1-2 bit numbers. Whereas architectural-level methods failed to save power, changing the algorithm to work on pixel differences yielded a 3-15% power reduction in typical cases, as sketched below.
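A minimal sketch of this difference transformation, assuming an 8-bit pixel stream and a purely linear processing step (a constant gain k, standing in for the linear section of the real algorithm); the function and variable names are illustrative:

    /* Hypothetical helper: process one pixel row on successive differences. */
    #include <stdint.h>

    void process_row(const uint8_t *x, int n, int k, int *y)
    {
        int acc = k * x[0];                      /* first pixel handled at full width   */
        y[0] = acc;
        for (int i = 1; i < n; i++) {
            int d = (int)x[i] - (int)x[i - 1];   /* mostly zero or a 1-2 bit number     */
            acc += k * d;                        /* linear operation on the narrow diff */
            y[i] = acc;                          /* equals k * x[i]                     */
        }
    }

Because the operation is linear, accumulating the processed differences reproduces the result of processing the full-width pixels, while the datapath mostly toggles only a few low-order bits.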

The assumption of small, spatially correlated differences is obviously false near edges in the image: the difference algorithm performs poorly after any sharp image gradient, and simulations have verified this on all our test images. We have therefore retained the original circuitry and employ it each time an edge is encountered; once the gradient has subsided and relatively stationary pixel levels have been re-established, the difference algorithm is turned back on and the original algorithm is shut off. Due to rounding errors there is some loss, but the combined original/difference algorithm has been designed to create output that deviates by no more than a single digital value from the original (a 'single LSB' error).

Five images were simulated on the new circuit, yielding different proportions of pixel differences (see the table below). Images A-C exhibit typical edge content and result in less than 50% of pixels that can be processed as differences. Image D has few edges, and the flat image E contains but one value; both exhibit a very high ratio of pixel differences, but they also require less power in the baseline processor, so there is no saving in either case, due to the additional overhead of the more complex architecture. Ignoring these extreme cases, the new architecture is useful for power saving on more typical images, and results in up to 15% power saving.

[Table: power saving (%), current with the difference architecture (mA), proportion of pixel differences (%), and baseline current (mA) for test images A-E.]

2.2. High Quality Motion Estimation
Motion estimation is the most computationally expensive part of the video encoding process. For digital TV formats like the main level/main profile of MPEG-2, the requirements for both processing power and I/O bandwidth are extremely high if excellent encoding quality is required, as is the case in a studio environment. One main focus of the VLSI activities was therefore to provide a solution for a "front-end" high-quality motion estimator that satisfies these constraints and implements some coding optimizations; this VLSI aims to replace complex FPGA-based prototype hardware.

The chip implements a new high-performance motion estimation algorithm based on a modified genetic search strategy that can be enhanced by tracking motion vectors through successive frames. The basic search approach is a genetic algorithm consisting of six basic steps: (1) initialization, i.e. finding a starting set of "chromosomes", each of them corresponding to one motion vector; (2) evaluation of the "fitness" of the chromosomes, i.e. calculation of the MAE criterion of the corresponding motion vectors; (3) selection of the "fittest" chromosomes, i.e. of the set of motion vectors with the lowest MAE; (4) crossover of the chromosomes, i.e. producing new motion vectors ("children") from the selected set of motion vectors ("parents"); (5) mutation of the children, i.e. randomly changing the new vectors according to a defined probability; and (6) iteration of the creation of new populations (i.e. repetition of steps 2 to 5) until a defined convergence criterion has been reached. A chromosome is represented directly by a motion vector with its two components concatenated. The mean of the two parent motion vectors is used as the crossover operation, and mutation is performed by adding small random vectors to the generated "sons". A set of the 9 fittest chromosomes (the best motion vectors so far) is always kept, to ensure that the final motion vector is the one with the lowest MAE over all performed matches and not only the best result of the last population. A sketch of the MAE fitness evaluation (step 2) is given below.
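A minimal sketch of the MAE fitness evaluation, assuming 16x16 macroblocks and a border-clamped reference frame; the names used here (cur, ref, W, H) and the clamping policy are illustrative assumptions, not the chip's exact datapath:

    #include <stdint.h>
    #include <stdlib.h>

    #define MB 16   /* macroblock size assumed for illustration */

    /* Mean absolute error between the current macroblock at (bx, by) and the
     * reference block displaced by the candidate motion vector (dx, dy).
     * A lower MAE means a "fitter" chromosome. */
    float mae_fitness(const uint8_t *cur, const uint8_t *ref, int W, int H,
                      int bx, int by, int dx, int dy)
    {
        int sum = 0;
        for (int y = 0; y < MB; y++) {
            for (int x = 0; x < MB; x++) {
                int rx = bx + x + dx, ry = by + y + dy;
                if (rx < 0) rx = 0; if (rx >= W) rx = W - 1;   /* clamp at borders */
                if (ry < 0) ry = 0; if (ry >= H) ry = H - 1;
                sum += abs((int)cur[(by + y) * W + (bx + x)] - (int)ref[ry * W + rx]);
            }
        }
        return (float)sum / (MB * MB);
    }

Selection then keeps the candidate vectors with the lowest MAE, crossover averages two parents, and mutation adds a small random offset, as described above.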

This scheme has been used as the basis for the new high-performance, low-complexity motion estimation algorithm of the IMAGE chip. The spatial correlation of the motion vectors within a frame is exploited: the best motion vector of the previous macroblock in the same slice is used as the initialization of the search, instead of the search-window center. Furthermore, in many cases the motion vector fields of consecutive frames are highly correlated. This can be exploited to significantly improve the results of the algorithm by applying a VECTOR TRACING technique, which corresponds to adding, at the initialization phase, the best vectors of the nine surrounding macroblocks in the reference frame (with appropriate scaling for the B frames). These adaptations result in the Vector-traced Modified Genetic Search algorithm (VT-MGS).

For very complex sequences (like basketball), the prediction quality for scenes with very fast motion can be further enhanced by applying a 2-phase vector tracing scheme. In the first phase, the sequence is treated like a low-delay coding sequence, and non-MPEG-2-conformant predictions in display order are performed with the sole goal of calculating very exact tracing vectors by adding up partial results of this estimation (e.g. the non-MPEG-2 motion vectors B1->I, B2->B1, and P->B2 are added up and stored to form the tracing vector P->I). In the second phase, MPEG-2-conformant motion estimation is done using the pre-calculated initialization vectors. This 2-phase motion estimation, referred to as VT-MGS2, of course implies a significant increase in processing power (by 50%) due to the additional first phase.

2.3. Real-Time DVB-S2 Low-Density Parity-Check (LDPC) Decoding for GPUs
Until now, the computational power required to decode large code words in real time, with the complexity associated with their large code size, was not available. Although LDPC decoding solutions have recently been proposed for multi-core platforms, they mainly address short and regular codes. In this paper we propose, for the first time, LDPC decoders based on Graphics Processing Units (GPUs) for the computationally demanding case of the irregular LDPC codes adopted in Digital Video Broadcasting (DVB-S2). The irregular nature of these LDPC codes can impose memory-access constraints, and this, together with the use of fast local memories, creates challenges which are difficult to overcome. Also, the scheduling mechanism imposes important restrictions on attempts to parallelize the algorithm. Nevertheless, thread-level and data-level parallelism can be conveniently exploited to harness the computational efficiency of these GPU-based signal processing algorithms. We show that it is possible to achieve real-time DVB-S2 LDPC decoding with throughputs above 90 Mbps on ubiquitous GPU computing platforms. The algorithms developed support multi-codeword decoding and are scalable to future GPU generations, which are expected to have a higher number of cores.

The LDPC codes adopted in DVB-S2 have a periodic nature, which allows the exploitation of suitable representations of the data structures to attenuate their computational requirements. The parity-check matrix H has the form

H_(N-K)×N = [ A_(N-K)×K | B_(N-K)×(N-K) ],

where A is sparse and B is a staircase lower-triangular matrix (ones on the main diagonal and on the first sub-diagonal). The periodicity constraints imposed on the pseudo-random generation of A allow a significant reduction in the storage requirements without any loss of code performance. The Min-Sum algorithm was adopted in this work to perform the decoding of these computationally intensive long LDPC codes.

Proposed Parallel Algorithm for Many-Core GPUs
A computing system with a GPU consists of a host, typically a Central Processing Unit (CPU), that is used for programming and controlling the operation of the GPU. The GPU itself is a massively parallel processing engine that can speed up processing by simultaneously performing the same operation on distinct data distributed over many arithmetic processing units. GPUs are programmable, and one of the most widely used programming models is the NVIDIA Compute Unified Device Architecture (CUDA). The execution of a kernel on a GPU is distributed across a grid of thread blocks of adjustable size.

[Figure: parallel multithreaded LDPC decoder processing, showing (a) kernels 1 and 2 executed on the GPU using one thread per node of the Tanner graph, where, for example, (b) bit nodes BNd, BNf, ..., BNK are the BNs connected to check node CN0, and (c) threads are grouped and processed in blocks on the GPU many-core architecture.]

A minimal serial sketch of the check-node (kernel 1) min-sum update is given below.
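This sketch assumes messages are stored per edge in flat arrays; the names (check_deg, cn_edges, q, r) are illustrative, and on the GPU each iteration of the outer loop would correspond to one thread, as in the thread-per-node approach described above:

    #include <math.h>

    /* Min-sum check-node update: for every check node m and every incident edge i,
     * the outgoing message is the product of the signs times the minimum magnitude
     * of the incoming bit-to-check messages q on the other edges of m. */
    void checknode_update(int num_cns, const int *check_deg, const int *const *cn_edges,
                          const float *q,   /* bit-to-check messages, indexed by edge  */
                          float *r)         /* check-to-bit messages, indexed by edge  */
    {
        for (int m = 0; m < num_cns; m++) {          /* one thread per check node on GPU */
            for (int i = 0; i < check_deg[m]; i++) { /* outgoing message on edge i       */
                float min_abs = INFINITY;
                float sign = 1.0f;
                for (int j = 0; j < check_deg[m]; j++) {
                    if (j == i) continue;            /* exclude the destination edge     */
                    float v = q[cn_edges[m][j]];
                    if (v < 0.0f) sign = -sign;
                    float a = fabsf(v);
                    if (a < min_abs) min_abs = a;
                }
                r[cn_edges[m][i]] = sign * min_abs;  /* min-sum approximation            */
            }
        }
    }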

The algorithm developed attempts to exploit two major capabilities of GPUs: the massive use of thread- and data-level parallelism, and the minimization of memory accesses.

Multithread-based processing: in order to extract full thread-level parallelism from the GPU, the proposed DVB-S2 LDPC decoder adopts a thread-per-node approach (thread-per-row and thread-per-column processing). The figure above illustrates this strategy with 16 threads per block (tc0, ..., tc15) processed in parallel inside block 0 for the check-node processing of kernel 1; a similar approach is applied to the remaining threads tc16, ..., tcN-K of kernel 1, which are grouped and executed in the other blocks of the grid. Likewise, in kernel 2 the threads tB0, ..., tBN-1 perform the equivalent parallel bit-node processing. The efficiency of this parallelism is achieved by adopting a flooding-schedule strategy that eliminates the data dependencies in the exchange of messages between BNs and CNs, which often degrade performance on multi-core systems.

Coalesced accesses to data structures: in a GPU, parallel accesses to the slow global memory can kill performance and should be avoided whenever possible. To optimize this type of operation, data is contiguously aligned in memory, which allows coalescence to take effect and lets several threads access corresponding data simultaneously. Nevertheless, modern GPU hardware is becoming more efficient at dealing with out-of-order memory accesses and related issues. Additionally, this solution uses 8 bits to represent data, which compares favorably with existing state-of-the-art VLSI DVB-S2 LDPC decoders that typically use 5 or 6 bits. Moreover, to fully exploit the massive processing power of the GPU, the algorithm performs multicodeword decoding, processing 16 code words in parallel. This is an efficient parallel algorithm for the massive decoding of DVB-S2 LDPC codes on GPUs, and high throughputs, surpassing 90 Mbps, can be achieved for real-time applications. One possible interleaved indexing scheme for such a coalesced, multicodeword layout is sketched below.
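This sketch assumes messages are interleaved so that the 16 copies of the same edge (one per code word) sit at consecutive addresses; the names and the layout itself are illustrative assumptions, not the authors' exact data structure:

    #define NUM_CODEWORDS 16   /* code words decoded in parallel */

    /* Interleaved layout: msgs[edge * NUM_CODEWORDS + cw].  Threads that process
     * the same Tanner-graph edge for the 16 code words then touch NUM_CODEWORDS
     * consecutive words, which coalesces into a single wide memory transaction. */
    static inline int msg_index(int edge, int cw)
    {
        return edge * NUM_CODEWORDS + cw;
    }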

3. Architecture level
3.1. Low Power Interconnects for SIMD Computers
One of the most power-efficient ways to utilize a growing transistor budget is to integrate multiple processing elements (PEs) within a die. Many architectures reflect this in the form of an increasing number of single-instruction multiple-data (SIMD) lanes and in the shift from multi-core to many-core designs. A limit on the SIMD width in different application domains (such as 3D graphics, high-definition video, image processing, and wireless communications) is the scalability of the interconnect network between the processing elements in terms of both area and power: Network-on-Chip studies show that the crossbar itself consumes between 30% and almost 50% of the total interconnect power. Another critical problem is that existing circuit topologies in traditional interconnects do not scale well, because of the complexity of the control wires and the control-signal generation logic, which directly affects delay and power consumption. We use the XRAM, a low-power, high-performance matrix-style crossbar, to address these issues.

XRAM Fundamentals
One circuit technique that helps solve the control-complexity problem is to embed the interconnect control within the cross points of a matrix-style crossbar using SRAM cells. The input buses run horizontally while the output buses run vertically, creating an array of cross points. Each cross point contains a 6T SRAM bit cell, and the state of that bit cell determines whether or not the input data is passed onto the output bus at the cross point. Along a column, only one bit cell is programmed to store a logic high and create a connection to an input. This differs from the traditional technique, in which interconnections are set by an external controller.

Matrix-type crossbars normally incur a huge area overhead because of the quadratically increasing number of control signals required to set the connectivity at the cross points. To program its cross points, the XRAM uses techniques similar to those employed in SRAM arrays: in an SRAM array the same bit line is used both to read and to write a bit cell, and likewise the XRAM re-uses the output channels for programming. Until the XRAM is programmed, the output buses do not carry any useful data, so they can be used to configure the SRAM cells at the cross points without affecting functionality. Along a channel (output bus), each SRAM cell is connected to a unique bit line of the bus, which allows multiple SRAM cells (as many as there are bit lines in the channel) to be programmed simultaneously. Other circuit techniques, such as using the same output bus wires to program the cross-point control, further reduce the number of control wires needed within the XRAM. Hence, compared to conventionally implemented crossbars, the area scales linearly with the product of the input and output port counts while consuming almost 50% less energy. Finally, borrowing the low-voltage-swing techniques currently used in SRAM arrays improves performance and lowers the energy used to drive the XRAM wires.

Though these techniques solve the performance and scaling problems of traditional interconnects, one drawback is flexibility: the XRAM can only store a certain number of swizzle configurations at a given time. We find that many applications, especially in the signal-processing domain, use a small number of permutations over and over again. By caching the most frequently used patterns, the XRAM reduces power and latency by eliminating the need to reconfigure and reprogram the crossbar for those patterns. To further improve silicon utilization, multiple SRAM cells can be embedded at each cross point to cache more than one shuffle configuration, raising silicon utilization to 45%. A case study shows that, compared to conventional MUX-based implementations, the XRAM achieves 1.4x the performance and consumes 1.5-2.5x less power in a color-space conversion algorithm. A behavioral sketch of the cached-configuration idea is given below.
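This is a behavioral sketch (not a circuit model) of a crossbar with cached swizzle configurations, under the assumption that each cached pattern can be summarized as a map from output port to the input whose cross-point cell holds a logic high; the sizes and names are illustrative:

    #define NUM_IN   16
    #define NUM_OUT  16
    #define NUM_CFG   4   /* swizzle configurations cached per cross point */

    /* cfg[k][out] = index of the input bus connected to output 'out' in pattern k,
     * i.e. which cross point along that output column stores a '1'. */
    static int cfg[NUM_CFG][NUM_OUT];

    /* Route one word per input bus through the crossbar using cached pattern k.
     * Applying a cached pattern requires no reprogramming of the cross points. */
    void xram_route(int k, const int in[NUM_IN], int out[NUM_OUT])
    {
        for (int o = 0; o < NUM_OUT; o++)
            out[o] = in[cfg[k][o]];
    }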

4. Hardware level
4.1. Hardware-Efficient Belief Propagation
Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as the Markov random field (MRF), but it has high memory, bandwidth, and computational costs. BP algorithms generally require a great amount of memory for storing the messages, typically on the order of tens to hundreds of times larger than the input data. Moreover, since each message is processed hundreds of times, the saving and loading of messages consumes considerable bandwidth. Furthermore, the iterative, pixel-wise, and sequential operations of BP make it difficult to parallelize: because the messages are sequentially updated and each message is constructed via a sequential procedure, it is difficult to utilize hardware parallelism to accelerate BP. Therefore, although BP may work on high-end platforms such as desktops, these costs limit its wider use. In this paper, we propose two techniques to address these issues.

Tile-Based BP
Tile-based BP splits the MRF into many tiles and only stores the messages across the neighboring tiles. The memory and bandwidth required by this technique are only a fraction of those of ordinary BP algorithms, while the quality of the results, as tested on the publicly available Middlebury MRF benchmarks, is comparable to other efficient algorithms.

Let us first consider the process of generating the outgoing messages of a node p. To do so, we need the four incoming messages toward p, the data cost of p, and the smoothness cost between p and each neighbor q. An interesting question is what data are required to generate the messages toward p: the answer is the data costs of p's neighbors and the messages sent from the neighbors of those neighbors. In other words, we do not need the data of nodes far away from p, and we do not have to access variables outside this group of nodes. Also, only half of the messages need to be stored: this property is exploited in a bipartite graph, where the nodes are split into two sets so that every edge connects two nodes of different sets. To generate the messages from the first set to the second, we only need the messages from the second set to the first. A minimal sketch of a single message update is given below.
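This sketch uses the standard min-sum form of the update, assuming L discrete labels and a truncated-linear smoothness cost; the function and parameter names are illustrative. The message from p to a neighbor q is computed from p's data cost and the incoming messages from p's other neighbors:

    #include <float.h>
    #include <math.h>

    #define L 16          /* number of labels, e.g. stereo disparities */

    /* Truncated linear smoothness cost between labels lp and lq. */
    static float smooth_cost(int lp, int lq, float lambda, float trunc)
    {
        float d = fabsf((float)(lp - lq));
        return lambda * (d < trunc ? d : trunc);
    }

    /* Compute the outgoing message p -> q for all labels.
     * data_p[l]     : data cost of node p for label l
     * in_msgs[3][l] : incoming messages toward p from its neighbors other than q
     * out[l]        : resulting message p -> q                                   */
    void bp_message(const float data_p[L], const float in_msgs[3][L],
                    float lambda, float trunc, float out[L])
    {
        for (int lq = 0; lq < L; lq++) {
            float best = FLT_MAX;
            for (int lp = 0; lp < L; lp++) {
                float h = data_p[lp] + in_msgs[0][lp] + in_msgs[1][lp] + in_msgs[2][lp];
                float c = h + smooth_cost(lp, lq, lambda, trunc);
                if (c < best) best = c;
            }
            out[lq] = best;
        }
    }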

When we perform message passing in one subset, the messages in the other subset can be removed without affecting the operation. For example, we can split the nodes of the MRF into two sets: one set N1 contains the nodes in a 3-by-3 tile, and the other set N2 contains all the other nodes. When we perform BP in N2, we only need the messages coming from N1 to drive the propagation, without knowing the messages inside N1. All the messages inside the tile are irrelevant to the message passing outside the tile, because they are never used in evaluating it; the only messages that must always reside in memory are those sent from one subset to the other. To generate the messages from the nodes on the tile boundary, we only need the messages from their neighbors. This rule can easily be extended: if we have the messages from the boundary of a region, we can sequentially generate the messages inward, and after we reach the region center we can sequentially generate the outward messages. The only required inputs are the data costs and smoothness costs of this region and the messages of the boundary nodes. This concept can be extended to multiple regions and iterations, and given enough computation, the approximation quality is good enough to drive the propagation to converge: as the algorithm iterates (Ti iterations), the energy keeps decreasing.

With these two techniques, the memory, bandwidth, and computational costs of BP are greatly reduced and parallel processing is enabled. A VLSI circuit and a GPU program for stereo matching based on the proposed techniques have been developed, and the applicability of the techniques has been demonstrated on various applications in the Middlebury MRF benchmark. These techniques can also be applied to other parallel platforms, so BP becomes more suitable for low-cost and power-limited consumer electronics.

4.2. Area Optimized Low Power 1-Bit Full Adder
We propose a low-power 1-bit full adder (FA) with 10 transistors and use it in the implementation of an ALU, which occupies very little area and consumes very little power; the design is therefore characterized as an area-efficient, low-power ALU. A conventional CMOS full adder consists of 28 transistors, whereas here the full adder is designed with only 10 transistors. The power and area are thus greatly reduced: by more than 70% compared to the conventional design and by about 30% compared to transmission gates. The leakage power is also reduced by designing the full adder with fewer power-supply-to-ground connections. The design does not compromise speed, since the delay of the full adder, and thus the overall delay, is minimized. The carry status of the full adder, tabulated in the original work, can be summarized as follows (a behavioral sketch is given below): if both A and B are '1', a carry is generated, because summing A and B makes the output SUM '0' and CARRY '1'; if both A and B are '0', summing A and B gives '0' and any previous carry is added to this SUM, making the CARRY bit '0', in effect deleting the carry.
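A behavioral (gate-level, not transistor-level) sketch of the full-adder function; the three cases above correspond to carry generate (A=B=1), carry kill (A=B=0), and carry propagate (A differs from B):

    #include <stdio.h>

    /* Behavioral 1-bit full adder: SUM = A xor B xor Cin,
     * CARRY = generate (A & B) or propagate ((A ^ B) & Cin). */
    static void full_adder(int a, int b, int cin, int *sum, int *cout)
    {
        int p = a ^ b;              /* propagate */
        *sum  = p ^ cin;
        *cout = (a & b) | (p & cin);
    }

    int main(void)
    {
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                for (int cin = 0; cin <= 1; cin++) {
                    int s, c;
                    full_adder(a, b, cin, &s, &c);
                    printf("A=%d B=%d Cin=%d -> SUM=%d CARRY=%d\n", a, b, cin, s, c);
                }
        return 0;
    }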

The total power consumption of a CMOS circuit includes dynamic power, static power, and short-circuit power; the last two items are neglected due to their low contribution. The dominant factor is the dynamic power, given by P = C_L·f·V_dd². The instantaneous power P(t) drawn from the power supply is proportional to the supply current I_dd(t) and the supply voltage V_dd(t), P(t) = I_dd(t)·V_dd(t), and the energy consumed over a time interval T is the integral of the instantaneous power over that interval, E = ∫ I_dd(t)·V_dd(t) dt.

Three designs are compared: the static complementary CMOS adder using 28 transistors, a fourteen-transistor (14T) full adder with transmission gates, and the proposed 10-transistor full adder design (10TFA). The proposed 10TFA takes the three inputs A, B, and Cin, where Cin represents the carry input to the first stage; the outputs are SUM and CARRY. The 10-transistor 1-bit full adder is designed at the transistor level using a 0.18 µm CMOS process technology, which provides transistors with three characteristics: high speed, low voltage, and low leakage. The typical supply voltage for this process is 1.8 V. As the main target of this design is to minimize power, the transistors are selected accordingly. Based on the simulation, the 10-transistor 1-bit full adder consumes 6.2995 µW, whereas a conventional full adder consumes 16.675 µW, which shows a 62.2% power saving.
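The reported saving can be checked directly from these simulation figures:

\[
\frac{16.675 - 6.2995}{16.675} = \frac{10.3755}{16.675} \approx 0.622,
\]

i.e. the quoted 62.2% reduction in power of the 10TFA relative to the conventional 28-transistor adder.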

The leakage power is also very low, since the number of power-supply-to-ground connections is greatly reduced. The power consumption of the 16-bit ALU built with the 10-transistor full adder is observed to be 1197.5 µW.

[Table: energy (pJ) and power (µW) of the conventional 28-transistor, transmission-gate, and 10-transistor full adders, and of 2x1 and 4x1 multiplexers, the logical unit, and ALUs implemented with CMOS gates, with transmission gates, and with the 10-transistor full adder.]

References
[1] Implementation of a Low Power One-Chip MUSE Video Processor, by Tetsuo Aoki.
[2] A Low Power Video Processor, by Uzi Zangi and Ran Ginosar.
[3] IMAGE: A Low Cost, Low Power Video Processor, by Friederich Mombers.
[4] Real-Time DVB-S2 LDPC Decoding on Many-Core GPU Accelerators, by Gabriel Falcao, Joao Andrade, Vitor Silva and Leonel Sousa.
[5] Low Power Interconnects for SIMD Computers, by Mark Woh.
[6] Hardware-Efficient Belief Propagation, by Chia-Kai Liang, Chao-Chung Cheng and Yen-Chieh Lai.
[7] Area Optimized Low Power Arithmetic and Logic Unit, by T. Esther Rani.
[8] http://ieeexplore.ieee.org.libezproxy2.syr.edu/search/advsearch.jsp