
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 3, MARCH 2010

**Reducing SRAM Power Using Fine-Grained Wordline Pulsewidth Control**

Mohamed H. Abu-Rahma, Member, IEEE, Mohab Anis, Member, IEEE, and Sei Seung Yoon

Abstract—Embedded SRAM dominates modern SoCs, and there is a strong demand for SRAM with lower power consumption while achieving high performance and high density. However, the large increase of process variations in advanced CMOS technologies is considered one of the biggest challenges for SRAM designers. In the presence of large process variations, SRAMs are expected to consume more power to ensure correct read operations and meet yield targets. In this paper, we propose a new architecture that significantly reduces the array switching power for SRAM. The proposed architecture combines built-in self-test and digitally controlled delay elements to reduce the wordline pulsewidth for memories while ensuring correct read operations, hence reducing the switching power. Monte Carlo simulations using a 1-Mb SRAM macro in an industrial 45-nm technology are used to verify the power saving for the proposed architecture. For a 48-Mb memory density, a 27% reduction in array switching power can be achieved for a read access yield target of 95%. In addition, the proposed system provides larger power savings as process variations increase, which makes it an attractive solution for 45-nm-and-below technologies.

Index Terms—Built-in self-test (BIST), low power, random variations, SRAM, statistical yield estimation.

I. INTRODUCTION

WITH technology scaling, the requirements of higher density and lower power embedded SRAMs are increasing exponentially. It is expected that more than 90% of the die area in future systems-on-chip (SoCs) will be occupied by SRAM [1]. This is driven by the high demand for low-power mobile systems, which integrate a wide range of functionality such as digital cameras, 3-D graphics, MP3 players, and other applications. Meanwhile, random variations are increasing significantly with technology scaling. Random dopant fluctuation (RDF) is the dominant source of random variation in the bit cell's transistors. The variations in threshold voltage ($V_{th}$) due to RDF are inversely proportional to the square root of the device area [2]. Therefore, SRAM bit cells experience the largest random variations on a chip, as bit-cell transistors are typically the smallest devices for the given design rules [3]–[6]. Embedded SRAMs usually dominate the SoC silicon area, and their power consumption (both dynamic and static) is a considerable portion of the total power consumption of a SoC.

Fig. 1. Memory read power and bitline differential versus memory $T_{WL}$ for a 512-kb macro in 65-nm technology.

Manuscript received May 27, 2008; revised November 19, 2008. First published June 10, 2009; current version published February 24, 2010. M. H. Abu-Rahma and S. S. Yoon are with Qualcomm Incorporated, San Diego, CA 92121 USA (e-mail: marahma@qualcomm.com). M. Anis is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: manis@vlsi.uwaterloo.ca). Digital Object Identiﬁer 10.1109/TVLSI.2009.2012511

Moreover, the SRAM yield can dominate the overall chip yield. Hence, statistical design margining techniques are used to guarantee high memory yield. However, achieving high yield negatively impacts memory power consumption (and speed). The stringent requirements of high yield and low power consumption require combining circuit and architectural techniques to reduce SRAM power consumption. SRAM array switching power is considered one of the largest components of power in high-density memories [7], [8]. This is mainly because of the large memory arrays and the requirement for high area efficiency, which force SRAM designers to use the maximum number of rows and columns enabled by the technology. Fig. 1 shows the dynamic power consumption for read operations versus the wordline (WL) pulsewidth $T_{WL}$

1063-8210/$26.00 © 2009 IEEE

Authorized licensed use limited to: Tamil Nadu College of Engineering. Downloaded on March 15,2010 at 00:48:48 EDT from IEEE Xplore. Restrictions apply.


for a 512-kb memory macro in an industrial 65-nm technology. Power consumption results are extrapolated to $T_{WL} = 0$ to estimate the component of switching power due to the peripheral circuits. For normal operating conditions, array power consumption is more than 60% of the read power. Therefore, it is important to reduce the array switching power due to its strong impact on the memory's total power as well as the SoC's power. Several circuit techniques have been proposed to reduce the SRAM array switching power by reducing the WL pulsewidth. One of the most common techniques to control $T_{WL}$ is using a bit-cell replica path, which reduces the bitline differential, hence lowering power consumption [4], [6], [9]–[11]. Replica-path (e.g., self-timed) techniques provide a simple approach of process tracking for global variations (interdie or systematic within-die) as well as environmental variations (voltage and temperature). However, these circuit techniques are not efficient when memory bit cells experience large random variation, since circuit techniques cannot adapt to random variations. Therefore, their effectiveness decreases with process scaling, and larger design margins are used, which increases power consumption due to a larger $T_{WL}$. To reduce the loss due to excessive margining, circuits and architectures must be designed together to reduce power and manage variability. Higher levels of design abstraction can have better variation-tolerance capabilities because the impact of random variation can be measured at that level. Therefore, combining architecture techniques with circuit-level designs can reduce the pessimism of worst case approaches and can help adapt the circuit to random variations, which can reduce power consumption [3], [12], [13]. In this paper, we propose a new architecture that reduces SRAM switching power consumption by using fine-grained WL pulsewidth control.
The proposed solution combines a memory built-in self-test (BIST) with additional logic to reduce the switching power consumption of the memory. The proposed architecture helps recover the switching power consumed due to the worst case assumptions used in SRAM design to achieve a high yield. The proposed architecture utilizes infrastructure that is already available in SoCs, with very low area overhead. The rest of this paper is organized as follows. In Section II, we derive statistical models for memory read access yield and read power consumption, which show the tradeoff between these metrics. In Section III, we describe the proposed system and its operation. In Section IV, we present the statistical simulation flow and the power savings using the proposed system when applied to memories in an industrial 45-nm technology. In addition, we discuss some design considerations related to the proposed system. In Section V, we present our conclusions.

II. YIELD AND POWER TRADEOFF

Due to the random variation in SRAM bit cells, there is a tight coupling between memory yield and power consumption. To achieve high yield, read access failures should be minimized. Read access yield¹ is defined as the probability of a correct read operation. In a read operation, the selected WL is activated for a period of time to allow the bitlines to discharge. The WL activation time ($T_{WL}$) is a critical parameter for memory design since


Fig. 2. Typical SRAM architecture.

it affects the memory speed (access time) as well as the memory power. To reduce read access failures, the WL pulse should be wide enough to guarantee an adequate bitline differential, which can be sensed correctly using the sense amplifier (SA). The total power consumption for a memory in a read or write cycle can be expressed as

$$P_{total} = P_{leak} + P_{array} + P_{periph} \quad (1)$$

where $P_{leak}$ is the total leakage power from the array and the peripheral circuitry, and $P_{array}$ and $P_{periph}$ are the switching powers of the array and the peripheral circuitry, respectively (as shown in Fig. 2). In a read access, the array switching power can be calculated as

$$P_{array} = N_{BL} N_{WL} C_{BL} \Delta V_{BL} V_{DD} f \quad (2)$$

where $N_{BL}$ and $N_{WL}$ are the number of bitlines and WLs in a memory bank, respectively, $C_{BL}$ is the bitline capacitance per bit cell, $\Delta V_{BL}$ is the bitline differential in a read access (used to sense the bit cell's stored value), $V_{DD}$ is the supply voltage, and $f$ is the operating frequency. $\Delta V_{BL}$ can be calculated as

$$\Delta V_{BL} = \begin{cases} \dfrac{I_{read}\, T_{WL}}{N_{WL} C_{BL}} & \text{for } T_{WL} < \dfrac{N_{WL} C_{BL} V_{DD}}{I_{read}} \\[2mm] V_{DD} & \text{for } T_{WL} \ge \dfrac{N_{WL} C_{BL} V_{DD}}{I_{read}} \end{cases} \quad (3)$$

where $I_{read}$ is the bit-cell read current. To a first order, $\Delta V_{BL}$ can be approximated by assuming a linear dependence on $T_{WL}$ for the range where $\Delta V_{BL} < V_{DD}$, as shown in Fig. 1. Therefore, from (2) and (3), the array switching power can be computed as

$$P_{array} = \begin{cases} N_{BL}\, I_{read}\, T_{WL}\, V_{DD}\, f & \text{for } T_{WL} < \dfrac{N_{WL} C_{BL} V_{DD}}{I_{read}} \\[2mm] N_{BL} N_{WL} C_{BL} V_{DD}^2\, f & \text{for } T_{WL} \ge \dfrac{N_{WL} C_{BL} V_{DD}}{I_{read}} \end{cases} \quad (4)$$
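The piecewise model in (2)–(4) can be checked numerically; the following is a minimal sketch in which all parameter values (bank size, bitline capacitance, read current, supply, frequency) are illustrative assumptions, not values from the paper:

```python
# Sketch of the array switching power model in (2)-(4).
# All parameter values below are illustrative assumptions, not taken
# from the paper's 45-nm or 65-nm data.

def bitline_differential(i_read, t_wl, n_wl, c_bl, v_dd):
    """Delta V_BL from (3): linear in T_WL until the bitline swings to V_DD."""
    return min(i_read * t_wl / (n_wl * c_bl), v_dd)

def array_switching_power(i_read, t_wl, n_bl, n_wl, c_bl, v_dd, freq):
    """P_array from (2): N_BL * N_WL * C_BL * dV_BL * V_DD * f."""
    dv = bitline_differential(i_read, t_wl, n_wl, c_bl, v_dd)
    return n_bl * n_wl * c_bl * dv * v_dd * freq

# Illustrative 256 x 256 bank: 1 fF/cell, 50 uA read current, 1 V, 1 GHz.
p_short = array_switching_power(50e-6, 500e-12, 256, 256, 1e-15, 1.0, 1e9)
p_long = array_switching_power(50e-6, 1e-8, 256, 256, 1e-15, 1.0, 1e9)
# Below saturation, P_array grows linearly with T_WL; once dV_BL reaches
# V_DD, further increases in T_WL add no more switching power.
```

Below the saturation point, halving $T_{WL}$ halves the array switching power, which is the mechanism the proposed architecture exploits.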

¹In this paper, we use the term yield to refer to read access yield.

From (4), it is clear that $P_{array}$ is directly proportional to $T_{WL}$ (when $\Delta V_{BL} < V_{DD}$), which is confirmed by the read power results shown in Fig. 1. A correct read operation requires $\Delta V_{BL}$ to be large enough to guarantee correct sensing using the SA. Hence, a high yield implies having a sufficiently large $T_{WL}$ that enables weak bit cells (with low $I_{read}$) to be correctly sensed for a given yield requirement. Increasing $T_{WL}$ increases the read access yield; however, at the same time, increasing $T_{WL}$ increases power consumption, as shown in (4). Moreover, $T_{WL}$ has a direct impact on a memory's access time [4], [14]. Therefore, $T_{WL}$ is usually set to ensure a correct read operation for a given read access yield requirement and memory density. The bit-cell read current $I_{read}$ is strongly affected by the random variations in the bit-cell access device (pass gate) as well as the pull-down device. Due to these variations, $I_{read}$ has been shown to follow a Normal distribution with a mean of $\mu_{I_{read}}$ and a standard deviation of $\sigma_{I_{read}}$ [4]–[6], [14], [15]. Moreover, recent measurement results for a 1-Mb memory have confirmed that $I_{read}$ indeed follows a Normal distribution up to several standard deviations around the mean value [16]. It is noteworthy that, with technology and supply-voltage scaling and with the increase of $V_{th}$ variation, the device operation region may change from strong to moderate inversion. In the extreme case of the device working in the subthreshold region, the $I_{read}$ distribution will be lognormal due to the exponential dependence on $V_{th}$ [17]. Random within-die (WID) variations also have a strong impact on the SA offset [4], [14], [18]–[20], causing SAs to show input offset voltages that affect the accuracy of the read operation. In addition, systematic variations due to asymmetric layout can increase the SA input offset; that is why highly symmetric layouts are typically used for SAs [4], [14]. Moreover, due to the small differential signal developed on the SA inputs, an aggressor located near the SA may couple large noise onto the SA input, which can affect the accuracy of the read operation.
Nevertheless, by using layout noise shielding techniques and highly symmetric SA layout styles, the impact of this component can be minimized. Typically, the SA input offset due to random variations can be modeled using a Normal distribution [19]. However, due to the complexity of deriving a closed-form yield relation that statistically accounts for both $I_{read}$ and the SA offset, we treat the SA input offset with a worst case approach instead of a statistical one, as was used in [4], [5], and [15]. To guarantee a correct read operation, using a statistical treatment for $I_{read}$ and a worst case approach for the SA offset, the following condition should be satisfied [4], [5], [15]:

$$T_{WL} \ge \dfrac{N_{WL} C_{BL}\, \Delta V_{min}}{\mu_{I_{read}} - k\, \sigma_{I_{read}}} \quad (5)$$
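As a numerical illustration of (5) and the design coverage $k$ defined in (6), the bound can be evaluated with the Python standard library; the parameters here are normalized assumptions (the lumped constant $N_{WL} C_{BL} \Delta V_{min}$ is set to 1), not the paper's calibrated values:

```python
# Design coverage k from (6), k = Phi^{-1}(Y^(1/N)), and the worst-case
# T_WL bound from (5). Parameters are normalized, illustrative assumptions.
from statistics import NormalDist

def design_coverage(yield_target, n_cells):
    """Sigma coverage k so that all n_cells read correctly w.p. yield_target."""
    return NormalDist().inv_cdf(yield_target ** (1.0 / n_cells))

def worst_case_t_wl(k, mu_iread, rel_sigma, a=1.0):
    """T_WL bound from (5); a lumps N_WL * C_BL * dV_min (normalized to 1)."""
    return a / (mu_iread * (1.0 - k * rel_sigma))

k_1mb = design_coverage(0.95, 2**20)        # 1-Mb array, 95% yield target
k_48mb = design_coverage(0.95, 48 * 2**20)  # same yield, 48-Mb density
# Larger density -> larger k -> larger worst-case T_WL, hence more power.
t_1mb = worst_case_t_wl(k_1mb, 1.0, 0.10)   # 10% relative I_read variation
t_48mb = worst_case_t_wl(k_48mb, 1.0, 0.10)
```

This makes the density dependence concrete: going from 1 Mb to 48 Mb at the same yield target pushes $k$ higher and inflates the worst case $T_{WL}$ through the denominator of (5).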

Fig. 3. $I_{read}$ PDF showing the $3\sigma$, $4\sigma$, and $5\sigma$ $I_{read}$ points, which correspond to different memory yield targets (assumed $\sigma_{I_{read}}/\mu_{I_{read}} = 15\%$).

where $\Delta V_{min}$ is the minimum required bitline differential voltage, which is a function of the SA input offset and its immunity against coupling noise, $\mu_{I_{read}}$ is the mean bit-cell read current, $\sigma_{I_{read}}/\mu_{I_{read}}$ is the relative variation in $I_{read}$, and $k$ is the required design coverage, which is related to the target yield and the memory density [4], [15] and can be computed as

$$k = \Phi^{-1}\!\left(Y^{1/N}\right) \quad (6)$$

where $\Phi^{-1}$ is the inverse standard Normal cumulative distribution function, $Y$ is the memory read access yield target, and $N$ is the total number of bit cells in the memory. For example, for a 1-Mb memory, if the target read access yield is 95%, the required design coverage follows from (6). Therefore, to achieve the same yield for a larger memory density, $k$ should be increased. From (5), this means that a larger $T_{WL}$ is required.

It is important to note that the relation between $T_{WL}$ and $k$ is nonlinear, as shown in (3) and (5). In fact, assuming that $I_{read}$ follows a Normal distribution, the probability density function (PDF) of $T_{WL}$ can be calculated using a one-to-one mapping from (5) as follows [21] (details of the derivation are provided in Appendix A):

$$f_{T_{WL}}(t) = f_{I_{read}}\!\left(\dfrac{N_{WL} C_{BL}\, \Delta V_{min}}{t}\right) \dfrac{N_{WL} C_{BL}\, \Delta V_{min}}{t^2} \quad \text{for } t > 0 \quad (7)$$

where $f_{T_{WL}}$ is the PDF for $T_{WL}$ and $f_{I_{read}}$ is the PDF for $I_{read}$, which is a Normal distribution. Figs. 3 and 4 show the distributions of both the bit-cell $I_{read}$ and $T_{WL}$. Note that the $T_{WL}$ PDF is not symmetric; instead, it is skewed toward larger values. Moreover, the $3\sigma$, $4\sigma$, and $5\sigma$ $I_{read}$ values are shown along with the corresponding $T_{WL}$ values. It is clear that $T_{WL}$ is very sensitive to these variations. For $5\sigma$ $I_{read}$, $T_{WL}$ increases four times compared with its nominal value (calculated using $\mu_{I_{read}}$). Because of the skewed $T_{WL}$ distribution, large values of $T_{WL}$ are required to ensure an acceptable read access yield. Moreover, to achieve the same yield as the memory size increases, a higher $k$ coverage is required as shown in (6), which significantly increases $T_{WL}$ (due to the nonlinear relation between $T_{WL}$ and $I_{read}$). Therefore, due to statistical variations in the bit cell, $T_{WL}$


Fig. 5. Proposed architecture: Fine-grained WL pulsewidth control.

Fig. 4. $T_{WL}$ PDF, showing the points corresponding to $3\sigma$, $4\sigma$, and $5\sigma$ $I_{read}$. The $T_{WL}$ PDF is skewed toward higher $T_{WL}$ (assumed $\sigma_{I_{read}}/\mu_{I_{read}} = 15\%$).
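The skew of the $T_{WL}$ distribution illustrated in Fig. 4 can also be verified by sampling; a minimal Monte Carlo sketch with normalized, assumed quantities (only the 15% relative variation matches the figures' stated assumption):

```python
# Monte Carlo check of the skew in the T_WL distribution of (7):
# T_WL = A / I_read with I_read ~ Normal(mu, sigma). All quantities are
# normalized; the 15% relative sigma matches the figures' assumption.
import random

random.seed(0)
MU, REL_SIGMA = 1.0, 0.15     # normalized I_read, sigma/mu = 15%
A = 1.0                       # lumps N_WL * C_BL * dV_min (normalized)

t_wl = [A / random.gauss(MU, REL_SIGMA * MU) for _ in range(200_000)]
mean_t = sum(t_wl) / len(t_wl)
median_t = sorted(t_wl)[len(t_wl) // 2]
# Right skew: mean > median. The long tail toward large T_WL is what
# forces the pessimistic worst-case pulsewidth at design time.
```

The sample mean of $T_{WL}$ exceeds its median, confirming the right skew: weak-cell tails, not typical cells, dictate the worst case pulsewidth.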

should be pessimistically large to achieve the required yield target. This will negatively impact the dynamic power of memories that do not have weak bit cells (or have a small spread of $I_{read}$ around its mean value). Therefore, it is desirable to have a postsilicon approach that can recover the pessimism in determining $T_{WL}$.

III. FINE-GRAINED WL PULSEWIDTH CONTROL

As discussed in the previous section, the read switching power increases significantly due to process variations, because at design time we do not have information about which memories will have weak bit cells with low $I_{read}$'s. Therefore, a worst case approach is applied to determine $T_{WL}$ for a given memory density and yield requirement, as shown in (5). Memory BIST is nowadays an integral part of a SoC [1], [22]. Typically, a memory BIST engine is used to generate patterns to test a memory and detect memory failures. Based on the memory failure information, the unit under test is either discarded or repaired using memory redundancy. To reduce SoC repair cost, memory repair using built-in self-repair (BISR) is also used to enable soft repair instead of laser repair (hard repair) and without using automatic test equipment (ATE) [1], [22]. A postsilicon approach is required to recover the loss in switching power due to the worst case design practices used in SRAM design. Fig. 5 shows a conceptual view of the proposed architecture. The system is composed of three main components: the BIST, the "WL delay" block, and the "pulsewidth control" block. The memory BIST is used to test the memory functionality using different testing patterns. Each memory instance contains a WL delay block, which is a digitally programmable delay element that controls the WL pulsewidth $T_{WL}$. This adjustment of $T_{WL}$ is achieved by adding the programmable delay element in the disable path of the WL. The delay element is controlled using a digital code provided by the pulsewidth control logic [23], [24]. The pulsewidth control logic increases or decreases the digital code based on the BIST result. Therefore, the proposed system creates a closed feedback loop between the BIST, the WL pulsewidth control, and the memory internal timing, which can be used to reduce the power consumption, as explained in the following.

Next, we present more details on the three main components of the proposed system: the BIST, WL delay, and pulsewidth control blocks.

A. SRAM BIST

SRAMs are prone to different types of failure (catastrophic and parametric). To address yield problems in SRAMs, large arrays are provided with redundant rows and/or columns, which can be used to replace defective bit cells, hence repairing the memory. To enable the repair, the memory should be tested first to locate the defective bit cells. However, since embedded memories lack direct access to chip input/output signals, embedded memory testing becomes complicated and time consuming [25]. Memory BIST is nowadays an integral part of SoCs [1], [22]. A memory BIST engine is used to generate patterns to test a memory and detect memory failures. Typical data patterns include checkerboard, solid, row stripe, column stripe, double row stripe, and double column stripe. Different test algorithms are used to cover different types of faults; one of the most widely used algorithms is the MARCH test [22]. After testing the memory, the unit under test is either discarded or repaired using memory redundancy (depending on the failure information). BIST not only reduces testing time (and cost) but also allows testing the memory at the actual clock speed (at-speed testing). Moreover, BIST can be used to enable BISR, where the BISR logic analyzes the failing addresses from the BIST and generates a failure bitmap for the memory. The BIST then uses the failure bitmap to replace failing bit cells using the redundant rows and columns. BISR is also used to enable soft repair instead of laser repair (hard repair) and without using ATE, which further reduces testing cost [1], [22]. Memory self-test can be performed every time the chip is reset (in power-up test mode).

B. WL Programmable Delay Elements

WL and SA timing are of utmost importance for a correct read operation. The timer block shown in Fig. 2 is responsible for generating these critical internal timing signals. For SRAM postsilicon debugging purposes and yield learning, programmable delay elements are used to control the internal timing for the WL and SA [23], [24], [26], [27]. For example, these delay elements are used to characterize the margin in the bitline differential voltage. Moreover, they are used in relaxing address

Fig. 6. One type of programmable delay element [23].

setup time requirements by delaying the clock edge which starts an access [23]. Fig. 6 shows one type of programmable delay element [23], [26].

C. Pulsewidth Control Logic

The pulsewidth logic is the control logic for the proposed architecture. It can be as simple as a digital counter, which varies the digital code for the delay element depending on the output of the BIST. It can also be implemented in software. For example, a programmable processor can be used to control both the BIST and the programmable delay elements. Fortunately, modern SoCs include programmable processors that can be used at start-up time to test and verify the operation of other modules [25], [28].

D. System Operation

Fig. 7 shows the operation of the proposed system. Initially, the pulsewidth control logic provides the initial code for the memory. This initial digital code corresponds to the worst case $T_{WL}$ required for a given yield requirement, which is determined at design time [using (5)]. The BIST tests the memory instance using the initial code. If the memory fails, it may require repair or it may be discarded, which is a typical BIST testing sequence. However, if the memory passes the BIST testing using the initial code (or after repair), the BIST signals the pulsewidth control logic to reduce the digital code, hence reducing $T_{WL}$ using the programmable delay element. Using the new digital code for $T_{WL}$, the BIST tests the memory once again, and this process of BIST and $T_{WL}$ reduction is repeated until the memory fails in the read operation. The last passing code is then stored in the built-in registers inside the memory instance. If the lowest code is reached without the memory failing, that code is also stored, and the operation is terminated. The aforementioned steps are repeated for all the memories in the chip. Hence, the proposed architecture reduces $T_{WL}$ for all memories in which the bit cells have sufficient $I_{read}$ to ensure a correct read operation.
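The tuning loop described above (and shown in Fig. 7) can be sketched in a few lines; here `bist_passes` is a hypothetical stand-in for running the real BIST at a given delay code:

```python
# Sketch of the pulsewidth-control loop of Fig. 7 with a mocked-up BIST:
# start from the worst-case code and step down until the first read failure,
# then keep the last passing code. bist_passes() is a hypothetical stand-in
# for the real BIST run; codes and thresholds are illustrative.

def tune_wl_code(initial_code, min_code, bist_passes):
    """Return the lowest T_WL delay code that still passes BIST."""
    if not bist_passes(initial_code):
        return None                  # fails at worst case: repair or discard
    code = initial_code
    while code > min_code and bist_passes(code - 1):
        code -= 1                    # shrink T_WL by one delay step
    return code                      # last passing code, stored in registers

# Mock BIST: this instance's weakest cell needs a code of at least 9.
final = tune_wl_code(initial_code=15, min_code=0,
                     bist_passes=lambda c: c >= 9)
# final == 9
```

A counter-based hardware implementation or a start-up software routine on an on-chip processor would follow the same control flow.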
Therefore, by reducing $T_{WL}$, the read power of the memories can be reduced. This operation can be part of the system testing or power-up, where the final codes for each memory can be stored in built-in registers or burned into fuses.

IV. RESULTS AND DISCUSSION

To test the power savings using the proposed architecture, Monte Carlo simulations are used to capture the impact of device variation on the bit-cell $I_{read}$ and the corresponding $T_{WL}$. Simulations are performed using a 1-Mb macro from an industrial

Fig. 7. Flowchart for the operation of the proposed ﬁne-grained WL pulsewidth control system.

45-nm technology. The 1-Mb macro uses a replica path to reduce power consumption and improve process tracking [4], [9]–[11]. Hardware-correlated bit-cell statistical models² are used to compute $\mu_{I_{read}}$ and $\sigma_{I_{read}}$, which are used in the simulation flow shown in the following. Postlayout switching power simulations were used to measure the power versus $T_{WL}$ dependence, as shown in Fig. 1. In SoC design, typically, a high-density macro (on the order of 512 kb or 1 Mb) is used as a building block for larger size memories. Therefore, in our simulations, we assume that the minimum memory macro size is 1 Mb, and multiple instances of that macro are used to realize larger memories. To estimate the power saving using the proposed system, a Monte Carlo simulation flow has been developed as follows. 1) For every memory instance in the chip, generate samples of a Normal distribution with a mean of $\mu_{I_{read}}$ and a standard deviation of $\sigma_{I_{read}}$ to represent the read current variation in the macro. 2) Find the minimum current in each memory instance and compute $T_{WL}$ from the information of the weakest bit cell, which has the lowest $I_{read}$ [using (3)]. Therefore, $T_{WL}$ represents the minimum WL pulsewidth that guarantees

²The statistical models are provided by the foundry, and the statistical data are extracted from measuring a large sample of bit cells as shown in [16]. Random WID variations are reflected into the Spice model by varying $V_{th}$, $W$, $L$, and other parameters of the bit-cell devices according to the measured statistics.


Fig. 8. Power reduction using the proposed architecture versus chip-level memory density. Different values of yield target are shown.

Fig. 9. Power reduction using the proposed architecture versus yield target for two cases of chip-level density: (a) 1 and (b) 48 Mb.

a correct read operation for that instance. This value of $T_{WL}$ should be automatically determined by the proposed system, since the WL control block shown in Fig. 5 will reduce $T_{WL}$ until that memory instance fails a read operation. 3) Using (4), calculate the power consumption of that memory instance. 4) Repeat all the aforementioned steps for all memory instances on that chip and estimate the total read power for the memories on that chip.³ 5) Repeat all the aforementioned steps for a large number of chips and get the average power. 6) From the chip-level yield requirement and the total memory density, calculate the design coverage $k$ using (6), and find the worst case $T_{WL}$. This is the value of $T_{WL}$ which would have been used for all memory instances if the proposed architecture were not enabled. For this worst case $T_{WL}$, the corresponding power consumption can be calculated using (4). 7) From the last two steps, calculate the power reduction using the proposed architecture. Fig. 8 shows the power reduction achieved using the proposed architecture for different memory densities and yield targets. Note that these yield values represent the intrinsic yield before applying any repair or correction [i.e., redundancy or error-correction code (ECC)]. As the memory density increases, the power saving increases, which shows the effectiveness of the proposed system, particularly if a high yield target is required (as in high-volume low-cost products). Array switching power consumption can be reduced by between 15% and 35% for a 48-Mb memory density, depending on the yield target. Fig. 9 shows the achievable power saving versus the memory yield target. For a 1-Mb memory density, the array switching power saving can be as high as 15% for a yield of 99%. From Figs. 8 and 9, it is clear that the proposed architecture reduces array switching power significantly. As the SRAM content increases in a SoC,

³Here, we assume that all memories are accessed simultaneously. Hence, we add the individual read power of each memory. For our analysis, this is a fair assumption since we do not make any assumptions related to how the system accesses these memories. Nevertheless, the same simulation flow can be used if the switching activity for each memory is known.
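The flow in steps 1)–7) can be condensed into a small Monte Carlo sketch. The cell counts below are deliberately scaled down so it runs quickly, and all parameters (normalized currents, 10% variation, the resulting coverage $k$) are illustrative assumptions rather than the paper's calibrated 45-nm values:

```python
# Scaled-down sketch of the Monte Carlo flow in steps 1)-7).
# Power is modeled as proportional to T_WL, per (4), and each instance's
# tuned T_WL is proportional to 1 / I_min, per (3) and (5); the shared
# constants cancel when forming the power ratio.
import random
from statistics import NormalDist

random.seed(1)
MU, SIGMA = 1.0, 0.10          # normalized I_read, 10% relative variation
CELLS = 2**14                  # scaled-down cells per controllable instance
INSTANCES = 48                 # instances per chip
YIELD_TARGET = 0.95

# Step 6): worst-case design coverage for the whole chip from (6).
k = NormalDist().inv_cdf(YIELD_TARGET ** (1.0 / (INSTANCES * CELLS)))
t_worst = 1.0 / (MU - k * SIGMA)           # one pessimistic T_WL for all

# Steps 1)-4): per-instance tuned T_WL from each instance's weakest cell.
t_tuned = 0.0
for _ in range(INSTANCES):
    i_min = min(random.gauss(MU, SIGMA) for _ in range(CELLS))
    t_tuned += 1.0 / i_min                 # this instance's T_WL ~ 1/I_min

saving = 1.0 - t_tuned / (INSTANCES * t_worst)   # step 7): power reduction
```

Even at this reduced scale, the sketch reproduces the qualitative result: most instances do not contain a chip-worst-case weak cell, so per-instance tuning recovers a sizable fraction of the worst case switching power.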

Fig. 10. Power reduction using the proposed architecture versus chip-level memory density for different values of $I_{read}$ variation for a yield target of 95%.

it is expected that a higher power saving can be achieved using the proposed architecture. SRAM bit cells show strong sensitivity to device variations, which increase significantly with scaling. To investigate the impact of variations on the power, we increase the $I_{read}$ variation $\sigma_{I_{read}}/\mu_{I_{read}}$ from 10% to 12.5% [29]. Fig. 10 shows the power reduction for these two cases of variation. It is obvious that significant power savings can be achieved. For the 48-Mb case, the power saving increases by more than two times, from 25% to 55% (even though the variation only increased from 10% to 12.5%). This shows the effectiveness of the proposed architecture, which makes it attractive for power reduction in the presence of large process variations, which are expected to worsen with technology scaling. For the previous simulation results, we assumed that a 1-Mb memory is the minimum memory instance size that can have a specific $T_{WL}$. However, in memory design, multiple subarrays (i.e., banks) are used to implement a single memory instance, as shown in Fig. 5. Typically, a timer circuit is used per subarray; hence, the fine-grained concept can be further applied at the subarray level. This requires adding the digitally controlled delay element per subarray and the capability of storing the digital code per subarray. Nevertheless, the area overhead can still be very small, since the additional area is amortized over


Fig. 11. Power reduction using the proposed architecture versus chip-level memory density for different values of the minimum-controlled memory instance (or subbanks) for a yield target of 95%.

the large size of the memory macro. To evaluate the benefits of $T_{WL}$ control at the subarray level, we assume that we have 16 subarrays in the 1-Mb macro. Each subarray is composed of 256 bitlines by 256 rows. Therefore, using the proposed architecture, $T_{WL}$ can be adjusted for a 64-kb block. Fig. 11 shows the achieved power saving for the 64-kb block size as well as the 1-Mb full macro. By adjusting $T_{WL}$ at the subarray level, the power saving increases from 24% to 42% for the same 48-Mb chip-level memory density (a 1.75 times improvement). Furthermore, for the 1-Mb chip-level memory density, the power saving increases from 7.5% to 20% (a 2.67 times improvement). This shows the importance of reducing the size of the memory block whose $T_{WL}$ can be individually controlled, as this significantly reduces the array switching power consumption. The proposed architecture reduces power consumption by reducing the WL pulsewidth for memories which have enough margin for a read operation. It is important to investigate how reducing $T_{WL}$ will affect the system timing for both setup and hold time requirements. Reducing $T_{WL}$ also reduces the access time of the memory, which implies that the logic delay through the memory is now shorter. Therefore, the setup time requirement will be automatically met. Reducing $T_{WL}$, however, may cause hold violations. This situation can be easily resolved at design time by using the lowest expected $T_{WL}$ for hold time verification (i.e., using the minimum $T_{WL}$ that the programmable delay element provides). In the aforementioned analysis, the discrete quantization effect on power savings is not considered. In reality, since we are using a digitally controlled delay element, $T_{WL}$ can only take discrete values. However, this may not have a significant impact on the shown results, since small-area and low-power delay elements can

cover a large range of delays with fine control [23], [24]. In addition, by using the Monte Carlo simulation flow described previously, we can determine the range of $T_{WL}$ with the highest probability of occurring and modify the delay elements at design time to have enough steps in that region. It is important to note that the delay elements do not add extra area, as they are typically used in memories for debugging purposes [23], [24], [27]. While the proposed system accounts for process variations, which are static in nature, it cannot adapt to environmental variations such as voltage or temperature variation [13] due to their dynamic nature. This is because $T_{WL}$ will be fixed after running the pulsewidth control system at start-up. Hence, environmental variations may cause the memory to fail if they reduce the sensing margin. This problem can be addressed at design time, by ensuring that the minimum step size for the delay control provides sufficient margin for voltage and temperature variations. Moreover, if self-timed memories are used, the replica path will provide efficient tracking of environmental variations [4], [9]–[11]. In addition, in product testing, low-voltage screening can be used in power-up to set $T_{WL}$, which guarantees a sufficient margin for environmental variations, since the product will operate at a supply voltage typically larger than the low-voltage test condition. It is important to note that, in addition to environmental variations, transistor aging mechanisms such as hot carrier injection, time-dependent dielectric breakdown, and negative bias temperature instability cause device characteristics to shift with time, which increases gate delay [30], [31]. To prevent the circuit from failing due to device aging, additional guard banding (in delay or in the minimum supply voltage) is required, which may lower the expected power savings of the proposed architecture.
For an accurate estimate of the power savings including device aging, reliability models for these effects need to be included in the proposed simulation flow. In this paper, we presented the analysis and results only for read power reduction. Nevertheless, the proposed system can be extended to reduce the switching power in write operations. In a write operation, one side of the bitline is pulled down to zero, while the other side is held at $V_{DD}$ using write drivers. This operation is enabled for the columns which are selected for write (using the column multiplexer). However, half-selected bit cells (bit cells on the same WL which are not accessed for the write operation) still experience a dummy read operation, and the bitline discharges through $I_{read}$ for a period determined by $T_{WL}$. The array switching power for a write operation can therefore be calculated as shown in (8) at the bottom of the page, where $IO$ is the input/output data width of the memory. The term including $IO$ in (8) refers to the power consumption in the bitlines selected for the write operation, which are pulled down to zero using the write drivers (not through the bit-cell read current $I_{read}$; therefore, there is no dependence on $T_{WL}$).

[Equation (8): array switching power for the write operation.]
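To make the structure of the selected/half-selected split concrete, the following sketch evaluates a write-power expression of the form described here. The function and all parameter values are illustrative assumptions, not the paper's exact expression (8): selected columns pay a full-swing C·V² term independent of the WL pulsewidth, while half-selected columns pay a term proportional to the read current times the WL pulsewidth.

```python
def write_array_switching_power(f_clk, vdd, c_bl, n_cols, n_io, i_read, t_wl):
    """Split the write array switching power into two terms (illustrative).

    Selected columns (n_io of them): one bitline is pulled to 0 V by the
    write driver, so the energy is a full C*V^2 term per access that does
    not depend on t_wl.
    Half-selected columns (n_cols - n_io): the cell sees a dummy read,
    and the bitline discharges by i_read * t_wl / c_bl.
    """
    p_selected = f_clk * n_io * c_bl * vdd * vdd
    dv_half = min(i_read * t_wl / c_bl, vdd)   # swing cannot exceed vdd
    p_half = f_clk * (n_cols - n_io) * c_bl * vdd * dv_half
    return p_selected, p_half
```

With a large column-mux ratio (n_cols much larger than n_io), the half-selected term dominates, and shrinking t_wl reduces it proportionally while leaving the selected-column term unchanged.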



The second term in (8), which includes the WL pulsewidth, is the power consumption in the half-selected bitlines during the write operation. Array switching power due to half-selected bit cells can contribute significantly to the total write power, particularly in high-density memories with a small I/O width and a large column-muxing ratio. Therefore, the proposed architecture can be used to reduce the WL pulsewidth in write operations, which reduces the power consumption of the half-selected bit cells. However, as the WL pulsewidth is reduced, the bit cell becomes more susceptible to write failures. A write failure happens when the cell fails to store the desired value during the write operation [5], [15]. Therefore, the proposed system can reduce the WL pulsewidth only until a write failure occurs. To evaluate the power savings in this case, a statistical simulation flow that includes models for write failure is required. To implement both read and write power reduction, additional logic is required to control the WL pulsewidth for read and write separately, as shown recently in [27]. Nevertheless, the area penalty can be negligible due to the small size of the digital delay elements. Moreover, as mentioned earlier, these delay elements are typically used in memories for debugging purposes [23], [24], [27]. In addition to reducing power, it is worth noting that reducing the WL pulsewidth (for read or write operations) has another benefit: the reduction of read-disturb failures [32]–[35]. A read disturb happens when a bit cell flips its stored data while it is accessed (or half-selected). As the WL pulsewidth becomes smaller, the bit cell has less time to flip, which reduces the read-disturb-failure probability [32]–[35]. While, in this work, we focused on read access yield only, it would be interesting to evaluate the reduction in read disturb using the proposed system, which will be a point of future research.

V. CONCLUSION

Array switching power is one of the largest components of power consumption in high-density SRAM.
Moreover, with the large increase in process variation, SRAMs are expected to consume even larger array switching power to ensure correct read operations and meet yield targets. In this paper, we proposed a new architecture that significantly reduces the array switching power. The proposed architecture combines BIST and digitally controlled delay elements to reduce the WL pulsewidth for memories while ensuring correct read operations, hence reducing the switching power. Combining both architecture and circuit techniques enables the system to detect weak bit cells using the BIST and adjust the WL pulsewidth accordingly. Therefore, the proposed architecture recovers the excess power consumption, since it tests each memory individually and adjusts the WL pulsewidth to ensure a correct read operation. A new simulation flow was developed to evaluate the power savings of the proposed architecture. Monte Carlo simulations using a 1-Mb SRAM macro from an industrial 45-nm technology were used to examine the power reduction achieved by the system. The proposed architecture can reduce the array switching power significantly and shows large power savings, particularly as the chip-level memory density increases. For a 48-Mb memory density, a 27% reduction in array switching power can be achieved for a read access yield target of 95%. In addition, the proposed system can provide larger power savings as process variations increase, which makes it a very attractive solution for 45-nm-and-below technologies.

APPENDIX
EXPRESSION FOR THE PDF OF THE WL PULSEWIDTH

In this appendix, we derive the PDF of the WL pulsewidth T_WL given in (7). Given a function y = g(x), where the PDF of x is known, the PDF of y can be expressed as [21]

    f_y(y) = Σ_n f_x(x_n) / |g'(x_n)|                                   (9)

where f_y(y) and f_x(x) are the PDFs of y and x, respectively, g'(x) is the derivative of g(x), and the x_n are the real roots of the equation y = g(x). Therefore, to determine the PDF of T_WL given the PDF of the bit-cell read current I_READ, we begin with the relationship between T_WL and I_READ

    T_WL = g(I_READ) = C_BL ΔV_BL / I_READ                             (10)

and the derivative with respect to I_READ

    g'(I_READ) = −C_BL ΔV_BL / I_READ²                                 (11)

Equation (10) has a single real root, x_1 = C_BL ΔV_BL / T_WL. Using (9), (10), and (11)

    f_TWL(t) = f_IREAD(x_1) / |g'(x_1)|                                (12)

    |g'(x_1)| = C_BL ΔV_BL / x_1² = t² / (C_BL ΔV_BL)                  (13)

Therefore

    f_TWL(t) = (C_BL ΔV_BL / t²) f_IREAD(C_BL ΔV_BL / t),  for t > 0   (14)

where f_IREAD is the PDF of I_READ, which is a Normal distribution.
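The change-of-variables result can be sanity-checked numerically. The sketch below is an illustration under assumed values: it draws the read current from a Normal distribution, forms the pulsewidth as a constant K divided by the current (K standing in for the constant relating pulsewidth and read current), and compares a Monte Carlo density estimate against the analytic reciprocal-of-Normal PDF f_T(t) = (K/t²)·f_I(K/t).

```python
import math
import random

def normal_pdf(x, mu, sigma):
    # PDF of a Normal(mu, sigma) random variable.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def t_wl_pdf(t, k, mu, sigma):
    """Analytic PDF of T = k / I for I ~ Normal(mu, sigma)."""
    return (k / t ** 2) * normal_pdf(k / t, mu, sigma)

def empirical_density(samples, t, half_width):
    """Fraction of samples in [t - hw, t + hw], divided by the bin width."""
    hits = sum(1 for s in samples if abs(s - t) <= half_width)
    return hits / (len(samples) * 2 * half_width)

random.seed(0)
k, mu, sigma = 1.0, 5.0, 0.5          # illustrative values only
samples = [k / random.gauss(mu, sigma) for _ in range(200_000)]

# The empirical and analytic densities should agree closely near t = k/mu.
t = k / mu
print(empirical_density(samples, t, 0.005), t_wl_pdf(t, k, mu, sigma))
```

Because mu is ten standard deviations above zero here, negative current samples are vanishingly unlikely and the reciprocal transform is well behaved over the sampled range.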

REFERENCES

[1] Y. Zorian and S. Shoukourian, "Embedded-memory test and repair: Infrastructure IP for SoC yield," IEEE Des. Test Comput., vol. 20, no. 3, pp. 58–66, May/Jun. 2003.
[2] M. Pelgrom, H. Tuinhout, and M. Vertregt, "Transistor matching in analog CMOS applications," in IEDM Tech. Dig., 1998, pp. 915–918.
[3] T.-C. Chen, "Where is CMOS going: Trendy hype versus real technology," in Proc. ISSCC, 2006, pp. 22–28.
[4] R. Heald and P. Wang, "Variability in sub-100 nm SRAM designs," in Proc. Int. Conf. Comput. Aided Des., 2004, pp. 347–352.
[5] K. Agarwal and S. Nassif, "Statistical analysis of SRAM cell stability," in Proc. 43rd DAC, 2006, pp. 57–62.
[6] R. Venkatraman, R. Castagnetti, and S. Ramesh, "The statistics of device variations and its impact on SRAM bitcell performance, leakage and stability," in Proc. ISQED, 2006, pp. 190–195.
[7] M. Q. Do, M. Drazdziulis, P. Larsson-Edefors, and L. Bengtsson, "Parameterizable architecture-level SRAM power model using circuit-simulation backend for leakage calibration," in Proc. ISQED, 2006, pp. 557–563.
[8] A. Macii, L. Benini, and M. Poncino, Memory Design Techniques for Low Energy Embedded Systems. Norwell, MA: Kluwer, 2002.
[9] B. Amrutur and M. Horowitz, "A replica technique for wordline and sense control in low-power SRAM's," IEEE J. Solid-State Circuits, vol. 33, no. 8, pp. 1208–1219, Aug. 1998.
[10] K. Osada, J.-U. Shin, M. Khan, Y.-D. Liou, K. Wang, K. Shoji, K. Kuroda, S. Ikeda, and K. Ishibashi, "Universal-Vdd 0.65–2.0 V 32 kB cache using voltage-adapted timing-generation scheme and a lithographical-symmetric cell," in Proc. ISSCC, 2001, pp. 168–169.


[11] M. Yamaoka, N. Maeda, Y. Shinozaki, Y. Shimazaki, K. Nii, S. Shimada, K. Yanagisawa, and T. Kawahara, "Low-power embedded SRAM modules with expanded margins for writing," in Proc. ISSCC, 2005, vol. 1, pp. 480–611.
[12] D. Marculescu and E. Talpes, "Variability and energy awareness: A microarchitecture-level perspective," in Proc. 42nd DAC, 2005, pp. 11–16.
[13] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, "Parameter variations and impact on circuits and microarchitecture," in Proc. 40th DAC, 2003, pp. 338–342.
[14] M. H. Abu-Rahma, K. Chowdhury, J. Wang, Z. Chen, S. S. Yoon, and M. Anis, "A methodology for statistical estimation of read access yield in SRAMs," in Proc. 45th DAC, 2008, pp. 205–210.
[15] S. Mukhopadhyay, H. Mahmoodi, and K. Roy, "Statistical design and optimization of SRAM cell for yield enhancement," in Proc. Int. Conf. Comput. Aided Des., 2004, pp. 10–13.
[16] T. Fischer, C. Otte, D. Schmitt-Landsiedel, E. Amirante, A. Olbrich, P. Huber, M. Ostermayr, T. Nirschl, and J. Einfeld, "A 1 Mbit SRAM test structure to analyze local mismatch beyond 5 sigma variation," in Proc. IEEE ICMTS, 2007, pp. 63–66.
[17] R. Rao, A. Srivastava, D. Blaauw, and D. Sylvester, "Statistical analysis of subthreshold leakage current for VLSI circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 2, pp. 131–139, Feb. 2004.
[18] P. Kinget, "Device mismatch and tradeoffs in the design of analog circuits," IEEE J. Solid-State Circuits, vol. 40, no. 6, pp. 1212–1224, Jun. 2005.
[19] B. Wicht, T. Nirschl, and D. Schmitt-Landsiedel, "Yield and speed optimization of a latch-type voltage sense amplifier," IEEE J. Solid-State Circuits, vol. 39, no. 7, pp. 1148–1158, Jul. 2004.
[20] S. Mukhopadhyay, K. Kim, K. Jenkins, C.-T. Chuang, and K. Roy, "Statistical characterization and on-chip measurement methods for local random variability of a process using sense-amplifier-based test structure," in Proc. ISSCC, Feb. 2007, pp. 400–611.
[21] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed. New York: McGraw-Hill, 1991.
[22] R. Rajsuman, "Design and test of large embedded memories: An overview," IEEE Des. Test Comput., vol. 18, no. 3, pp. 16–27, May 2001.
[23] W. Kever, S. Ziai, M. Hill, D. Weiss, and B. Stackhouse, "A 200 MHz RISC microprocessor with 128 kB on-chip caches," in Proc. ISSCC, Feb. 1997, pp. 410–411.
[24] Y. H. Chan, T. J. Charest, J. R. Rawlins, A. D. Tuminaro, J. K. Wadhwa, and O. M. Wagner, "Programmable sense amplifier timing generator," U.S. Patent 6 958 943, Oct. 25, 2005.
[25] A. Chandrakasan, W. J. Bowhill, and F. Fox, Design of High-Performance Microprocessor Circuits. New York: Wiley-IEEE Press, 2000.
[26] M. Min, P. Maurine, M. Bastian, and M. Robert, "A novel dummy bitline driver for read margin improvement in an eSRAM," in Proc. IEEE Int. Symp. DELTA, Jan. 2008, pp. 107–110.
[27] R. Joshi, R. Houle, D. Rodko, P. Patel, W. Huott, R. Franch, Y. Chan, D. Plass, S. Wilson, S. Wu, and R. Kanj, "A high performance 2.4 Mb L1 and L2 cache compatible 45 nm SRAM with yield improvement capabilities," in Proc. IEEE Symp. VLSI Circuits, Jun. 2008, pp. 208–209.
[28] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[29] H. Yamauchi, "Embedded SRAM circuit design technologies for a 45 nm and beyond," in Proc. 7th Int. Conf. ASIC, Oct. 2007, pp. 1028–1033.
[30] K. Kang, S. P. Park, K. Roy, and M. A. Alam, "Estimation of statistical variation in temporal NBTI degradation and its impact on lifetime circuit performance," in Proc. IEEE/ACM ICCAD, 2007, pp. 730–734.
[31] X. Li, J. Qin, B. Huang, X. Zhang, and J. Bernstein, "SRAM circuit-failure modeling and reliability simulation with SPICE," IEEE Trans. Device Mater. Rel., vol. 6, no. 2, pp. 235–246, Jun. 2006.
[32] M. Khellah, Y. Ye, N. Kim, D. Somasekhar, G. Pandya, A. Farhang, K. Zhang, C. Webb, and V. De, "Wordline and bitline pulsing schemes for improving SRAM cell stability in low Vcc 65 nm CMOS designs," in Proc. IEEE Symp. VLSI Circuits, 2006, pp. 9–10.

[33] S. Ikeda, Y. Yoshida, K. Ishibashi, and Y. Mitsui, "Failure analysis of 6 T SRAM on low-voltage and high-frequency operation," IEEE Trans. Electron Devices, vol. 50, no. 5, pp. 1270–1276, May 2003.
[34] J. B. Khare, A. B. Shah, A. Raman, and G. Rayas, "Embedded memory field returns—Trials and tribulations," in Proc. IEEE ITC, Oct. 2006, pp. 1–6.
[35] M. Yamaoka, K. Osada, and T. Kawahara, "A cell-activation-time controlled SRAM for low-voltage operation in DVFS SoCs using dynamic stability analysis," in Proc. 34th ESSCIRC, Sep. 2008, pp. 286–289.

Mohamed H. Abu-Rahma (S’01–M’09) received the B.Sc. degree (with honors) in electronics and communication engineering from Ain Shams University, Cairo, Egypt, in 2001, the M.Sc. degree in electronics and communication engineering from Cairo University, Cairo, Egypt, in 2004, and the Ph.D. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 2009. From 2001 to 2004, he was with Mentor Graphics, Egypt, where he worked on MOSFET compact model development and extraction. He has been with Qualcomm Incorporated, San Diego, CA, since 2005, where he has been engaged in the research and development of low-power embedded SRAM and CMOS circuits. He is the inventor/coinventor of six pending patents. His research interests include low-power digital circuits, variation-tolerant memory design, and statistical design methodologies.

Mohab Anis (S’98–M’03) received the B.Sc. degree (with honors) in electronics and communication engineering from Cairo University, Cairo, Egypt, in 1997 and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1999 and 2003, respectively. He is currently an Assistant Professor and the Codirector of the VLSI Research Group, Department of Electrical and Computer Engineering, University of Waterloo. He is the Cofounder of Spry Design Automation. He has authored/coauthored over 60 papers in international journals and conferences and is the author of the book Multi-Threshold CMOS Digital Circuits—Managing Leakage Power (Kluwer, 2003). He is an Associate Editor of the Journal of Circuits, Systems and Computers, the American Scientiﬁc Publishers Journal of Low Power Electronics, and the Journal of VLSI Design. His research interests include integrated circuit design and design automation for VLSI systems in the deep submicrometer regime. Dr. Anis is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–II. He is also a member of the program committee for several IEEE conferences. He was the recipient of the 2004 Douglas R. Colton Medal for Research Excellence in recognition of his excellence in research leading to new understanding and novel developments in microsystems in Canada and won the 2002 International Low-Power Design Contest.

Sei Seung Yoon received the B.S.E.E. degree from Yonsei University, Korea, in 1988. He joined the Qualcomm Memory Design Team, San Diego, CA, in 2004. Prior to joining Qualcomm, he gained more than 16 years of experience at Micron Technology, T-RAM, and Samsung as a memory circuit designer. He is currently working on low-power and high-performance embedded SRAM design. He holds more than 20 U.S. patents.
