5V (nominal), 27C
Figures:

Delivery Item             | Value
Metric (mWatts2*ns*mm2)   | 179633.89
Bit cell area (mm2)       | .000498
Total area (mm2)          | .524
Power (mW)                | 68.916
Read delay (ns)           | 72.18
Write delay (ns)          | 38.39
Worst-case delay (ns)     | 72.18
Calculations:

Calculation 1: Modeled Resistance and Capacitance on Array Model
Bit Cell Area for Entire SRAM (1024 x 1024 bit cells):
28569.6um x 18278.4um = .522mm2
(this does not include periphery)
For periphery:
Take the average area of a transistor (83pm2) and multiply by the approximate number of transistors added for the periphery (see Calculation 4):
83pm2 * (10752 + 2058 + 1152 + 2048 + 10752) = 2.22um2
-> assuming the periphery adds about 2.22um2, the total area is .524mm2
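The periphery transistor count above can be sanity-checked with a short script. The per-block counts are taken from the text as quoted, and the 83 average-transistor-area figure is used as written, without interpreting its unit:

```python
# Sum the approximate transistor counts added by the periphery
# (counts as quoted from Calculation 4 in the text).
periphery_counts = [10752, 2058, 1152, 2048, 10752]
total_transistors = sum(periphery_counts)
print(total_transistors)  # 26762 periphery transistors

# Area estimate per the source: average transistor area (83, as quoted)
# times the transistor count; the source scales this to about 2.22 (um2).
area_units = 83 * total_transistors
print(area_units)  # 2221246, i.e. about 2.22 million area units
```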
Power:
Write power: 1.819mW
Read power: 2.288mW
(both are average power values for a read/write to one bit cell in a whole array)
Delay:
Worst case read delay: 499.06ns - 428.79ns = 70.27ns; 70.27ns - .694ns (delay of 1:2 decode) + 2.6ns (delay of 8:256 decode) = 72.18ns (slowest delay)
Worst case write delay: 403.5ns - 367.01ns = 36.49ns; 36.49ns - .694ns (delay of 1:2 decode) + 2.6ns (delay of 8:256 decode) = 38.39ns
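The two worst-case delay figures can be reproduced from the measured times; a small sketch with all numbers taken from the text:

```python
# Worst-case delays reconstructed from the measured times (all in ns).
decode_1to2 = 0.694   # delay of the 1:2 decode, subtracted out
decode_8to256 = 2.6   # delay of the 8:256 decode, added in

read_delay = (499.06 - 428.79) - decode_1to2 + decode_8to256
write_delay = (403.5 - 367.01) - decode_1to2 + decode_8to256

print(round(read_delay, 2))   # 72.18 ns, the slowest (worst-case) delay
print(round(write_delay, 2))  # 38.4 ns
```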
Metric calculation:
For Read
For Write
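The metric's units (mWatts2*ns*mm2) imply power squared times delay times area; assuming that reading, the headline figure in the table reproduces from the delivered power, worst-case delay, and total area:

```python
# Metric = power^2 * delay * area, inferred from its units (mW^2 * ns * mm^2).
# Inputs are the delivered figures quoted in the Figures table.
power_mw = 68.916   # delivered power, mW
delay_ns = 72.18    # worst-case (read) delay, ns
area_mm2 = 0.524    # total SRAM area, mm^2

metric = power_mw ** 2 * delay_ns * area_mm2
print(round(metric, 2))  # about 179633.9, matching the quoted 179633.89
```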
Pre-charging only the block columns as opposed to each block allowed us to use only:
256 transistors per bit line and bit line bar
4 block columns per SRAM
256 x 2 x 4 = 2048 transistors/SRAM
As opposed to:
256 x 2 x 16(each block) = 8192 transistors/SRAM
We also saved area by using Sense Amps for each block column, as opposed to one for each word of each block:
9 transistors per Sense Amp
32 Sense Amps per block column
4 block columns per SRAM
9 x 32 x 4 = 1152 transistors/SRAM
As opposed to using a sense amp for each word in a block in each of the 16 blocks:
9 x 32 x 8(words/block) x 16(each block) = 36864 transistors/SRAM
Saving 35712 transistors
The data lines at the bottom of the SRAM that enter each block column (as opposed to each block) use:
256 transistors per bit line and bit line bar
4 block columns per SRAM
256 x 2 x 4 = 2048 transistors/SRAM
As opposed to:
256 x 2 x 16(each block) = 8192 transistors/SRAM
Saving 6144 transistors
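The three savings calculations above follow the same pattern; a quick script consolidating them, with all counts taken from the text:

```python
# Transistor savings from sharing circuitry per block column instead of per block.
blocks = 16
block_columns = 4
rows = 256            # transistors per bit line (and per bit line bar)

# Pre-charge / data-line transistors: 256 per BL and per BLB.
shared = rows * 2 * block_columns        # shared per block column
per_block = rows * 2 * blocks            # one set per block
print(shared, per_block, per_block - shared)   # 2048 8192 6144

# Sense amps: 9 transistors each, 32 per block column,
# versus one per word (8 words/block) in each of the 16 blocks.
sa_shared = 9 * 32 * block_columns
sa_per_word = 9 * 32 * 8 * blocks
print(sa_shared, sa_per_word, sa_per_word - sa_shared)  # 1152 36864 35712
```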
The clock buffer was sized based on an approximation of the number of stages it drives. Here is a list of the stages that the CLK signal drives:
Hierarchical Pre-charging : 512
Decoder Enable Logic: 2 ANDs per block column x 4 block columns
Block Select Enable: 4 ANDs
Input Register: 2 ANDs per input x 32 inputs
Output Register: 2 ANDs per output x 32 outputs
512 + 2x4 + 4 + 2x32 + 2x32 = 652 stages
Using this as our primary metric for buffer sizing, and assuming FO4 sizing to obtain the minimum delay, the optimal number of stages was 4. We used 4 inverters, each sized 4x larger than the previous, which ultimately drive all 652 of the previously mentioned stages. By using hierarchical pre-charging, we only need to charge 512 transistors, as opposed to 2048 for every bit line and bit line bar.
*This calculation was under the assumption that the inputs are ideal if driving less than 512 stages. The CLK signal was the only
one of our signals that drove more than this number.
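The stage-count choice can be checked with the usual fanout-of-4 sizing rule: for a load of N identical stages driven through a chain with fanout f per stage, the number of inverters is about log base f of N. A sketch, assuming fanout 4 as in the text:

```python
import math

load_stages = 652            # stages driven by CLK, from the count above
fanout = 4                   # FO4 sizing assumed in the text

# Optimal number of inverter stages is roughly log base 4 of the load.
n_stages = math.log(load_stages, fanout)
print(round(n_stages, 2))    # about 4.67, so 4 stages is a reasonable choice

# With 4 stages, the actual per-stage fanout works out slightly above 4.
per_stage = load_stages ** (1 / 4)
print(round(per_stage, 2))   # about 5.05
```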
14.3mW
(2) To reduce this power, we reduced VDD to 2.5V and kept the same clocking as (1)
The average power was REDUCED to 2.288mW for a read and 1.819mW for a write
This is about 6 times less power for a read, and 8.5 times less for a write
At this VDD and clock, the read delay is 72.18ns and write delay is 38.39ns
Considering the total power to be 5 reads for every write, the total power for one bit cell is roughly:
(1.819mW + 2.228mW (5)) / 6 = 2.159mW
(3) Keeping VDD at 2.5V, we changed the Clock period to 280us, pulse width 80us, and transient simulation time of 560us. The average power for a read dropped to 108.7uW, while the average power for a write rose to 21.48mW.
Compared to (2), this is about 21 times less power for a read, but almost 12 times more power
for a write
The delay for read and write is approximately the same as (2), so there is no significant delay tradeoff
How does this compare for total power to (2)? Considering the total power to be 5 reads for every write, the total power is roughly:
(21.48mW + 108.7uW (5)) / 6 = 3.67mW
Going a step further, perhaps increasing VDD back to 5V while keeping the clocking the same as (3) will lower power:
(4) VDD at 5V and Clock period of 280us, pulse width 80us, and transient simulation time of 560us (time is still on the MICRO-second scale)
The average power INCREASED significantly, to 1.924mW for a read and 440.2mW for a write
Compared to (3), this is about 17 times more power for a read and 20.5 times more power for
a write
Delay is not much different than (3), so once again there is no significant delay tradeoff
Obviously, this is way more power and not a good option. To quantitatively verify, the total power for (4) is:
(440.2mW + 1.924mW (5)) / 6 = 74.97mW (MUCH greater than the total power for (1), (2), or (3))
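The weighted totals for operating points (2) through (4) all use the same 5-reads-per-write average; a consolidated check, with the read/write powers exactly as quoted in each formula:

```python
def total_power_mw(write_mw, read_mw, reads_per_write=5):
    """Average power assuming 5 reads for every write, as in the text."""
    return (write_mw + read_mw * reads_per_write) / (reads_per_write + 1)

# (2) VDD 2.5V, original clocking (read quoted as 2.228mW in the formula)
print(round(total_power_mw(1.819, 2.228), 2))   # 2.16 mW
# (3) VDD 2.5V, slower clocking (read 108.7uW = 0.1087mW)
print(round(total_power_mw(21.48, 0.1087), 2))  # 3.67 mW
# (4) VDD 5V, slower clocking
print(round(total_power_mw(440.2, 1.924), 2))   # 74.97 mW
```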
CONCLUSION: Operating at (3) will give the lowest power option (VDD at 2.5V, Clock period of 280us, pulse width 80us, and transient simulation time of 560us). (3) was used for the calculations in the Metric Breakdown.
The following calculation demonstrates the read delay for our SRAM design, using the shorted BL and BLBs. This is
compared to the read delay for a non-shorted BL and BLB.
For the shorted BL/BLB, across 4 blocks the read delay is:
499.06ns - 428.79ns = 70.27ns
For the read across 1 block (i.e. not shorted bit lines), the delay is:
463.49ns - 428.78ns = 34.71ns
The comparison demonstrates that by shorting the BL/BLB for a block column (that is, reading the BL/BLB across 4 blocks), there is an increase in delay, but it is not 4 times the delay of a read to one block, as one might expect. We sacrifice this delay to significantly reduce the area of the SRAM (detailed in Calculation 4) and to reduce the overall power metric.
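The tradeoff above can be quantified: the shorted-BL read is only about twice the single-block read delay, even though it spans four blocks. A sketch using the two delays as quoted:

```python
# Read delays from the waveform measurements above (ns).
shorted_delay = 70.27   # read across 4 blocks (shorted BL/BLB)
single_delay = 34.71    # read across 1 block (non-shorted)

ratio = shorted_delay / single_delay
print(round(ratio, 2))  # about 2.02 - roughly 2x, not the 4x one might expect
```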