
Graph 1: W1, R1, W0 at TT, 2.5V (nominal), 27C
Graph 2: W1, R1, W0 at TT, 5.5V, 50C

Graph 3: W1, R1, W0 at TT, 5.5V, 27C

Graph 4: W1, R1, W0 at TT, 5.5V, 0C

Graph 5: W1, R1, W0 at FF, 5.5V, 50C

Graph 6: W1, R1, W0 at FF, 5.5V, 27C

Graph 7: W1, R1, W0 at FF, 5.5V, 0C

Graph 8: W1, R1, W0 at SS, 5.5V, 50C

Graph 9: W1, R1, W0 at SS, 5.5V, 27C

Graph 10: W1, R1, W0 at SS, 5.5V, 0C

Graph 11: W1, R1, W0 at SF, 5.5V, 50C

Graph 12: W1, R1, W0 at SF, 5.5V, 27C

Graph 13: W1, R1, W0 at SF, 5.5V, 0C

Graph 14: W1, R1, W0 at FS, 5.5V, 50C

Graph 15: W1, R1, W0 at FS, 5.5V, 27C

Graph 16: W1, R1, W0 at FS, 5.5V, 0C

Graph 17: W1, R1, W0 at TT, 5V, 50C

Graph 18: W1, R1, W0 at TT, 5V, 27C

Graph 19: W1, R1, W0 at TT, 5V, 0C

Graph 20: W1, R1, W0 at FF, 5V, 50C

Graph 21: W1, R1, W0 at FF, 5V, 27C

Graph 22: W1, R1, W0 at FF, 5V, 0C

Graph 23: W1, R1, W0 at SS, 5V, 50C

Graph 24: W1, R1, W0 at SS, 5V, 27C

Graph 25: W1, R1, W0 at SS, 5V, 0C

Graph 26: W1, R1, W0 at SF, 5V, 50C

Graph 27: W1, R1, W0 at SF, 5V, 27C

Graph 28: W1, R1, W0 at SF, 5V, 0C

Graph 29: W1, R1, W0 at FS, 5V, 50C

Graph 30: W1, R1, W0 at FS, 5V, 27C

Graph 31: W1, R1, W0 at FS, 5V, 0C

Graph 32: W1, R1, W0 at TT, 4.5V, 50C

Graph 33: W1, R1, W0 at TT, 4.5V, 27C

Graph 34: W1, R1, W0 at TT, 4.5V, 0C

Graph 35: W1, R1, W0 at FF, 4.5V, 50C

Graph 36: W1, R1, W0 at FF, 4.5V, 27C

Graph 37: W1, R1, W0 at FF, 4.5V, 0C

Graph 38: W1, R1, W0 at SS, 4.5V, 50C

Graph 39: W1, R1, W0 at SS, 4.5V, 27C

Graph 40: W1, R1, W0 at SS, 4.5V, 0C

Graph 41: W1, R1, W0 at SF, 4.5V, 50C

Graph 42: W1, R1, W0 at SF, 4.5V, 27C

Graph 43: W1, R1, W0 at SF, 4.5V, 0C

Graph 44: W1, R1, W0 at FS, 4.5V, 50C

Graph 45: W1, R1, W0 at FS, 4.5V, 27C

Graph 46: W1, R1, W0 at FS, 4.5V, 0C

Figures:

Figure A: Average Write Power

Figure B: Write 0 Simulation

Figure C: Average Read Power

Figure D: Read 1 Simulation

Figure E: Average Power 1:2 Decoder in Model Array

Figure F: Average Power 8:256 Decoder in SRAM

Figure G: Metric Breakdown

Delivery Item | Value
Metric (mW^2 * ns * mm^2) | 179633.89
Bitcell area | 498 um2 (.000498 mm2)
Total area (mm2) | .524
Read power (mW) (for a 32-bit read) | 2.288 * 32 = 73.22
Write power (mW) (for a 32-bit write) | 1.819 * 32 = 58.21
Total power (mW) (average of 5 reads for each write operation) | 68.916
Read delay (ns) | 72.18
Write delay (ns) | 38.39
Total delay (ns) | 72.18

Figure H: Bit Cell Layout

Figure I: Bit Cell Layout Array

Calculations:
Calculation 1: Modeled Resistance and Capacitance in the Array Model

(Capacitance/length and resistance/length are based on the table on pg. 144, and are used to model the BL and WL capacitances/resistances)


>Capacitance/length = 95aF/um
From the layout, the WL length is 27.9um for one bit cell.
For 1024 bit cells (the number of bit cells across the SRAM array): 27.9um * 1024 = 28569.6um
From the layout, the BL length is 5.4um for one bit cell.
For 1024 bit cells (the number of bit cells in a column of the SRAM array): 5.4um * 1024 = 5529.6um
Modeled Capacitance:
WL: 28569.6um * 95aF/um = 2.714pF
BL: 5529.6um * 95aF/um = .525pF

>Resistance equation is R = Rs * (L / W), where the sheet resistance Rs = .1 ohm/square and W (for both BL and WL) is 1.2um

Modeled Resistance:
WL: .1 * (28569.6um) / 1.2um = 2380.8 ohms
BL: .1 * (5529.6um) / 1.2um = 460.8 ohms
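As a quick cross-check, here is a minimal Python sketch of this arithmetic (illustrative only; the 95aF/um, .1 ohm/square, and 1.2um width are the values used in the hand calculation above):

```python
# Illustrative sketch: recompute the modeled WL/BL parasitics from the
# per-unit-length values and layout dimensions given in Calculation 1.

C_PER_UM = 95e-18      # 95 aF/um
R_SHEET = 0.1          # sheet resistance used above (ohms/square)
WIDTH_UM = 1.2         # BL and WL wire width (um)

wl_len_um = 27.9 * 1024    # word line across 1024 bit cells
bl_len_um = 5.4 * 1024     # bit line down 1024 bit cells

for name, length in [("WL", wl_len_um), ("BL", bl_len_um)]:
    cap_pf = length * C_PER_UM * 1e12          # capacitance in pF
    res_ohm = R_SHEET * length / WIDTH_UM      # resistance in ohms
    print(f"{name}: {length:.1f} um, {cap_pf:.3f} pF, {res_ohm:.1f} ohms")

# Expected output (matches the values above):
# WL: 28569.6 um, 2.714 pF, 2380.8 ohms
# BL: 5529.6 um, 0.525 pF, 460.8 ohms
```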

Calculation 2: Metric Breakdown Calculations

Bit Cell Area:

27.9um x 17.85um = 498um2 (.000498mm2)

Bit Cell Area for the Entire SRAM (1024 x 1024 bit cells):
28569.6um x 18278.4um = .522mm2
(this does not include the periphery)
For the periphery:
Take the average area of a transistor (83pm2) and multiply it by the approximate number of transistors added for the periphery (see Calculation 4):
83pm2 * (10752 + 2058 + 1152 + 2048 + 10752) = 2.22um2
-> assuming the periphery adds about 2.22um2, the total area is .524mm2

Power:
Write power: 1.819mW
Read power: 2.288mW
(both are average power values for a read/write to one bit cell in the whole modeled array)

Total power is the average of 5 reads for every write:
(1.819mW + 5 * 2.228mW) / 6 = 2.159mW
Subtract the decoder power for the 1:2 decoder used in the modeled array, because the actual SRAM uses an 8:256 decoder (55.02uW):
2.159mW - 55.02uW = 2.104mW
Then, for a full 32-bit word read/write, this power is multiplied by 32:
2.104mW * 32 = 67.33mW
Add in the 8:256 decoder power used in the actual SRAM (1.586mW):
67.33mW + 1.586mW = 68.916mW
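A minimal sketch of the same bookkeeping, carried out without intermediate rounding (it lands at roughly 68.94mW; the reported 68.916mW differs only because each step above is rounded to three decimals before the next):

```python
# Illustrative sketch: 32-bit power bookkeeping from the steps above.

write_mw = 1.819          # avg write power to one bit cell (mW)
read_mw = 2.228           # avg read power value used in the averaging step (mW)
dec_1to2_mw = 0.05502     # 1:2 decoder power in the modeled array (mW)
dec_8to256_mw = 1.586     # 8:256 decoder power in the full SRAM (mW)

per_cell = (write_mw + 5 * read_mw) / 6        # 5 reads per write  -> ~2.160 mW
per_cell_no_dec = per_cell - dec_1to2_mw       # remove 1:2 decoder -> ~2.105 mW
word32 = per_cell_no_dec * 32                  # full 32-bit word   -> ~67.35 mW
total = word32 + dec_8to256_mw                 # add 8:256 decoder  -> ~68.94 mW
print(f"total SRAM power ~= {total:.3f} mW")   # report value: 68.916 mW
```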

Delay:

Worst-case read delay: 499.06ns - 428.79ns - .694ns (delay of the 1:2 decoder) + 2.6ns (delay of the 8:256 decoder) = 72.18ns
Worst-case write delay: 403.5ns - 367.01ns - .694ns (delay of the 1:2 decoder) + 2.6ns (delay of the 8:256 decoder) = 38.39ns

Metric calculation:

(Area) * (Power^2) * (Delay) = (mm^2) * (mW^2) * (ns) = (.524) * (68.916^2) * (72.18) = 179633.89
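A similar sketch for the delay adjustments and the area * power^2 * delay metric, using the values from the Metric Breakdown table (small last-digit differences come from rounding):

```python
# Illustrative sketch: worst-case delays after swapping the 1:2 decoder for
# the 8:256 decoder, plus the area * power^2 * delay metric.

dec_1to2_ns, dec_8to256_ns = 0.694, 2.6

read_ns  = round((499.06 - 428.79) - dec_1to2_ns + dec_8to256_ns, 2)  # 72.18
write_ns = round((403.5 - 367.01) - dec_1to2_ns + dec_8to256_ns, 2)   # 38.40 (report: 38.39)

area_mm2, power_mw = 0.524, 68.916
metric = area_mm2 * power_mw**2 * read_ns     # units: mm^2 * mW^2 * ns
print(read_ns, write_ns, round(metric, 1))    # -> 72.18 38.4 179633.9
```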

Calculation 3: Bit Cell Ratio

For Read

For Write

Calculation 4: Area Savings


By using connected (shorted) bit lines and bit line bars, we drastically reduced the number of transistors needed in the SRAM, which saves area and reduces power (the counts below are tallied in the sketch at the end of this calculation).
By having a MUX that feeds the output from each of the four block columns, as opposed to from each of the 16 blocks, we were able to use:
6 transistors per MUX
7 MUXs across x 32 MUXs for an 8x32-input MUX
2 8x32-input MUXs per block column
4 block columns per SRAM
6 x 7 x 32 x 2 x 4 = 10752 transistors/SRAM
As opposed to:
6 x 7 x 32 x 2 x 16 (one per block) = 43008 transistors/SRAM
Saving 32256 transistors

Pre-charging only the block columns, as opposed to each block, allowed us to use only:
256 transistors per bit line and bit line bar
4 block columns per SRAM
256 x 2 x 4 = 2048 transistors/SRAM
As opposed to:
256 x 2 x 16 (one per block) = 8192 transistors/SRAM
Saving 6144 transistors

We also saved area by using sense amps for each block column, as opposed to for each word of each block:
9 transistors per sense amp
32 sense amps per block column
4 block columns per SRAM
9 x 32 x 4 = 1152 transistors/SRAM
As opposed to using a sense amp for each word in each of the 16 blocks:
9 x 32 x 8 (words/block) x 16 (each block) = 36864 transistors/SRAM
Saving 35712 transistors

The data input at the bottom of the SRAM, which enters each block column as opposed to each block, uses:
256 transistors per bit line and bit line bar
4 block columns per SRAM
256 x 2 x 4 = 2048 transistors/SRAM
As opposed to:
256 x 2 x 16 (one per block) = 8192 transistors/SRAM
Saving 6144 transistors
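A minimal sketch tallying the per-block-column versus per-block transistor counts listed in this calculation:

```python
# Illustrative sketch: transistor counts and savings from Calculation 4.

BLOCK_COLUMNS, BLOCKS = 4, 16

periphery = {
    # name: (transistors per block column, transistors per block)
    "output MUXes":    (6 * 7 * 32 * 2, 6 * 7 * 32 * 2),
    "pre-charge":      (256 * 2,        256 * 2),
    "sense amps":      (9 * 32,         9 * 32 * 8),   # per block: one amp set per word (8 words)
    "data-in drivers": (256 * 2,        256 * 2),
}

total_saved = 0
for name, (per_col, per_blk) in periphery.items():
    used, alt = per_col * BLOCK_COLUMNS, per_blk * BLOCKS
    total_saved += alt - used
    print(f"{name}: {used} vs {alt} (saves {alt - used})")
print("total transistors saved:", total_saved)

# Expected per-item counts, matching the text: 10752/43008, 2048/8192,
# 1152/36864, 2048/8192.
```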

Calculation 5: Clock Buffer

The clock buffer was sized based on an approximation of the number of stages it drives. Here is a list of the stages that the CLK signal drives:
Hierarchical pre-charging: 512
Decoder enable logic: 2 ANDs per block column x 4 block columns
Block select enable: 4 ANDs
Input register: 2 ANDs per input x 32 inputs
Output register: 2 ANDs per output x 32 outputs
512 + 2x4 + 4 + 2x32 + 2x32 = 652 stages
Using this as our primary metric for buffer sizing, and assuming fanout-of-4 (FO4) stages to obtain the minimum delay, the optimal number of stages was 4. We used 4 inverters, each sized 4x larger than the previous, with the last driving all 652 of the stages listed above (see the sketch after this calculation). By using hierarchical pre-charging, the clock only needs to charge 512 pre-charge transistors, as opposed to 2048 (one for every bit line and bit line bar).

*This calculation was made under the assumption that inputs are ideal if they drive fewer than 512 stages. The CLK signal was the only one of our signals that drove more than this number.
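A minimal sketch of this sizing estimate; the stage count and 4x per-stage sizing are from the text above, while the [4, 16, 64, 256] relative sizes assume the first inverter is 4x a minimum-size inverter (not stated explicitly above):

```python
# Illustrative sketch: clock-buffer stage count and FO4-style sizing.

import math

stages_driven = 512 + 2 * 4 + 4 + 2 * 32 + 2 * 32       # = 652 loads on CLK
n_buf = math.floor(math.log(stages_driven, 4))           # log4(652) ~ 4.7; 4 stages used
sizes = [4 ** i for i in range(1, n_buf + 1)]            # each inverter 4x the previous
                                                         # (relative to an assumed 1x minimum inverter)
print(stages_driven, n_buf, sizes)                       # -> 652 4 [4, 16, 64, 256]
```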

Calculation 6: VDD, Clock Sensitivity, and Shorted Bit Line Tradeoffs

These calculations analyze some of the power and delay tradeoffs when lowering VDD and the clock period. The end of this section also details the read delay using shorted bit lines versus non-shorted bit lines. These simulations were for a read and a write to ONE bit cell in a modeled array, because of the difficulty of simulating an entire 32-bit read/write.
Originally, the read and write were simulated at the minimum clock period for functionality:
(1) VDD = 5V, clock period of 280ns, pulse width of 80ns, and transient simulation time of 560ns
The average power was 14.1mW for a read and 15.56mW for a write.
Taking the total power to be 5 reads for every write, the total power for one bit cell is roughly
(15.56mW + 5 * 14.1mW) / 6 = 14.3mW

(2) To reduce this power, we lowered VDD to 2.5V and kept the same clocking as in (1)
The average power was REDUCED to 2.288mW for a read and 1.819mW for a write
This is about 6 times less power for a read, and about 8.5 times less power for a write
At this VDD and clock, the read delay is 72.18ns and the write delay is 38.39ns
Taking the total power to be 5 reads for every write, the total power for one bit cell is roughly
(1.819mW + 5 * 2.228mW) / 6 = 2.159mW

---Can the power be reduced further by altering the clock?


(3) We tried to reduce the power further by keeping VDD at 2.5V and changing the clock period to 280us, pulse width to 80us, and transient simulation time to 560us (now on the MICRO-second scale)
The average power was REDUCED to 108.7uW for a read but INCREASED significantly to 21.48mW for a write
Compared to (2), this is about 21 times less power for a read, but almost 12 times more power for a write
The read and write delays are approximately the same as in (2), so there is no significant delay tradeoff
How does this compare to (2) in total power? Taking the total power to be 5 reads for every write, the total power is roughly
(21.48mW + 5 * 108.7uW) / 6 = 3.671mW, which is GREATER than (2), therefore (2) is the better choice for low power

Going a step further, perhaps increasing VDD back to 5V while keeping the clocking the same as (3) would lower the power:
(4) VDD at 5V, clock period of 280us, pulse width of 80us, and transient simulation time of 560us (still on the MICRO-second scale)
The average power INCREASED significantly to 1.924mW for a read and 440.2mW for a write
Compared to (3), this is about 17 times more power for a read and about 20.5 times more power for a write
The delay is not much different from (3), so once again there is no significant delay tradeoff
Obviously, this is far more power and not a good option. To verify quantitatively, the total power for (4) is:
(440.2mW + 5 * 1.924mW) / 6 = 74.97mW (MUCH greater than the total power for (1), (2), or (3))

CONCLUSION: Operating at (2) gives the lowest total power (VDD at 2.5V, clock period of 280ns, pulse width of 80ns, and transient simulation time of 560ns). (2) was used for the calculations in the Metric Breakdown.
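A small sketch recomputing the 5-reads-per-write weighted total power at each of the four operating points above (read/write values as quoted; 2.228mW is the read value used in the averaging step):

```python
# Illustrative sketch: weighted total power for operating points (1)-(4).

points = {
    # label: (read power mW, write power mW)
    "(1) 5V,   280ns clock": (14.1,   15.56),
    "(2) 2.5V, 280ns clock": (2.228,  1.819),
    "(3) 2.5V, 280us clock": (0.1087, 21.48),
    "(4) 5V,   280us clock": (1.924,  440.2),
}

for label, (read_mw, write_mw) in points.items():
    total = (write_mw + 5 * read_mw) / 6       # 5 reads for every write
    print(f"{label}: {total:.3f} mW")
# -> (1) ~14.343, (2) ~2.160, (3) ~3.671, (4) ~74.970  => (2) is the low-power point
```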

The following calculation demonstrates the read delay for our SRAM design using the shorted BL and BLBs, compared to the read delay for non-shorted BL and BLB.
For the shorted BL/BLB, reading across 4 blocks, the read delay is:
499.06ns - 428.78ns = 70.27ns
For a read across 1 block (i.e. non-shorted bit lines), the delay is:
463.49ns - 428.78ns = 34.71ns
The comparison demonstrates that by shorting the BL/BLB for a block column (that is, reading the BL/BLB across 4
blocks), there is an increase in delay, but it is not 4 times the delay for a read to one block (which one might
expect). We sacrifice this delay to reduce the area of the SRAM significantly (detailed in Calculation 4) and reduce
the overall power metric.
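A minimal sketch of this comparison; the ratio (roughly 2x, not 4x) is derived from the two delays above, and the 0.01ns difference from the quoted 70.27ns is rounding of the quoted endpoints:

```python
# Illustrative sketch: shorted vs. non-shorted bit-line read delay and ratio.

shorted_ns     = 499.06 - 428.78   # read across 4 blocks (report quotes 70.27 ns)
non_shorted_ns = 463.49 - 428.78   # read across 1 block  (34.71 ns)
print(f"shorted {shorted_ns:.2f} ns, non-shorted {non_shorted_ns:.2f} ns, "
      f"ratio {shorted_ns / non_shorted_ns:.2f}x")
# -> shorted 70.28 ns, non-shorted 34.71 ns, ratio 2.02x
```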
