
Delay Analysis and Design Exploration for 3D SRAM

Xi Chen, W. Rhett Davis
Department of Electrical and Computer Engineering, North Carolina State University {xchen10, rhett_davis}@ncsu.edu
Abstract: The emerging three-dimensional (3D) integration technology provides a way to reduce delay in SRAM. In this paper, we present a physical-based delay analysis approach to explore 3D SRAM design options. Our analysis can be used to optimize 3D SRAM timing performance at both the sub-array and system level. Design examples based on the MITLL 3D process are constructed to demonstrate the trade-offs. As the analysis results show, the optimized 3D sub-array provides up to 20% extra improvement in SRAM access time reduction.

I. INTRODUCTION

In modern Very Large Scale Integrated Circuits (VLSI) development, device delay is continuously scaled down, while interconnect delay increases dramatically [1]. The interconnect wire delay has become the main constraint on VLSI performance in deep sub-micron nodes [2]. Three-dimensional integrated circuit (3DIC) technology aims to alleviate these problems by significantly reducing wire length compared with its 2D counterpart [3]. Recent SOCs, e.g. processors, tend to require increasingly large on-chip memory (cache) [4]. SRAM arrays are the most important portions of the cache, and determine the operating speed of the whole system. The 3D integration technique can improve SRAM design by reducing the global wire length and optimizing the sub-array design [5].

The benefit of 3D cache design has been shown in [6] by Black et al. Those designs are mainly custom designs, and largely depend on the designer's experience. For cache performance analysis, the tool CACTI [7] is widely used. 3D-CACTI was developed by Tsai et al. at PSU [8]. A closely related work is PRACTICS, developed by Zeng et al. [9]. Those tools predict 3D memory performance at the system level, and show few details about components and physical-level calculation. To optimize 3D SRAM design under more flexible conditions, a general analysis methodology based on physical components is highly needed.

In this paper, a physical-based delay analysis approach for 3D SRAM is introduced. The discussion focuses on calculating the access time of a 3D SRAM array by breaking the delay into sections related to the circuit components along the signal path. With this method, designers can directly learn about parasitic effects such as those of the Through Silicon Via (TSV) [10], and optimize the SRAM array in different 3D processes with more flexibility.

The outline of the paper is as follows. Section 2 introduces the structure and partitioning of 3D SRAM. Section 3 presents the physical-based analysis and derives the analytical equations for 3D SRAM delay calculation. Design exploration examples and results are presented in Section 4. The conclusion is given in Section 5.

978-1-4244-4512-7/09/$25.00 ©2009 IEEE

II. STRUCTURE OF 3D SRAM

A 2D hierarchical SRAM architecture is shown in Fig.1, which can be used as a reference for 3D analysis.

(a) Array structure

(b) Bank structure

Fig.1 Hierarchical SRAM architecture


The SRAM system consists of an array of banks. In a large SRAM array (several megabytes), address and data are routed to and from the bank matrices on H-tree distribution networks, to provide uniform access. In every bank, there are several sub-arrays, which are selected by the address input. Each of them provides part of the output data.

Generally speaking, the benefit of 3D partitioned SRAM comes from shorter wire lengths, and the penalty comes from the extra capacitive loading of 3D vias (TSVs). Distributing the SRAM array into several tiers obviously reduces the delay of global interconnects.

At the SRAM cell level, the feasibility of separating six transistors into several tiers is constrained by the 3D via size. Current 3D via sizes vary from 1µm×1µm to 10µm×10µm [12], and they are comparable to SRAM cell dimensions. Cell partitioning is valuable only when the ratio of TSV width to cell width is smaller than 0.3 [11]. Therefore, partitioning at the cell level is difficult in present and near-future technology nodes. 3D SRAM partitioning at the sub-array level is a more practical choice [5].

In 3D design, we have two options, Word-line Split and Bit-line Split. Word-line split means dividing each word-line in an array into several tiers. The local word-line wire length in one tier is then reduced. The drivers of the local word-line are duplicated for each tier. This structure provides benefits within the sub-array, in addition to the bank-level and out-of-bank H-tree wire length reduction.

Bit-line split is a similar concept to word-line split. In this partitioning, the bit-line length in the sub-array as well as the number of pass transistors connected to a bit-line is reduced, and hence so are the delay and power. Like word-line split, this also makes the sub-array dimension smaller, and provides access time improvement from higher-level wire length reduction.

III. 3D SRAM TIMING ANALYSIS

A. Delay Analysis of 2D SRAM Bank

As the hierarchical structure in Fig.1 shows, the critical path in an SRAM bank runs from the address decoder input, through the pre-decoder wire, global word-line and local word-line, to the SRAM cell, then through the bit-line and sense-amp to the output of the bank. For 2D SRAM timing analysis, a physical-based scaling methodology is proposed by Amrutur and Horowitz in [13]. Generally, the delay inside a bank can be divided into logic gate delays and RC delays of wires.

Logic gate delay in an SRAM bank has two parts, which are contributed by the decoder (Ddec) and the sense-amp (Dsa), as in equations (1) and (2).

$$D_{dec} = N_{dec} \cdot \tau_{fo4} + D_{P.dec} \qquad (1)$$

$$D_{sa} = 2\tau_{fo4} + R_{O.sa} \cdot C_{L.sa} \qquad (2)$$

Here τfo4 is the extrinsic delay of the fanout-of-4 inverter. Ndec is the optimized gate number in the decoder path. DP.dec is the parasitic delay of the decoder. RO.sa and CL.sa are the output resistance and loading capacitance of the sense-amp.

The RC delay of wires consists of the input wire delay (Dwire.in) and the output wire delay (Dwire.out), as in equations (3) and (4).

$$D_{wire.in} = D_{pre} + D_{gwl} + D_{lwl} = R_{pre} \cdot \left(\frac{C_{pre}}{2} + C_{gd}\right) + \frac{R_{gwl} \cdot (C_{gwl} + C_{ld})}{8} + \frac{R_{lwl} \cdot C_{lwl}}{2} \qquad (3)$$

$$D_{wire.out} = D_{sw} + D_{cell} + D_{bl} + D_{mux} = \frac{V_{th}}{V_{DD}} \cdot T_{rise.wl} + \frac{V_{b}}{I_{cell}} \cdot (C_{bl} + C_{mux} + C_{sa}) + R_{bl} \cdot \frac{C_{bl} \cdot (C_{mux} + C_{sa})}{C_{bl} + C_{mux} + C_{sa}} + R_{mux} \cdot \frac{C_{sa} \cdot (C_{bl} + C_{mux})}{C_{bl} + C_{mux} + C_{sa}} \qquad (4)$$

The three portions of Dwire.in are the RC delays of the pre-decoder wire (Dpre), global word-line (Dgwl) and local word-line (Dlwl). Cgd and Cld are the capacitive loadings caused by the global and local word-line drivers, which could be extracted after layout. Dwire.out includes the word-line switch (Dsw), SRAM cell (Dcell), bit-line (Dbl) and multiplexer (Dmux). Trise.wl is the word-line voltage rise time. Vb is the bit-line voltage swing. Icell is the SRAM cell driving current. Cbl, Cmux and Csa are the capacitances on the bit-line, multiplexer and sense-amp input respectively.

B. Effect of 3D Sub-array Partitioning

This sub-section analyzes the influence of 3D stacking on SRAM performance. Distributed RC delays like Dwire.in and Dwire.out defined in (3) and (4) are sensitive to wire length, and will be obviously impacted by 3D partitioning.

1) Word-line split 3D SRAM

If the SRAM sub-array is separated into Ntier stacked tiers by word-line split 3D partitioning, the lengths of the global word-lines and local word-lines are reduced by a factor of Ntier. Dgwl and Dlwl in equation (3) are:

$$D_{gwl} = \frac{R_{gwl}}{8 \cdot N_{tier}} \cdot \left(\frac{C_{gwl}}{N_{tier}} + C_{ld}\right) \qquad (5)$$

$$D_{lwl} = \frac{R_{lwl} \cdot C_{lwl}}{2 \cdot N_{tier}^{2}} \qquad (6)$$

For each local word-line, there are Ntier local decoders located on the Ntier stacked tiers, and they are all driven by the global decoder. The loading Cld in equation (3) is:

$$C_{ld} = N_{ld} \cdot C_{g} \cdot N_{subx} \cdot N_{tier} + N_{subx} \cdot (N_{tier} - 1) \cdot C_{TSV} \qquad (7)$$

CTSV is the capacitance of a Through-Silicon-Via. Nld*Cg is the input capacitance of the local word-line driver. Nsubx (Nsuby) is the number of columns (rows) of sub-arrays in a bank. The cost of the extra loading on the global word-line could be compensated by a smaller driver size. Rpre and Cpre in (3) are also reduced because of the smaller sub-array height. Trise.wl in (4) will also be reduced, which benefits the total access time.

2) Bit-line split 3D SRAM

For bit-line split, Rbl and Cbl in equation (4) are both divided by Ntier for the separated bit-line, and a smaller Dwire.out is achieved. Assuming the sense-amp is shared between tiers, the extra loading of CTSV should be added into Csa, the input capacitance of the sense-amp. So Csa becomes:

$$C_{sa} = N_{sa} \cdot C_{g} + (N_{tier} - 1) \cdot C_{TSV} \qquad (8)$$

C. H-tree Network in Large 3D SRAM Array

In a large memory array (>1MBit) with many banks, the system delay (access time) tends to come primarily from the H-tree networks (Fig.1 (a)), and it is much larger than the Random Cycle Time, which is defined by the sub-array delay. Delay analysis and optimization for the global interconnect is therefore introduced into the total access time calculation for a large SRAM array. The H-tree wire length (LHtree) from the I/O port to the address decoder input can be calculated as in equation (9).

$$L_{Htree} = (N_{bankx}/2 - 0.5) \cdot W_{bank} + (N_{banky}/2 - 0.5) \cdot L_{bank}$$

$$W_{bank} = N_{subx} \cdot (W_{ld} + N_{bl} \cdot W_{cell})$$

$$L_{bank} = N_{suby} \cdot (L_{sa} + N_{wl} \cdot L_{cell}) \qquad (9)$$

Nbankx and Nbanky are the numbers of banks in a row and a column of the SRAM array. Wbank and Lbank are the width and length of a bank. Wld is the width of the local word-line driver in a sub-array. Lsa is the length taken by the sense-amp. Nbl and Nwl are the numbers of bit-lines and word-lines in a sub-array. Wcell and Lcell are the width and length of an SRAM cell.

Placing extra repeaters along the wire distribution will reduce the total delay. From [14], for a given technology and a given interconnect layer, the minimum long wire delay with repeaters is:

$$D_{p.min} = (1.38 + 1.02\sqrt{1 + \gamma}) \cdot L_{Htree} \cdot \sqrt{R_{d} C_{d} r_{H} c_{H}} \qquad (10)$$

LHtree is the wire length (from the SRAM array I/O port to the address decoder input in a bank) in the H-tree. rH and cH are the resistance and capacitance per unit wire length in the H-tree. Rd and Cd are the output resistance and intrinsic loading of the repeater. Typically, γ=1 for a repeater.

IV. DESIGN EXPLORATION FOR 3D SRAM

Three example cases are demonstrated to verify the proposed analysis, and design guidelines for 3D SRAM are provided. Process parameters used in the calculation are from the MITLL 3D SOI technology and the recently developed NCSU 45nm 3D FreePDK.

A. Sub-array Performance Constraint of CTSV

As mentioned in Section 3, increasing CTSV will diminish the benefit of multi-tier partitioning. The CTSV value of the MITLL technology is 2fF, which is extracted by Calibre® XRC and simulated by Q3D® [11]. As (7) shows, CTSV won't be a major problem when its value is reasonable. However, under non-optimal conditions (e.g. a thick-substrate bulk process), CTSV could be much larger than the capacitance of normal vias, and increase the delay. So we may want to explore the relationship between the available tier number and the largest acceptable CTSV value.

Fig.3 shows the calculation results for a 1MBit word-line split 3D SRAM bank on the NCSU 45nm 3D FreePDK. In this case, for 20fF CTSV, the delay tends toward a constant as more tiers are stacked. When CTSV becomes 100fF, stacking more tiers tends to increase the access time quickly. But if CTSV has a reasonable value, we can still get an obvious timing improvement from sub-array and bank level wire reduction.

B. Timing Optimization Design for 3D SRAM

Based on the analytical equations in Section 3, the optimized sub-array structure can be derived for a 3D SRAM of a defined size. Normally the available tier number is constrained by the fabrication technology, and the number of banks in the SRAM array is defined by the specification. Fig.2 is the calculation result showing the relationship between 3D SRAM access time and the sub-array configuration. With more sub-arrays in a bank, the delay decreases at the beginning. But beyond the optimized point, the area taken by the local word-line driver in each sub-array results in a longer global interconnect length. In an example of a 16MBit, 4 tiers, 16 banks word-line split 3D SRAM design on the MITLL technology, the best trade-off point is 64 sub-arrays with the same bit-line and word-line number (64 for both) in each bank. This configuration minimizes the total access time to 481ps.

Fig.2 Access time sweep of 16MBit, 4 tiers word-line split 3D SRAM (axes: Num of sub-array, Nbl/Nwl, Delay (ps))
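The word-line split relations (5) to (7) and the H-tree relations (9) and (10) of Section III can be sketched numerically as follows. This is only an illustrative sketch, not the authors' tool: the function names and every value in `PARAMS` are hypothetical placeholders rather than MITLL or FreePDK process data, and the equations are coded in the form D_gwl = R_gwl/(8·N_tier)·(C_gwl/N_tier + C_ld), D_lwl = R_lwl·C_lwl/(2·N_tier²), which is an assumption about how the printed formulas read.

```python
import math


def c_ld(n_tier, n_ld, c_g, n_subx, c_tsv):
    """Eq. (7): load on the global word-line, including the TSV term."""
    return n_ld * c_g * n_subx * n_tier + n_subx * (n_tier - 1) * c_tsv


def d_gwl(n_tier, r_gwl, c_gwl, load):
    """Eq. (5): global word-line delay for an N_tier word-line split."""
    return r_gwl / (8 * n_tier) * (c_gwl / n_tier + load)


def d_lwl(n_tier, r_lwl, c_lwl):
    """Eq. (6): local word-line delay for an N_tier word-line split."""
    return r_lwl * c_lwl / (2 * n_tier ** 2)


def wordline_delay(n_tier, c_tsv, p):
    """D_gwl + D_lwl from Eq. (3), with Eqs. (5)-(7) substituted."""
    load = c_ld(n_tier, p["n_ld"], p["c_g"], p["n_subx"], c_tsv)
    return d_gwl(n_tier, p["r_gwl"], p["c_gwl"], load) + d_lwl(
        n_tier, p["r_lwl"], p["c_lwl"]
    )


def htree_length(n_bankx, n_banky, w_bank, l_bank):
    """Eq. (9): H-tree length from the I/O port to a bank decoder input."""
    return (n_bankx / 2 - 0.5) * w_bank + (n_banky / 2 - 0.5) * l_bank


def repeated_wire_delay(length, r_d, c_d, r_h, c_h, gamma=1.0):
    """Eq. (10): minimum delay of a long wire with repeaters."""
    coeff = 1.38 + 1.02 * math.sqrt(1 + gamma)
    return coeff * length * math.sqrt(r_d * c_d * r_h * c_h)


# Hypothetical placeholder values in SI units (NOT taken from the paper).
PARAMS = {
    "n_ld": 4,        # gate-width multiple of the local word-line driver
    "c_g": 1e-15,     # unit gate capacitance (F)
    "n_subx": 8,      # columns of sub-arrays in a bank
    "r_gwl": 2000.0,  # 2D global word-line resistance (ohm)
    "c_gwl": 200e-15, # 2D global word-line capacitance (F)
    "r_lwl": 1000.0,  # 2D local word-line resistance (ohm)
    "c_lwl": 100e-15, # 2D local word-line capacitance (F)
}

if __name__ == "__main__":
    # Sweep the tier count for the two C_TSV values discussed in Section IV.A.
    for c_tsv in (20e-15, 100e-15):
        for n_tier in (1, 2, 4, 8, 16, 32):
            delay_ps = wordline_delay(n_tier, c_tsv, PARAMS) * 1e12
            print(f"C_TSV={c_tsv * 1e15:.0f}fF N_tier={n_tier}: {delay_ps:.2f} ps")
```

With real extracted values substituted for `PARAMS`, the same sweep could be used to look for the tier count beyond which the TSV term in (7) outweighs the wire length reduction.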

Fig.3 3D SRAM performance improvement constrained by CTSV value (Delay (ps) versus Ntier = 1, 2, 4, 8, 16, 32 for CTSV = 20fF and CTSV = 100fF; delay components: Decoder, WordLine, BitLine, SenseAmp)

C. Sub-array Partitioning and Bank Distribution

Without sub-array partitioning (word-line or bit-line split), an easy way of converting an SRAM design from 2D to 3D is distributing the sub-arrays in a bank to several tiers. In this 4 tiers, 16 sub-arrays case (MITLL technology), 3D sub-array partitioning provides about 50ps access time reduction compared to bank distribution. As Fig.4 shows, the access time of a 3D SRAM array with repeaters in the H-tree networks is linear in the square root of the SRAM size. Sub-array partitioning provides extra delay improvement in addition to the global interconnect reduction. This extra improvement is more obvious (10~20%) when the SRAM size is less than 16MBit.

Fig.4 Sub-array 3D partitioning and simple distribution in 4 tiers (curves: Sub-array Partition, Bank Distribution, Ratio; axes: Square Root of SRAM size (Bit), Access Time (ps))

V. CONCLUSION

3D IC offers the opportunity to improve SRAM memory performance beyond the constraint of chip area. In this paper, a physical-based 3D SRAM delay analysis method is proposed. As the design example shows, a delay-optimized configuration of the 3D SRAM sub-array can be derived under various process conditions. The largest number of tiers that can be stacked for an SRAM array is constrained by the parasitic capacitance of the 3D via (TSV). The access time of SRAM can be improved by bank distribution for shorter global interconnects. However, optimized sub-array partitioning provides extra benefit in delay reduction. The analysis method presented in this paper is valuable for 3D SRAM design and optimization.

ACKNOWLEDGMENT

We would like to thank the Semiconductor Research Corporation for their support under Grant 1824.

REFERENCES

[1] The International Technology Roadmap for Semiconductors, 2008.
[2] S. J. Souri, K. Banerjee, A. Mehrotra, and K. C. Saraswat, "Multiple Si layer ICs: motivation, performance analysis, and design implications," in Proc. IEEE Design Automation Conference, Jun. 2000, pp. 213–220.
[3] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, "3-D ICs: a novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration," Proceedings of the IEEE, May 2001, pp. 602–633.
[4] Philip Jacob, Aamir Zia, Okan Erdogan, Paul M. Belemjian, Jin-Woo Kim, Michael Chu, Russell P. Kraft, John F. McDonald, and Kerry Bernstein, "Mitigating Memory Wall Effects in High-Clock-Rate and Multicore CMOS 3-D Processor Memory Stacks," Proceedings of the IEEE, Vol. 97, No. 1, January 2009.
[5] Yuan Xie, Gabriel H. Loh, Bryan Black, and Kerry Bernstein, "Design Space Exploration for 3D Architectures," ACM Journal on Emerging Technologies in Computing Systems, Vol. 2, No. 2, April 2006, pp. 65–103.
[6] B. Black et al., "Die stacking (3D) microarchitecture," in Proc. Int. Symp. Microarch., 2006, pp. 469–479.
[7] S. Thoziyoor et al., "CACTI 5.1," HP Laboratories, April 2, 2008.
[8] Y.-F. Tsai, Y. Xie, N. Vijaykrishnan, and M. J. Irwin, "Three-dimensional cache design exploration using 3D cacti," in Proc. IEEE Int. Conf. Comput. Des., 2005, pp. 519–524.
[9] A. Zeng, J. Lu, K. Rose, and R. J. Gutmann, "First-order performance prediction of cache memory with wafer-level 3D integration," IEEE Des. Test Comput., vol. 22, no. 6, pp. 548–555, Nov. 2005.
[10] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule, M. Steer, and P. D. Franzon, "Demystifying 3D ICs: The pros and cons of going vertical," IEEE Des. Test Comput., vol. 22, no. 6, pp. 498–510, 2005.
[11] W. Rhett Davis, Eun Chu Oh, Ambarish M. Sule, and Paul D. Franzon, "Application Exploration for 3D Integrated Circuits: TCAM, FIFO, and FFT Case Studies," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 17, No. 4, April 2009.
[12] J. W. Joyner, P. Zarkesh-Ha, and J. D. Meindl, "A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3DSoC)," in Proc. 14th Ann. IEEE Int. ASIC/SOC Conf., Sep. 2001, pp. 147–151.
[13] Bharadwaj S. Amrutur and Mark A. Horowitz, "Speed and Power Scaling of SRAM's," JSSC, Vol. 35, No. 2, Feb 2000.
[14] Jan M. Rabaey, Anantha Chandrakasan, and Borivoje Nikolic, "Digital Integrated Circuits, A Design Perspective, second edition."
