High Speed Memory Alsc
CLK-to-WL timing is one of the most critical timings in high-speed SRAM design (about 50% of SRAM access time). It is important for both Read and Write. For Read, after the WL turns on, data from the cell appears on the bit-line, is sensed, and is sent to the data-path blocks. For Write, after the WL turns on, the data to be written is presented on the bit-line, which flips the cell. In high-speed designs the WL is "pulsed" only until the data is sensed (Read) / written (Write).
Clock to WL path
Location of Input buffers
Address Decoding/Multiplexing (PR Generation)
Row Redundancy using dynamic logic
Long RC lines are driven by big drivers switching at the regulated VDD supply, which increases current consumption in the set-up phase. If input buffers are kept close to the pad, the address sees a long RC line from the input buffer to the master/slave latch (typically kept at chip centre). Current consumption is a function of the address change pattern (more for address complement). Typical distance for a 36M (0.11um) part from the bond-pad to the middle of the chip is 7.5K.
Alternative: place the input buffer at chip centre and route the pad signal all the way. This increases the pin capacitance, as the entire 7.5K routing is seen on the bond-pad.
The entire RC line is driven by the external driver and not by the regulated VDD supply, so address switching current is independent of the address pattern.
Width of the pad signal is decided based on the ESD requirement plus pin capacitance.
The higher the width, the better for ESD, but the input capacitance becomes higher. Layout should avoid 90-degree bends for better signal waveforms; use 45-degree bends instead. Use top metal for better EM and lower capacitance. A low-cap ESD structure is required to meet the pin-capacitance spec. Shield the pad signals on both sides with supply lines to avoid delay variations due to signal switching conditions.
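The delay penalty of routing the pad signal all the way to chip centre can be estimated with an Elmore model of the distributed RC line. The sketch below is illustrative only: the per-micron resistance and capacitance values are assumptions, not numbers from this design.

```python
# Sketch: Elmore delay of a long pad-to-centre route modeled as N RC segments.
# The per-micron r and c values used below are illustrative assumptions.

def elmore_delay(r_per_um, c_per_um, length_um, segments=100):
    """Elmore delay of a distributed RC line split into equal segments."""
    dr = r_per_um * length_um / segments   # resistance of one segment
    dc = c_per_um * length_um / segments   # capacitance of one segment
    delay = 0.0
    upstream_r = 0.0
    for _ in range(segments):
        upstream_r += dr
        delay += upstream_r * dc           # each cap sees all upstream resistance
    return delay

# A 7.5K (7500 um) route with assumed 0.08 ohm/um and 0.2 fF/um:
d = elmore_delay(0.08, 0.2e-15, 7500)      # approaches 0.5*R_total*C_total
```

With these assumed parasitics the route alone contributes roughly half an RC time constant (~450ps here), which is why the driver location and metal choice matter so much.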
Issues: Shielding
Switching of lines in the flow-through path depends on external conditions, i.e. the switching pattern during setup/hold and the rise/fall times.
This effect can be severe, particularly if the pad signal is routed directly to the chip middle. The amount of dips/bumps on the supply lines depends on the rise/fall time of the signal. No logic should be placed on the shielding VDD/GND lines because of these dips/bumps. Hence minimum-width VDD/GND lines should be used for shielding to minimise the area overhead.
Master Latch
Required to meet hold time. Located in HMIDDLE to minimize hold-time skews across different addresses. Kept as close as possible to CLKGEN. IN-to-OUT delay should be minimal for good setup.
TG latch to register decoded addresses. The driver input is pulled low through 4 series nMOS, so speed suffers. High gate load on address inputs As/NAs.
Dynamic logic for the slave latch. The driver input is driven through 2 series nMOS (faster). Less input gate load on As/NAs. Area efficient (small NOR3 devices).
Master/slave latches should be placed close by so that minimum-sized devices can be used in the M-latch. Small pass-gate sizes (for CLK) in the master latch mean less gate loading for CLK and hence better hold time (1 M-latch for each address). The predecoder gate load at the input of the slave latch should be small so that the driver in the master latch (and also the pass-gate) can be kept as small as possible without sacrificing M-latch delay. Hence choice (2) for the slave latch is better.
Watch out for the connection RC (done in lower-level metal) of the clock signal to the M-latch/slave latch. Place M/S latches as close as possible to the main Clk trace (top metal). Typically an M/S latch exists for every address and control input, so the connection RC gets multiplied by the number of M/S latches. Keep the Clk connection away from the other latch connections (to reduce coupling caps).
Row-Redundancy
Evaluation is one of the bottlenecks to firing the WL (one of the gating items to fire the normal WL). Address-based redundancy is faster, but for different RD/WR address timing (e.g. QDR), PR-based redundancy is simpler.
Kept in HMIDDLE. Typically 1 redundant row per core; there is one signal carrying redundancy-eval info for every core.
A common clock controlling the WL pulse width is heavily loaded, as it has to drive decoders (with huge input cap) for both top and bottom halves of the core; local inverters are used for wave-shaping. If the clock information is instead sent through the redundancy-eval signal, it is specific to each core and hence less loaded; local wave-shaping inverters are not required.
Dynamic Row-Redundancy
Dynamic decoders have smaller area; easy to lay out in a tight space. Considerably reduces the gate load on long lines, hence sharper waveforms. Gate-sharing (series nMOS) reduces the gate load (e.g. common nMOS for PRI in the dynamic scheme). Separate WLE_NREDEN_CORESEL, as compared to a common WLE_SEL, means sharper clock waveforms without local wave-shaping inverters. For pulsed-high inputs, there are no timing-margin constraints in the dynamic scheme (evaluation during high). Less input gate loading for the dynamic scheme (3 times less loading for the SEGEN input).
Voltage sensing: the sense amplifier needs to be turned on after a delay from WL to allow the differential to build. Current sensing: the sense amplifier turns on as soon as WL turns on. Current sensing consumes more current (biasing) than voltage sensing.
Current sensing becomes risky in technologies with high leakage currents; voltage sensing can be made to work by delaying the SAMP enable further.
Two ways of connecting BL/BL_ to the SAMP: (a) through the column pass-gate (pMOS); (b) directly (no column pass-gate). In case (a), typically a 16:1 or 32:1 mux is used, i.e. 1 SAMP for 16/32 bit-line pairs.
In case (b) there is 1 SAMP per BL-BL_ pair: more circuitry in the core for the SAMP and related logic, and the SAMP layout has to fit in the BL-BL_ pitch, which favours the choice of a voltage SAMP.
The column pass-gate mux adds routing parasitics and wired-OR gate loading in the weakly driven critical path of the voltage differential on the SAMP. Typically pMOS are used, as the bit-lines are around VDD in read mode (pMOS passes 1 efficiently). A 32:1 mux involves the drain load of 32 pMOS and around 50um of routing; this adds 500ps of delay for the 80mV differential to appear on the SAMP nodes (CSM 0.13um). The WL pulse has to be wider to get the desired differential, which means a higher BL/BL_ split and hence higher precharge current / longer precharge time at the end of the read cycle. Not much speed gain with a 16:1, 8:1, or 4:1 mux.
A 16:1 mux is kept at the output of the SAMP, where the signals are strongly driven CMOS levels (faster).
The SAMP has to be laid out in the BL/BL_ pitch, so the SAMPBANK has a lot of devices packed in a tight area, and the SAMP input nodes are liable to coupling. BL/BL_ pitch (core DRC rules) and SAMP transistor rules (periphery DRC rules) vary across technologies, so the layout is difficult to port. More devices mean more leakage paths per BL/BL_ in the SAMPBANK, hence not suitable for designs with a low leakage-current spec. To minimize leakage current, the transistor length for logic in the SAMPBANK area is kept above the minimum channel length, thereby putting more constraints on the pitched layout design.
COLRED logic repairs faulty bit-lines, which translates to faulty-I/O repair. Typically a redundant bit-line can repair more than one I/O. The COLRED logic contains fuses corresponding to each column address plus I/O fuses.
Two schemes to mux the data from redundant and normal bit-lines:
(a) Provide separate RIOR lines for each redundant bit-line data-out and put a mux to select between RIOR and IOR lines. (b) Use the IOR lines to carry both normal and redundant bit-line data (wired-OR I/O line); the normal path must be disabled in case of COLRED evaluation.
No muxing at the immediate output of the SAMP; muxing at the input of the final read driver.
Scheme (a) involves a mux in the data path (post read driver), hence its ideal location is away from VMID and close to the I/O-path circuitry. Scheme (b) requires that evaluation happen with sufficient margin to disable the normal path, hence it should be kept close to the address latches for faster evaluation (in VMID).
An advantage of scheme (b) is less switching current, due to less routing from address to eval logic.
Echo Clocks
Prefetching for DDR operation
Write DIN muxing in case of address match (coherency)
Read I/O line muxing for X36, X18, X9 etc.
Output Stage
Wider data-valid window for QDR II; data is more or less coincident with K/K_ rising. The 1st data is half-cycle delayed in QDR II.
The input data-valid window required by any chip equals set-up time + hold time (tS + tH). Some ASICs/FPGAs have tS/tH of 1ns. QDR won't be able to interface with a chip with tS = tH = 1ns (data validity = 2ns), but QDR II will. A shorter data-valid window means little margin for the set-up/hold window of the interfacing logic IC.
Logic gate delay at the fast corner (higher voltage / cold temp / fast process) is about half that at the slow corner (low voltage / high temp / slow process). Hence tCO = 2.5ns (slow corner) results in tDOH = 1.2ns (fast corner). tCO includes external-clk-to-Q-latch-clk, Q-latch delay, predriver delay and output-buffer delay (quite a few gates). Hence the way to increase the data-valid window is to reduce tCO, i.e. reduce the number of gates in the clock-to-output path.
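The data-valid window arithmetic above can be made concrete with the usual DDR relation: one data eye per half cycle, shrunk by the spread between slow-corner tCO and fast-corner tDOH. A minimal sketch, assuming a 167 MHz operating frequency for illustration (the 2.5ns/1.2ns numbers are from the text):

```python
# Sketch: output data-valid window for a DDR output. The half-period minus
# the (tCO - tDOH) corner spread gives the eye width. The 167 MHz frequency
# is an assumed example; tCO/tDOH are the slow/fast corner values above.

def data_valid_window_ns(freq_mhz, tco_ns, tdoh_ns):
    half_period = 1000.0 / freq_mhz / 2.0   # DDR: one data eye per half cycle
    return half_period - (tco_ns - tdoh_ns)

w = data_valid_window_ns(167, 2.5, 1.2)     # ~1.7ns eye with these numbers
```

Cutting tCO (fewer gates in the clock-to-output path, or a DLL) directly widens the window, which is the motivation developed in the following slides.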
Traditional sync SRAMs have a single clock to control input and output. Flight time delays both the clock and the output data for the farthest SRAM.
Flight-time variation makes it difficult for the controller to latch the data.
Input clocks (K/K_) for command/address registration; output clocks (C/C_) for read output timing.
Separate command and I/O clocks eliminate flight-time differences for data (to be latched at the controller). The farthest SRAM is clocked for I/O first, and the returned clock is used to latch the data.
To avoid loading, separate C/C_ clocks can be routed from the controller to the 2 SRAMs. The nearest SRAM has the maximum skew between command and I/O clocks (spec for K/ to C/).
The DLL makes output data coincident with the rising edges of the C and C_ clocks (low tCO). Internally, DLL clocks are generated tCO earlier than the output clocks. The external tCO and tDOH specs come from the slow and fast corners respectively. Because of the small positive tCO, tDOH is negative.
DLL Issues
The DLL needs some number of clock cycles to lock after power-up (1024 cycles). The DLL is tuned for a particular frequency range, which puts a minimum-frequency spec on the QDR, unlike conventional sync SRAMs. The DLL can be turned off by a pin called DOFF; QDR II timing then becomes similar to QDR.
In QDR, hold time is positive (1.2ns). Hence the QDR controller can use K_/ to latch the 1st data, K/ to latch the 2nd data, and so on.
CQ & CQ_: free-running clocks, same frequency as the input clocks (C-C_ or K-K_), with a constant timing relationship between the data coming out of the QDR and the echo clocks.
The echo clock helps in generating DLL clocks that are in advance of the external clocks. Aim of the DLL: make CQ/CQ_ and Q coincident within tCO (0.45ns) of C/C_.
CQ/CQ_ are specific to each SRAM. This allows a point-to-point link between controller and SRAMs for data latching.
Constant timing relation between data and CQ clocks, so data can be latched with 100% accuracy. The rising edge of the echo clocks always occurs 0.1ns/0.2ns before the data. Echo clocks can be delayed to arrive centered with the data at the controller input, for equal tS/tH. This allows the use of QDR SRAMs from multiple sources.
Corresponding to read address A1, four data are output (Q11, Q12, Q13, Q14). Q11 and Q12 are accessed simultaneously in the Read-A1 cycle; similarly Q13-Q14 are accessed simultaneously in the next cycle (prefetch). Hence for a X36 (36-output) part, the internal data path contains 72 read data lines.
Read address A3 = write address A2 (previous match): Q11 = D21; Q12 = D22.
Read address A1 = write address A2 (current match): Q31 = D21; Q32 = D22. In both cases the write DIN should be routed to DOUT, ignoring the memory read.
I/O Muxing
Typically X36, X18, X9 options are provided on a single die (with bond option), hence the muxing. More options mean more muxing logic.
Q-Latch
Input(s) change every clock cycle (prefetch); outputs change every half cycle (DDR operation). The output driver is CMOS. Output-enable information is combined before the data goes to the Q-latch, to prevent junk data being driven on the output pins. For tristate: pMOS gate = VDD, nMOS gate = GND. Separate latches for the pull-down and pull-up paths, with separate set/reset logic for tristate. Logic should be added to tristate the output during power-up.
Programmable Impedance
Output drivers can be configured to have variable impedance (between 35 and 75 Ohms) to match load conditions. The chip reconfigures its output impedance to match the load every 1024 cycles, to track supply-voltage and temperature variations.
Output impedance can be changed on the fly during active read cycles (the pull-down nMOS is configured while the output data is '1', and vice versa).
Output predriver/driver
Ideally, separate I/O should make zero DIN latency possible. But zero-cycle write latency makes data coherency (latest data output) an issue.
Tight conditions in case of address match, i.e. A1 = A2: D21 would have to be routed through the mux to the output buffer within half a cycle.
A1 = A2 is not a match condition requiring DIN information, since Q11 comes before D21.
A2 = A3 is a valid match; D21/D22 arrive much before they are required to be routed to the output.
The WL corresponding to the write can't be turned on in the A2 cycle, since the data is delayed by 1 cycle. A read command can be given with a different address (A3) during the D21-D22 cycle, so the write can't be performed in the D21-D22 cycle either. Hold the data in registers till the next valid write command is given: Write-A3 is done when the Write-A4 (i.e. next valid write) command is given.
Interesting scenario
There is always one unfinished write command. The write address to be written (till the next valid write) is stored in a register which gets updated only on a valid write.
An address-match scenario can happen after any number of cycles (read address = unfinished write address).
D21-D22 are written together in the 1st half cycle; D23-D24 are written together in the 3rd half cycle.
The write is actually performed at the next write command, i.e. D21-D22 are "actually" written in the 1st half of the Write-A5 cycle, and D23-D24 are written in the 1st half of the next DSEL cycle (D51). Minimum 2 cycles between successive writes.
Data could be written in the same cycle as given, but to simplify the design, D21-D22 are written together during the Write-A5 half cycle. Minimum one cycle between successive writes.
There is a half-cycle delay between DIN at the pin and DIN actually being written in B2/B4 respectively; the DIN path is not speed critical.
Connecting BL/BL_ to the WRTDRV (2 ways): (a) through the column pass-gate (nMOS); (b) directly (no column pass-gate).
In case (a), typically a 16:1 or 32:1 mux is used, i.e. 1 WRTDRV for 16/32 bit-line pairs.
In case (b) there is one WRTDRV per BL-BL_ pair: more circuitry in the core for the WRTDRV and related logic, and the WRTDRV layout has to fit in the BL-BL_ pitch.
Since either BL or BL_ is driven to GND to write the cell, an nMOS mux is used. During write, a 0.7/0.13 nMOS device comes in series with the big write-driver pulldown (w = 3.4), effectively reducing its strength to flip the cell. At 1.2V/TT/120C, this series nMOS adds about 800ps of delay from TBUS \ to BL \ (16:1 CPG and 20u TBUS length). Not much speed gain with a 16:1, 8:1, or 4:1 mux.
There is no pull-up in the WRTDRV: since SEN2 is low during write, the SAMP pMOS keeps BL/NBL high. If BL is driven low by turning on MN8, this low passes (weakly) through the SEN2 pass-gate MP58 and turns on the pull-up of inverter I4, which keeps NBL at VDD.
Like the SAMP, the write driver has to be laid out in the BL-BL_ pitch. Its internal nodes are at digital levels (VDD/GND), unlike the analog voltages (voltage differential) in the SAMP, so there are fewer layout constraints/requirements. Since the logic is repeated for every BL/BL_ pair, channel lengths are kept above the minimum length to limit leakage current (standby-current spec).
Bit-lines are precharged to VDD between active cycles (RD/WR). During write, either BL or NBL is driven fully to ground, hence the BL swing during equalisation is very high. Write-to-Read equalisation is one of the most important critical timings in high-speed SRAMs. During read, because of the pulsed WL, typical BL/NBL splits are around 200+ mV, hence precharge after read is not critical.
Routing for the back-end EQ is longer, hence a smaller pMOS is used for back-end EQ.
The size of the LOGIC EQ pMOS is determined by the WR-to-RD timing. NEQ sees a big load; rise/fall time is an issue for faster EQ.
Equalisation takes more time only for the BL or NBL which is driven low.
Typically only a few bit-lines are driven low in a coreseg (e.g. in the ALSC QDR SRAM only 6 out of 192 bit-lines are driven low during WR).
The CPG selects which BL/NBL is driven low during write, hence it can be used to selectively turn on big EQ devices during precharge. Back-end EQ devices can be sized to take care of equalisation after read. Hence, to save current, big EQ devices should be turned on only after a write.
The big EQ device turns on only for the BL/NBL pair being written; EQ is less loaded.
The latch should be powered up so that NEQ_CPG = high. 8 transistors per BL/NBL; 2 leakage paths per BL pair. Difficult for pitched layout.
Fewer devices. NEQ_CPG goes low only for CPG = 1 during EQ high; NEQ_CPG floats for BLs with CPG = 0 during EQ high.
During standby, NEQ_CPG floats low for the last-written bit-line but floats high for all other BLs. Only 1 leakage path. Easy for pitched layout (3 transistors).
EQ behaviour changed: the default is low, with a self-timed pulse after WL \. Ensure EQ is low during power-up. NEQ_CPG floats high for BLs with CPG = 0 during EQ high. During standby, NEQ_CPG is taken solidly high (better!).
Separate supply for the SAMP to isolate its switching noise from the regular VDD rail. Max VDD under the minimum-current condition.
Decaps kept under top and (top-1) metal lines increase the capacitance by < 2% if thin orthogonal Metal1 is used for the supply connection of the decaps. A higher L for decaps means more decap can be laid out in a given area, but decap effectiveness reduces because of the series resistance due to the higher L. The length of decaps should be kept around 12 times the minimum channel length, to balance the parasitic series resistance of the decap transistor against the amount of decap. Put decaps on the VDDQ/VSSQ bus as part of the output-buffer layout. Keep decaps a little away from ESD structures, as they tend to store charge during an ESD event.
Layout: Drivers
Avoid keeping too many big drivers nearby which are likely to switch simultaneously.
Top metal has a higher thickness, hence the highest coupling cap. Top-metal and (top-1)-metal capacitance differ by 10% for a routing length of 3.5K. Alternate routing between top and (top-1) metal: use top metal for relatively longer distances and (top-1) metal for shorter ones. Delayed signals like sense clocks, I/O precharge, address pulse etc. can be routed in (top-1) metal; the routing delay can be taken into account in the overall timing. Setup-time-critical signals like redundancy info and WL clocks should be routed in top metal.
Many Thanks !!
To all QDR team members (SQ/SF; design and layout) for implementing the schemes and thorough simulations.