STMicroelectronics
Supply
Technical Change
Classification of Memories
RWMemory
Random Access Non-Random Access
NVRWM
ROM
SRAM
- Merit: high speed or low power
- Demerit: expensive, low density
Die size increases by a factor of 1.5 per generation, combined with cell size shrinking by a factor of 2.6 per generation
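The two scaling factors can be combined into a single capacity multiplier. The sketch below assumes that bits per chip scale as die area divided by cell area; the function name and that scaling model are illustrative assumptions, not taken from the slides.

```python
def capacity_growth(die_factor=1.5, cell_shrink=2.6, generations=1):
    """Bits-per-chip multiplier after a number of generations.

    Assumes capacity ~ die area / cell area (a simplification).
    """
    return (die_factor * cell_shrink) ** generations

# Roughly a 4x capacity step per generation under this assumption.
print(round(capacity_growth(), 2))
print(round(capacity_growth(generations=3), 1))
```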
*MB=Mbytes
Techniques must be applied to reduce production cost
Often, memories are the launch vehicles for a technology node
- Leads to volatile prices
On-Chip Cache
Registers
Datapath
[Figure: memory hierarchy; speed (ns): 1s, 10s, 100s; sizes: Ks, Ms]
Memory Interfaces
Address inputs
- May be latched with strobe signals
Data I/Os
- For large memories, data input and output are multiplexed on the same pins, selected with /WE
Refresh signals
[Figure: N words (Word 1 ... Word N-1), each selected by its own line S1 ... SN-1 from a 1:N decoder; a very inefficient design]
[Figure: array of R rows selected by lines S0 ... SR-1 from the row decoder; each row holds C words of M bits (KxM bits wide); a column select picks one word; N = R*C]
Row Decoder
Word Line
Column Decoder
[Figure labels: Global Data Bus, Control Circuitry, Block Selector, Global Amplifier/Driver, I/O]
Advantages:
1. Shorter wires within blocks
2. Block address activates only 1 block => power savings
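The N = R*C organization and the block selector both come down to slicing one flat address into fields. A minimal sketch follows, assuming the column index sits in the low-order positions, the row index next, and the block index on top; this field order and the function name are hypothetical choices for illustration.

```python
def split_address(addr, rows, cols, blocks=1):
    """Split a flat word address into (block, row, column).

    Field order (column lowest, block highest) is an assumed mapping.
    """
    if not 0 <= addr < rows * cols * blocks:
        raise ValueError("address out of range")
    col = addr % cols
    row = (addr // cols) % rows
    block = addr // (rows * cols)
    return block, row, col

# 4 blocks, each with N = R * C = 256 * 4 = 1024 words.
print(split_address(0, 256, 4, 4))
print(split_address(1029, 256, 4, 4))
```

Only the block named by the top field needs to be activated, which is where the power saving comes from.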
2. Photolithography
a) Photoresists
b) Photomask and Reticles
c) Patterning
Lithography Requirements
Power o/p
Pulse Rate
n-well
active area after LOCOS
p-type
Paulo Moreira
Technology
n+
p-doping
n+
p+
n-doping
p+ n-well
Process enhancements
Up to eight metal levels in modern processes
Copper for metal levels 2 and higher
Stacked contacts and vias
Chemical Mechanical Polishing (CMP) for technologies with several metal levels
For analog applications some processes offer:
- capacitors
- resistors
- bipolar transistors (BiCMOS)
Metalisation
Metal deposited first, followed by photoresist
Then metal etched away to leave pattern, gaps filled with SiO2
Pre-clean
25 nm
Electroplating
+ 100-200 nm
CMP
FPU
Synchronous
Flow Through / Pipelined
Zero Bus Turnaround
Double Data Rate
Dual Port
Interleaved / Linear Burst
SRAM Array
SL0
Array Organization
common bit precharge lines
need sense amplifier
SL1
SL2
Write Enable is usually active low (WE_L)
Din and Dout are combined to save pins:
- A new control signal, output enable (OE_L), is needed
- WE_L = 0, OE_L = 1: D serves as the data input pin
- WE_L = 1, OE_L = 0: D is the data output pin
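The direction of the shared D pin follows directly from the two active-low controls. The sketch below encodes the two cases stated on the slide; the handling of the remaining two combinations (both deasserted, both asserted) is an assumption for illustration, as is the function name.

```python
def data_pin_mode(we_l: int, oe_l: int) -> str:
    """Direction of the shared D pin for active-low WE_L and OE_L."""
    if we_l == 0 and oe_l == 1:
        return "input"    # write: D is driven from outside the memory
    if we_l == 1 and oe_l == 0:
        return "output"   # read: D is driven by the memory
    if we_l == 1 and oe_l == 1:
        return "high-Z"   # assumption: neither asserted leaves D undriven
    return "invalid"      # assumption: both asserted is treated as illegal

for we_l, oe_l in [(0, 1), (1, 0), (1, 1), (0, 0)]:
    print(we_l, oe_l, data_pin_mode(we_l, oe_l))
```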
WL[2] WL[3]
A0 A0!
write circuitry
CAS
RAS-CAS timing
[Figure: read timing for signals D, A, OE_L and WE_L: for each read address on A, D goes from High Z through junk to Data Out]
[Figure: SRAM read timing with WE_L = HIGH; signals: address (stable), CS_L, OE_L, DOUT (valid); timing parameters: tAA, tACS, tOE, tOH, tOZ]
tAA
/WE controlled
/CS controlled
tDH
Write driver
tWP-tDW
SRAM Architecture
Write:
- set bit lines to the new data value (the complementary bit line carries the opposite of b)
- raise word line to high
- sets cell to new state (may need to flip old state)
Read:
- set bit lines high
- set word line high
- see which bit line goes low
Inverter
- Amplifies, negative gain
- Slope < -1 in middle
- Saturates at ends
Inverter pair
- Amplifies, positive gain
- Slope > 1 in middle
- Saturates at ends
Bistable Element
Stability
- Requires Vin = V2
- Stable at endpoints: recovers from perturbation
- Metastable in middle: falls out when perturbed
6T Bistable Latch
High resistance poly
4T Bistable Latch
Reading a Cell
Icell
$\Delta V = \frac{I_{cell}\, t}{C_b}$
Sense Amplifier
Writing a Cell
0 -> 1
1 -> 0
AC
Noise sources:
- Alpha particles
- Crosstalk
- Voltage supply ripple
- Thermal noise
SNM (static noise margin) = maximum value of Vn that does not flip the cell state
SNM
2
SNM
1
[Figure: SRAM cell; labels: VDD, PMOS pull-up, Q, Q/, NMOS pull-down, GND, SEL, SEL MOSFET substrate connection]
T word
word word
VDD
[Figure: SRAM cell layout; labels: M2, T1, T2, T5, T6, B, Vdd, Vss; 4 x 4 array built by abutment]
[Figure: SRAM cell layouts and schematics; labels: T1 to T6, R2, Q, BIT, BIT!, BL, WL, Word Line, Vdd, Vss, GND]
6T - 4T Cell Comparison
6T cell
- Merits: faster, better noise immunity, low standby current
- Demerits: large size due to 6 transistors
4T cell
- Merits: smaller cell (only 4 transistors), HR poly stacked above transistors
- Demerits: additional process step due to HR poly, poor noise immunity, large standby current, thermal instability
Precharge
Column Decode
Sense Amp
Row Decode
2^n rows, 2^m * k columns
Partitioning summary
Partitioning involves a trade-off between area, power and speed
For high speed designs, use short blocks (e.g. 64 rows x 128 columns)
- Keep local bitline heights small
For low power designs, use tall narrow blocks (e.g. 256 rows x 64 columns)
- Keep the number of columns the same as the access width to minimize wasted power
Redundancy
[Figure: memory array with redundant rows and redundant columns, a fuse bank, a row decoder fed by the row address, and a column decoder fed by the column address]
Periphery
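[Figure: address transition detection (ATD): address inputs A0, A1, ... each pass through a DELAY td element and contribute to the ATD signal]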
Row Decoders
[Figure: row decoder for WL0, WL1, ... built from predecoded pairs: the four combinations of A0, A1 and the four combinations of A2, A3 (true and complemented inputs)]
Splitting decoder into two or more logic layers produces a faster and cheaper implementation
and so forth
A0
A1
A2
A3
Dynamic Decoders
[Figure: two dynamic 2-to-4 decoders with precharge devices; address inputs A0, A1 and their complements; outputs WL0 to WL3; supplies VDD and GND]
!A0
A0
!A1
A1
Precharge/
Back
Decoders
An n:2^n decoder consists of 2^n n-input AND gates
- One needed for each row of memory
- Build AND from NAND or NOR gates
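A behavioural sketch of such a decoder: one n-input AND per row, each fed the true or complemented form of every address bit. The MSB-first list convention and the function name are assumptions made for this example.

```python
def decode(address_bits):
    """n:2^n decoder. address_bits is an MSB-first list of 0/1 values.

    Returns the 2^n word lines as a one-hot list.
    """
    n = len(address_bits)
    lines = []
    for row in range(2 ** n):
        # Bit pattern this row's AND gate is wired to recognise.
        want = [(row >> (n - 1 - i)) & 1 for i in range(n)]
        # n-input AND of true/complemented address bits.
        lines.append(int(all(a == w for a, w in zip(address_bits, want))))
    return lines

print(decode([1, 0]))  # A1 = 1, A0 = 0
```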
Make devices on address lines minimum size
Scale devices on decoder outputs to drive word lines
[Figure: 2:4 decoder (A1, A0 to word0 ... word3) in static CMOS and in pseudo-nMOS, annotated with relative transistor widths]
Decoder Layout
Decoders must be pitch-matched to SRAM cell
- Requires very skinny gates
A3 VDD A3 A2 A2 A1 A1 A0 A0
word
Large Decoders
For n > 4, NAND gates become slow
- Break large gates into multiple smaller gates
A3 A2 A1 A0
word0
word1
word2
word3
word15
Predecoding
- Group address bits in predecoder
- Saves area
- Same path effort
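Predecoding can be checked behaviourally: two 2:4 predecoders followed by 2-input ANDs select the same word line as a flat 4:16 decoder. The sketch below also counts gate inputs as a rough stand-in for area; that proxy, the grouping into (A3, A2) and (A1, A0), and the function names are assumptions for illustration.

```python
from itertools import product

def one_hot(bits):
    """Flat decoder: one-hot list for an MSB-first tuple or list of bits."""
    index = int("".join(map(str, bits)), 2)
    return [int(i == index) for i in range(2 ** len(bits))]

def predecoded(a3, a2, a1, a0):
    """16 word lines from two 2:4 predecoders followed by 2-input ANDs."""
    hi = one_hot([a3, a2])  # predecoder for the upper address pair
    lo = one_hot([a1, a0])  # predecoder for the lower address pair
    return [h & l for h in hi for l in lo]

# Same selected word line as the flat 4:16 decoder for every address.
assert all(predecoded(*b) == one_hot(b) for b in product((0, 1), repeat=4))

flat_inputs = 16 * 4             # sixteen 4-input ANDs
pre_inputs = 2 * 4 * 2 + 16 * 2  # eight 2-input ANDs, then sixteen 2-input ANDs
print(flat_inputs, pre_inputs)
```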
A3 A2 A1
A0
word2 word3
word15
Column Circuitry
Some circuitry is required for each column
- Bitline conditioning
- Sense amplifiers
- Column multiplexing
Need hazard-free reading and writing of RAM cell
Column decoder drives a MUX; the two are often merged
A1
S3 S2 S1
A0
S0
Data !Data
Advantage: speed, since there is only one extra transistor in the signal path
Disadvantage: large transistor count
B2 B3
B4 B5
B6 B7
B0 B1
B2 B3
B4 B5
B6 B7
Bitline Conditioning
Precharge bitlines high before reads
bit bit_b
BL
!BL
BL
!BL
equalization transistor - speeds up equalization of the two bit lines by allowing the capacitance and pull-up device of the nondischarged bit line to assist in precharging the discharged line
$t = \frac{R\,C\,\Delta V}{V_{dd}}$, where R is the transistor resistance and $V_{dd}/R$ approximates the cell current
Cannot easily change R, C, or Vdd, but can change $\Delta V$, i.e. the smallest sensed voltage
- Can reliably sense $\Delta V$ below 50 mV
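The relation t = R·C·ΔV/Vdd shows why shrinking the sensed swing is the practical lever. A small numeric sketch follows; the resistance, capacitance and supply values are hypothetical and chosen only to show the linear dependence on ΔV.

```python
def sense_delay(r_ohm, c_farad, delta_v, vdd):
    """Time for the cell to develop delta_v on the bitline: t = R*C*dV/Vdd."""
    return r_ohm * c_farad * delta_v / vdd

# Hypothetical values: 10 kOhm access path, 1 pF bitline, 2.5 V supply.
for dv in (0.5, 0.2, 0.05):
    print(f"dV = {dv} V -> t = {sense_delay(10e3, 1e-12, dv, 2.5):.2e} s")
```

Under these assumed values, sensing 50 mV instead of 500 mV cuts the bitline development time by a factor of ten.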
Sense Amplifiers
$t_p = \frac{C_b\,\Delta V}{I_{cell}}$: $C_b$ is large and $I_{cell}$ is small, so make $\Delta V$ as small as possible
[Figure: sense amplifier (s.a.) turning a small input transition into the output]
[Figure: (a) SRAM sensing scheme: SRAM cell i on WL i, bitline pair with precharge (PC) and equalization (EQ), differential sense amp with inputs x, y and outputs D, enabled by SE; (b) double-ended current mirror amplifier; (c) cross-coupled amplifier]
- Initialized at its metastable point with EQ
- Once an adequate voltage gap is created, the sense amp is enabled with SE
- Positive feedback quickly forces the output to a stable operating point
Sense Amplifier
[Figure: sense amplifier with isolation transistors and a regenerative amplifier; signals: bit, bit_b, word, sense clk]
[Figure: waveforms for the bit lines (200 mV difference), wordline (2.5 V) and sense clk]
Twisted Bitlines
Sense amplifiers also amplify noise
- Coupling noise is severe in modern processes
- Try to couple equally onto bit and bit_b
- Done by twisting bitlines
b0 b0_b b1 b1_b b2 b2_b b3 b3_b
Transposed-Bitline Architecture
[Figure: (a) straightforward bitline routing; (b) transposed bitline architecture; cross-coupling capacitance Ccross between neighbouring bitlines feeding each sense amplifier (SA)]
DRAM in a nutshell
- Based on capacitive (non-regenerative) storage
- Highest density (Gb/cm^2)
- Large external memory (Gb) or embedded DRAM for image, graphics, multimedia
- Needs periodic refresh -> overhead, slower
row address
Column Address
data
Memory Systems
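[Figure: memory system: n address bits enter the DRAM controller, and n/2 multiplexed address bits go to the 2^n x 1 DRAM chip; also shown: memory timing controller, bus drivers, data width w]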
A
9
256K x 8 DRAM
Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
Din and Dout are combined (D):
- WE_L is asserted (Low), OE_L is deasserted (High): D serves as the data input pin
- WE_L is deasserted (High), OE_L is asserted (Low): D is the data output pin
DRAM Operations
Write
- Charge bitline HIGH or LOW and set wordline HIGH
Read
- Bit line is precharged to a voltage halfway between HIGH and LOW, and then the word line is set HIGH.
- Depending on the charge in the cap, the precharged bitline is pulled slightly higher or lower.
- Sense amp detects the change
Word Line
. . .
Bit Line
Sense Amp
DRAM Access
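1M DRAM = 1024 x 1024 array of bits
- 10 row address bits arrive first, with the Row Access Strobe (RAS)
- 1024 bits are read out
- 10 column address bits arrive next, with the Column Access Strobe (CAS)
- The column decoder selects among the 1024 bits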
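The multiplexed addressing of the 1024 x 1024 array amounts to sending a 20-bit address over 10 pins in two halves. A minimal sketch, assuming the row is taken from the upper half of the address; that mapping and the function name are illustrative choices.

```python
def mux_address(addr, bits=10):
    """Return the (row, column) halves sent over the shared address pins.

    Assumes the row is the upper half of the address (hypothetical mapping).
    """
    if not 0 <= addr < 1 << (2 * bits):
        raise ValueError("address out of range")
    mask = (1 << bits) - 1
    row = (addr >> bits) & mask   # latched with RAS
    col = addr & mask             # latched with CAS
    return row, col

print(mux_address(0xABCDE))
```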
[Figure: DRAM read cycle timing for the 256K x 8 DRAM: RAS_L, CAS_L, WE_L and OE_L versus the multiplexed address (row address, then column address) and D (High Z, then Data Out); annotated with read access time and output enable delay]
[Figure: DRAM write cycle timing for the 256K x 8 DRAM (9 address pins A): RAS_L, CAS_L, WE_L and OE_L versus the multiplexed address (row address, then column address) and D (Data In); annotated with WR access time]
DRAM Performance
A 60 ns (tRAC) DRAM can
- perform a row access only every 110 ns (tRC)
- perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
In practice, external address delays and turning around buses make it 40 to 50 ns
These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.
- Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins
- 180 ns to 250 ns latency from processor to memory is good for a 60 ns (tRAC) DRAM
Read:
1. Precharge bit line
2. Select row
3. Cell and bit line share charges
- Very small voltage changes on the bit line
- Can detect changes of ~1 million electrons
Refresh:
1. Just do a dummy read to every cell.
DRAM architecture
$\Delta V = (V_{Cs} - V_{BL})\,\frac{C_s}{C_s + C_b}$
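Charge sharing between the cell capacitor Cs and the bitline capacitance Cb gives a bitline swing of (V_cell - V_BL) · Cs / (Cs + Cb). The sketch below evaluates it for a 30 fF cell; the 300 fF bitline, the 2.5 V supply and the function name are hypothetical values picked for illustration.

```python
def bitline_swing(v_cell, v_bl, c_s, c_b):
    """Bitline voltage change after charge sharing with the cell capacitor."""
    return (v_cell - v_bl) * c_s / (c_s + c_b)

VDD = 2.5                  # hypothetical supply
C_S, C_B = 30e-15, 300e-15 # 30 fF cell, assumed 300 fF bitline

print(bitline_swing(VDD, VDD / 2, C_S, C_B))  # stored "1"
print(bitline_swing(0.0, VDD / 2, C_S, C_B))  # stored "0"
```

With a Vdd/2 precharge the two swings are equal and opposite, and only a small fraction of the supply, which is why a sense amplifier is required.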
Sense Amplifier
Refreshing Overhead
Leakage:
- junction leakage is exponential with temperature
- 25 msec @ 80 °C
- Decreases noise margin, destroys info
All columns in a selected row are refreshed when read
- Count through all row addresses once per 3 msec (no write possible then)
Overhead @ 10 nsec read time for 8192*8192 = 64Mb:
- 8192 * 1e-8 / 3e-3 = 2.7%
Requires additional refresh counter and I/O control
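The overhead figure is the fraction of each refresh period spent on one dummy read per row. A small sketch reproducing the slide's arithmetic; the function name is an illustrative choice.

```python
def refresh_overhead(rows, t_read_s, refresh_period_s):
    """Fraction of time the array is busy refreshing (one read per row)."""
    return rows * t_read_s / refresh_period_s

# 8192 rows, 10 ns per row read, every row visited once per 3 ms.
print(f"{refresh_overhead(8192, 10e-9, 3e-3):.1%}")
```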
Dummy cells
Bitline
Bitline
Vdd/2
Vdd/2 precharge precharge
Wordline
Needs
- A method of generating signal swing of bit line
Operation:
- Dummy cell is C
- Active wordline and dummy wordline are on opposite sides of the sense amp
- Amplify difference
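[Figure: dummy columns and data columns on the BL pairs on either side of the sense amplifier]
$\Delta V_{\text{"1"/"0"}} = \pm\frac{V_{dd}}{2}\cdot\frac{1}{1 + C_b/C_s}$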
Double Bitline
Data
- SA outputs D and its complement are precharged to VDD through Q1, Q2 (Pr=1)
Dummy
- Reference capacitor, Cdummy, is connected to a pair of matched bit lines and is at 0V (Pr=0)
- Parasitic cap Cp2 on one bit line is ~2 Cp1 on the other, which sets up a differential voltage LHS vs. RHS due to the rise time difference
- SA outputs become charged, with a small difference LHS vs. RHS
- Regenerative action of latch
DRAM Performance
[Figure: cycle time versus access time on a time axis]
[Figure: DRAM with an N x M SRAM holding the addressed row and an M-bit output]
- Only CAS is needed to access other M-bit blocks on that row
- RAS_L remains asserted while CAS_L is toggled
[Figure: timing of RAS_L, CAS_L and A (row address, then successive column addresses) for the 1st, 2nd and 3rd M-bit accesses]
Bandwidth takes into account 110 ns first cycle, 40 ns for CAS cycles
- Bandwidth for one word = 8 bytes / 110 ns = 69.36 MB/sec
- Bandwidth for two words = 16 bytes / (110 + 40 ns) = 101.73 MB/sec
- Peak bandwidth = 8 bytes / 40 ns = 190.73 MB/sec
- Maximum sustained bandwidth = (256 words * 8 bytes) / (110 ns + 256 * 40 ns) = 188.71 MB/sec
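These bandwidth figures can be recomputed directly from the stated expressions. The sketch assumes MB means 2^20 bytes, which is inferred from the quoted numbers rather than stated on the slide; the helper name is illustrative.

```python
MB = 2 ** 20  # assumption: the quoted MB/sec values use 2^20 bytes

def mb_per_s(num_bytes, time_ns):
    """Bandwidth in MB/sec for num_bytes transferred in time_ns nanoseconds."""
    return num_bytes / (time_ns * 1e-9) / MB

one_word = mb_per_s(8, 110)
two_words = mb_per_s(16, 110 + 40)
peak = mb_per_s(8, 40)
sustained = mb_per_s(256 * 8, 110 + 256 * 40)

for name, value in [("one word", one_word), ("two words", two_words),
                    ("peak", peak), ("sustained", sustained)]:
    print(f"{name}: {value:.2f} MB/sec")
```

The sustained rate approaches the peak because the 110 ns row access is amortized over 256 column accesses.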
64K DRAM
Internal Vbb generator
Boosted wordline and active restore
- eliminate Vt loss for "1"
x4 pinout
256K DRAM
Folded bitline architecture
- Noise coupling to B/Ls is common mode
- Easy Y-access
NMOS 2P1M
- poly 1: plate
- poly 2 (polycide): gate, W/L
- metal: B/L
Redundancy
1M DRAM
Triple poly planar cell, 3P1M
- poly 1: gate, W/L
- poly 2: plate
- poly 3 (polycide): B/L
- metal: W/L strap
precharge voltage
- e.g. VDD/2 for DRAM bitline
backgate bias
- reduce leakage
[Figure: charge pump driven from +Vin, shown in its charge phase and its discharge phase]
Vo = 2*Vin + 2*dV ~ 2*Vin
[Figure: boost circuit with capacitors Cf and CL: Vcf(0) ~ Vhi, VGG ~ Vhi + Vhi]
Use charge pump
Backgate bias:
- Increases Vt -> reduces leakage
- Reduces Cj of nMOST when applied to p-well (triple well process!)
- Smaller Cj -> smaller Cb -> larger readout voltage
Vdd / 2 Generation
[Figure: Vdd/2 generator with a 2 V supply; node voltages of 0.5 V, 1 V and 1.5 V; output ~1 V]
Vtn = |Vtp| ~ 0.5 V, uN = 2 uP
4M DRAM
3D stacked or trench cell
CMOS 4P1M
x16 introduced
Self refresh
Build cell in vertical dimension: shrink area while maintaining 30fF cell capacitance
Stacked-Capacitor Cells
Poly plate
[Figures: Hitachi 64Mbit DRAM cross section; Samsung 64Mbit DRAM cross section]
Shallow Trench Isolation
-> Replaces LOCOS isolation
-> Saves area by eliminating the bird's beak
Major Circuits
- Sense amplifier
- Dynamic row decoder
- Wordline driver
WL direction (row)
Column predecode
Local WL Decode
BL direction (col)
256x256
64
256