Professional Documents
Culture Documents
Microprocessor Evolution: 4004 To Pentium-4: Based On The Material Prepared by Krste Asanovic and Arvind
Microprocessor Evolution: 4004 To Pentium-4: Based On The Material Prepared by Krste Asanovic and Arvind
4004 to Pentium-4
Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
November 2, 2005
6.823 L15- 2
First Microprocessor
Emer
• 4-bit
accumulator
architecture
Image removed due to copyright
restrictions. • 8µm pMOS
To view image, visit • 2,300 transistors
http://news.com.com/Images+Moores+L
• 3 x 4 mm2
aw+turns+40/2009-1041_3-5649019-
5.html • 750kHz clock
• 8-16 cycles/inst.
November 2, 2005
6.823 L15- 3
Emer
computers
November 2, 2005
6.823 L15- 4
Emer
November 2, 2005
6.823 L15- 5
Emer
Microprocessor Evolution
Rapid progress in size and speed through 70s
– Fueled by advances in MOSFET technology and expanding markets
Intel i432
– Most ambitious seventies’ micro; started in 1975 - released 1981
– 32-bit capability-based object-oriented architecture
– Instructions variable number of bits long
– Severe performance, complexity, and usability problems
November 2, 2005
IBM PC, 1981
6.823 L15- 7
Emer
Hardware
• Team from IBM building PC prototypes in 1979
• Motorola 68000 chosen initially, but 68000 was late
• IBM builds “stopgap” prototypes using 8088 boards from
Software
• Microsoft negotiates to provide OS for IBM. Later buys and
Open System
• Standard processor, Intel 8088
• Standard interfaces
• Standard OS, MS-DOS
• IBM permits cloning and third-party software
November 2, 2005
The Eighties:
6.823 L15- 8
Emer
November 2, 2005
6.823 L15- 9
November 2, 2005
Intel Pentium 4 (2000)
6.823 L15- 10
Emer
This lecture contains figures and data taken from: “The microarchitecture of the
Pentium 4 processor”, Intel Technology Journal, Q1, 2001
November 2, 2005
6.823 L15- 11
Pentium 4 uOPs
Emer
November 2, 2005
6.823 L15- 12
• Pentium-4 family
– translation in hardware at level 1 instruction
cache refill
• Transmeta Crusoe
November 2, 2005
Pentium 4 Block Diagram
6.823 L15- 13
Emer
System Bus
Execution Units
Level 2 Cache
INTEGER AND FP
MEMORY SUBSYSTEM EXECUTION UNITS
L2 Cache
Fetch Buffer
Single x86 instruction/cycle
November 2, 2005
6.823 L15- 15
Trace Cache
Emer
BR BR BR
BR BR BR
November 2, 2005
6.823 L15- 16
br T2
...
cmp br T1 sub
T2: mov
br T2 mov sub
sub
br T3
br T3 add sub
...
mov br T4 T4:...
T3: add
sub
• Saves energy
– x86 decoder only powered up on trace cache refill
November 2, 2005
P-4 Trace Cache Fetch
6.823 L15- 18
Emer
1 TC Next IP
2 (BTB)
3 Trace Cache
TC Fetch
4
5 Drive (12K uops, 2K lines of 6 uops)
6 Alloc
7
Rename
8 Trace IP
9 Queue Microcode
10 Schedule 1 ROM
11 Schedule 2 Trace BTB
12 Schedule 3 (512 entries)
13 Dispatch 1
16-entry 6 uops every two
14 Dispatch 2
subroutine return CPU cycles
15 Register File 1
address stack
16 Register File 2
17 Execute
uop buffer
18 Flags
19 Branch Check 3 uops/cycle
20 Drive
November 2, 2005
6.823 L15- 19
Line Prediction
Emer
(Alpha 21[234]64)
Instr
Cache
Branch
Predictor
Line PC
Predictor Calc
Return
Stack
Indirect
Branch
Predictor
November 2, 2005
P-III vs. P-4 Renaming 6.823 L15- 20
Emer
1 TC Next IP
2 (BTB)
3 ROB
RF ROB
TC Fetch Data Status
4 Data Status
Frontend RAT
5 Drive EAX
EBX
6 Alloc RAT
ECX
EDX
7 EAX ESL
Rename EBX
ECX
EDL
ESP
8 EDX EBP
ESL
9 Queue EDL
ESP
Retirement RAT
EBP EAX
10 Schedule 1 EBX
ECX
EDX
11 Schedule 2 RRF ESL
EDL
12 Schedule 3 ESP
EBP
13 Dispatch 1
14 Dispatch 2 Pentium R III NetBurstTM
15 Register File 1
Figure by MIT OCW.
16 Register File 2
17 Execute P-4 physical register file separated from ROB status.
ROB entries allocated sequentially as in P6 family.
18 Flags
One of 128 physical registers allocated from free list.
19 Branch Check
No data movement on retire, only Retirement RAT
20 Drive updated.
November 2, 2005
6.823 L15- 21
1 TC Next IP
Allocated/Renamed uops
2 (BTB)
3 3 uops/cycle
TC Fetch
4
5 Drive Memory uop Arithmetic
6 Alloc Queue uop Queue
7
Rename
8
9 Queue Fast
10 Schedule 1 Memory Fast
Scheduler General Simple FP
11 Schedule 2 Scheduler Scheduler
(x2) Scheduler Scheduler
12 Schedule 3 (x2)
13 Dispatch 1
14 Dispatch 2
15 Register File 1
16 Register File 2 Ready uops compete for dispatch ports
17 Execute
18 Flags (Fast schedulers can each dispatch 2 ALU
operations per cycle)
19 Branch Check
20 Drive
November 2, 2005
6.823 L15- 22
Add/Sub FP/SSE Move Add/Sub Shift/Rotate FP/SSE-Add All Loads Store Address
Logic FP/SSE Store FP/SSE-Mul LEA
Store Data FXCH FP/SSE-Div SW Prefetch
Branches MMX
November 2, 2005
6.823 L15- 23
Register
File and
Bypass
Network
L1 Data
Cache
• Fast ALUs and bypass network runs at twice global clock speed
• All “non-essential” circuit paths handled out of loop to reduce circuit
loading (shifts, mults/divs, branches, flag/ops)
• Other bypassing takes multiple clock cycles
November 2, 2005
6.823 L15- 24
Emer
November 2, 2005
6.823 L15- 25
1
TC Next IP
2
3
TC Fetch
4
5 Drive
6 Alloc
7
Rename
8
9 Queue
10 Schedule 1 Long delay from
11 Schedule 2 schedulers to load
12 Schedule 3 hit/miss
13 Dispatch 1
14 Dispatch 2 • P-4 guesses that load will hit in L1 and
15 Register File 1 schedules dependent operations to use value
16 Register File 2
• If load misses, only dependent operations are
17 Load Execute 1
replayed
18 Load Execute 2
19 Branch Check
20 Drive
November 2, 2005
6.823 L15- 26
1
TC Next IP
2
3
TC Fetch
4
5 Drive
6 Alloc 20 cycle branch
7 mispredict penalty
Rename
8
9 Queue
10 Schedule 1
11 Schedule 2
12 Schedule 3
13 Dispatch 1
14 Dispatch 2
15 Register File 1 • P-4 uses new “trade secret” branch
18 Flags algorithm
19 Branch Check
20 Drive
November 2, 2005
6.823 L15- 27
(Alpha 21264)
Choice Prediction
PC (4,096x2b)
1 2 3 4 5 6 7 8 9 10
Fetch Fetch Decode Decode Decode Rename ROB Rd Rdy/Sch Dispatch Exec
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch Disp Disp RF RF Ex Flgs Br Ck Drive
• Performance Equation:
Time = Instructions * Cycles * Time
Program Program Instruction Cycle
November 2, 2005
6.823 L15- 29
Emer
November 2, 2005
6.823 L15- 30
November 2, 2005
Visible Wire Delay in P-4 Design
6.823 L15- 31
Emer
1
TC Next IP
2
3
TC Fetch
4
5 Drive
6 Alloc
7
Rename
8
9 Queue
10 Schedule 1
11 Schedule 2
12 Schedule 3
Pipeline stages dedicated to just
13 Dispatch 1 driving signals across chip!
14 Dispatch 2
15 Register File 1
16 Register File 2
17 Execute
18 Flags
19 Branch Check
20 Drive
November 2, 2005
P-4 Microarchitecture
6.823 L15- 32
Emer
48 GB/s
Microarchitecture Comparison
Emer
In-Order Out-of-Order
Execution Execution
• Both styles of machine can use same branch predictors in front-end fetch
pipeline, and both can execute multiple instructions per cycle
• Common to have 10-30 pipeline stages in either style of design
November 2, 2005
6.823 L15- 34
November 2, 2005