
Modern Computer Architecture

(Processor Design)

Prof. Dan Connors


dconnors@colostate.edu
Computer Architecture

 Historic definition:

Computer Architecture = Instruction Set Architecture + Computer Organization

 Famous architects: Wright, Fuller, Herzog

 Famous computer architects: Smith, Patt, Hwu, Hennessy, Patterson
Instruction Set Architecture

 Important acronym: ISA
– Instruction Set Architecture

 The low-level software interface to the machine
– Assembly language of the machine
– Must translate any programming language into this language
– Examples: IA-32 (Intel instruction set), MIPS, SPARC, Alpha, PA-RISC, PowerPC, …

 Visible to the programmer
Differences between ISAs

 Much more is similar between ISAs than different. Compare MIPS & x86:
– Instructions:
• same basic types
• different names and variable-length encodings
• x86 branches use condition codes (like MIPS floating point)
• x86 supports a (register + memory) -> (register) format
– Registers:
• both are register-based architectures
• different number and names; x86 allows partial reads/writes
– Memory:
• byte addressable, 32-bit address space
• x86 has additional addressing modes, and many instruction types can access memory
RISC vs. CISC
 MIPS was one of the first RISC architectures. It was started about 20
years ago by John Hennessy, one of the authors of our textbook.
 The architecture is similar to that of other RISC architectures, including
Sun’s SPARC, IBM and Motorola’s PowerPC, and ARM-based
processors.
 Older processors used complex instruction sets, or CISC architectures.
– Many powerful instructions were supported, making the assembly language
programmer’s job much easier.
– But this meant that the processor was more complex, which made the
hardware designer’s life harder.
 Many new processors use reduced instruction sets, or RISC
architectures.
– Only relatively simple instructions are available. But with high-level
languages and compilers, the impact on programmers is minimal.
– On the other hand, the hardware is much easier to design, optimize, and
teach in classes.
 Even most current CISC processors, such as Intel x86-based chips,
are now implemented using many RISC techniques.
Tracing an Instruction's Execution

 Steps 1–3
[Sequence of datapath diagrams tracing one instruction's execution, including the path from the instruction bits to the controller.]
Instruction Set Architecture

[Layer diagram. Software: Application, Operating System, Compiler, Firmware. Below them, the Instruction Set Architecture boundary. Hardware: Instruction Set Processor & I/O system, Datapath & Control, Digital Design, Circuit Design, Layout.]
Computer Organization

 Computer organization is the implementation of the machine, consisting of two components:
– High-level organization
• Memory system widths, bus structure, CPU internal organization, …
– Hardware
• Precise logic and circuit design, mechanical packaging, …

 Many implementations are possible for an ISA!!!
– Intel i386, i486, Pentium, Pentium II, Core 2 Duo, quad-core, …
– Performance, complexity, cost differences …

 Invisible to the programmer (mostly)!


Moore’s Law: 2x transistors every 18 months
Technology => dramatic change

 Processor
 Logic capacity: about 30% increase per year
 Clock rate: about 20% increase per year
Higher logic density gave room for the instruction pipeline & caches

 Memory
 DRAM capacity: about 60% increase per year (4x every 3 years)
 Memory speed: about 10% increase per year
 Cost per bit: about 25% improvement per year
Performance optimization no longer implies smaller programs

 Disk
 Capacity: about 60% increase per year
Computers became lighter and more power efficient
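
As a quick sanity check on how these annual rates compound (the rates are the slide's; the arithmetic below is only an illustration):

# Compound the slide's annual growth rates over a 3-year window.
print(1.60 ** 3)  # DRAM capacity at 60%/yr: ~4.1x, i.e. "4x every 3 years"
print(1.30 ** 3)  # logic capacity at 30%/yr: ~2.2x
print(1.20 ** 3)  # clock rate at 20%/yr:     ~1.7x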


Pentium Die Photo

 3,100,000 transistors
 296 mm²
 60 MHz
 Introduced in 1993
– 1st superscalar implementation of IA-32
Four Organization Concepts/Trends

 Pipelining
– Frequency

 Cache memory
– To keep the processor running, memory latency must be overcome

 Superscalar execution
– Out-of-order execution

 Multi-core
Pipelined Datapath

[Datapath diagram: PC, instruction memory, register array, ALU, and data memory, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipe registers; includes the +4 PC incrementer, sign-extend units, and the zero test for branches.]

 Pipe Registers
– Inserted between stages
– Labeled by preceding & following stage
Pipelining: Use Multiple Resources

[Pipeline diagram, instruction order vs. time (clock cycles): five instructions (Inst 0–4), each flowing through Im (instruction memory), Reg (register read), ALU, Dm (data memory), and Reg (write-back), overlapped one cycle apart so every resource is busy each cycle.]
Conventional Pipelined Execution Representation

[Staggered diagram, program flow vs. time: each instruction passes through IFetch, Dcd, Exec, Mem, WB, with successive instructions offset by one cycle.]

 Improve performance by increasing instruction throughput
 Ideal speedup? N stages and I instructions (see the sketch below)
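
A minimal sketch of the ideal-speedup arithmetic, assuming one cycle per stage and no stalls (the function name and numbers are illustrative):

# Ideal pipeline speedup: I instructions on an N-stage pipeline take
# N + I - 1 cycles (fill the pipe, then one completion per cycle),
# versus N * I cycles without pipelining.
def pipeline_speedup(n_stages: int, n_instructions: int) -> float:
    unpipelined = n_stages * n_instructions
    pipelined = n_stages + n_instructions - 1
    return unpipelined / pipelined

print(pipeline_speedup(5, 1000))  # ~4.98: approaches N = 5 as I grows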
Technology Trends and Performance

[Two log-scale plots, 1980–2007. Speed: logic roughly 2× per 3 years vs. DRAM roughly 1.1× per 3 years, so CPU and memory speeds grow apart. Capacity: logic roughly 4× per 3 years, DRAM roughly 2× per 3 years.]

 Computing capacity: 4× per 3 years
– If we could keep all the transistors busy all the time
– Actual: 3.3× per 3 years
 Moore's Law: performance is doubled every ~18 months
– Trend is slowing: process scaling declines, power is up
Processor-DRAM Memory Gap (latency)

[Log-scale performance plot, 1980–2000: µProc performance ("Moore's Law") improves ~60%/yr (2× per 1.5 years) while DRAM improves only ~9%/yr (2× per 10 years), opening a processor-memory performance gap that grows ~50% per year.]
Cache Memory

 Small, fast storage used to improve the average access time to slow memory.
 Exploits spatial and temporal locality
 In computer architecture, almost everything is a cache!
– Registers: a cache on variables
– First-level cache: a cache on the second-level cache
– Second-level cache: a cache on memory
– Memory: a cache on disk (virtual memory)

[Memory hierarchy diagram, from fastest/smallest to biggest/slowest: Proc/Regs, L1-Cache, L2-Cache, DRAM Memory, Hard drive.]
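
One standard way to quantify "improve average access time" is the average memory access time (AMAT) formula; the sketch below uses made-up latencies and miss rates, not numbers from the slides:

# AMAT = hit time + miss rate * miss penalty, applied at one cache level.
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_time + miss_rate * miss_penalty

# Example: 1-cycle L1 hit, 5% miss rate, 20-cycle penalty to reach L2/memory.
print(amat(1.0, 0.05, 20.0))  # 2.0 cycles on average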
Modern Processor

 This is an AMD Opteron CPU
– Total area: 193 mm²

[Die photo, labeled blocks: Floating-Point Unit, Load/Store Unit, Execution Unit, Fetch/Scan/Align/Micro-code, 64 kB Data Cache, 64 kB Instr. Cache, 1 MB unified instruction/data Level 2 Cache, Bus Unit, Memory Controller, HyperTransport, DDR Memory Interface.]

 Look at the relative sizes of each block:
– 50% cache
– 23% I/O
– 20% CPU logic
– + extra stuff

CPUs dedicate > 50% of their area to cache memory.
CPI – Cycles Per Instruction

 CPUs work according to a clock signal
– Clock cycle is measured in nsec (10⁻⁹ of a second)
– Clock frequency (= 1/clock cycle) is measured in GHz (10⁹ cycles/sec)

 Instruction Count (IC)
– Total number of instructions executed in the program

 CPI – Cycles Per Instruction

CPI = (#cycles required to execute the program) / IC

– Average #cycles per instruction (in a given program)
– IPC (= 1/CPI): instructions per cycle
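
A toy calculation with made-up counts, just to make the definitions concrete:

cycles = 1_200_000_000   # cycles the program took
ic = 800_000_000         # dynamic instruction count

cpi = cycles / ic        # 1.5 cycles per instruction
ipc = 1 / cpi            # ~0.67 instructions per cycle
print(cpi, ipc)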
Relating time to system measures

 Iron Law: Performance = 1 / execution time
 Suppose that for some program we have:
– T seconds running time (the ultimate performance measure)
– C clock ticks, I instructions, P seconds/tick (performance measures of interest to the system designer)

 T secs = C ticks x P secs/tick
        = (I inst / I inst) x C ticks x P secs/tick
 T secs = I inst x (C ticks / I inst) x P secs/tick
        = instruction count x avg clock ticks per instruction (CPI) x clock period
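
A worked instance of the Iron Law with illustrative values (reusing the CPI example above; a 2 GHz clock gives P = 0.5 ns/tick):

ic = 800_000_000   # I: instructions
cpi = 1.5          # C/I: avg clock ticks per instruction
period = 0.5e-9    # P: seconds per tick (2 GHz clock)

t = ic * cpi * period
print(t)           # 0.6 seconds of running time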
Pentium 4 Block Diagram

[Microarchitecture block diagram of the Pentium 4; source: Microprocessor Report.]
Out Of Order Execution

 Look ahead in a window of instructions and find instructions that are ready to execute
– They do not depend on data from previous, still-unexecuted instructions
– Resources are available

 Out-of-order execution
– Start executing an instruction before previous instructions have executed

 Advantages:
– Helps exploit Instruction Level Parallelism (ILP)
– Helps cover latencies (e.g., an L1 data cache miss, a divide)
Data Flow Analysis

 Example:
(1) r1 ← r4 / r7 ; assume divide takes 20 cycles
(2) r8 ← r1 + r2
(3) r5 ← r5 + 1
(4) r6 ← r6 - r3
(5) r4 ← r5 + r6
(6) r7 ← r8 * r4

[Data flow graph: 1 → 2 (r1), 3 → 5 (r5), 4 → 5 (r6), 2 → 6 (r8), 5 → 6 (r4). In-order execution serializes everything behind the 20-cycle divide; out-of-order execution lets 3, 4, and 5 complete while the divide is still running, so only 2 and 6 wait on it. A sketch of this analysis in code follows.]
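
Below is a small, illustrative flow-dependence scheduler for the example above (the data layout and latencies are assumptions, not from the slide; it models unlimited execution resources and ignores false dependencies, as if registers were already renamed):

# Each instruction: registers read, register written, latency in cycles.
insts = {
    1: ({"r4", "r7"}, "r1", 20),   # r1 <- r4 / r7 (divide: 20 cycles)
    2: ({"r1", "r2"}, "r8", 1),
    3: ({"r5"},       "r5", 1),
    4: ({"r6", "r3"}, "r6", 1),
    5: ({"r5", "r6"}, "r4", 1),
    6: ({"r8", "r4"}, "r7", 1),
}

finish = {}
for i, (reads, _, lat) in insts.items():
    # Flow dependences: earlier instructions that write one of our sources.
    deps = [j for j in insts if j < i and insts[j][1] in reads]
    start = max((finish[j] for j in deps), default=0)
    finish[i] = start + lat

print(finish)  # {1: 20, 2: 21, 3: 1, 4: 1, 5: 2, 6: 22}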
OOOE – General Scheme

[Diagram: Fetch & Decode (in-order) feeds an instruction pool; Execute runs out-of-order from the pool; Retire (commit) drains it in-order.]

 Fetch & decode instructions in parallel but in order, to fill the instruction pool
 Execute ready instructions from the instruction pool
– All the data required for the instruction is ready
– Execution resources are available
 Once an instruction is executed
– Signal all dependent instructions that its data is ready
 Commit instructions in parallel but in order
– Can commit an instruction only after all preceding instructions (in program order) have committed
Out Of Order Execution – Example

 Assume that executing a divide operation takes 20 cycles

(1) r1 ← r5 / r4
(2) r3 ← r1 + r8
(3) r8 ← r5 + 1
(4) r3 ← r7 - 2
(5) r6 ← r6 + r7

[Dependence graph: 1 → 2 (flow, r1); 2 → 3 (anti, r8); 2 → 4 (output, r3); 5 is independent.]

 Inst2 has a REAL (flow) dependency on r1 with Inst1
– It cannot be executed in parallel with Inst1

 Can successive instructions pass Inst2?
– Inst3 cannot, since Inst2 must read r8 before Inst3 writes to it
– Inst4 cannot, since it must write to r3 after Inst2
– Inst5 can
Overcoming False Dependencies

 OOOE creates new dependencies
– WAR (write after read): write to a register which is read by an earlier inst.
(1) r3 ← r2 + r1
(2) r2 ← r4 + 3
– WAW (write after write): write to a register which is written by an earlier inst.
(1) r3 ← r1 + r2
(2) r3 ← r4 + 3

 These are false dependencies
– There is no missing data
– They still prevent executing instructions out-of-order

 Solution: Register Renaming (sketch below)
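
A minimal sketch of register renaming (the table and names are illustrative, not a real pipeline's structures): every destination gets a fresh physical register, so the WAR and WAW pairs above no longer conflict.

# Rename architectural destination registers to fresh physical registers.
def rename(insts):
    rat = {}                                   # register alias table
    renamed, next_phys = [], 0
    for dst, srcs in insts:
        srcs = [rat.get(s, s) for s in srcs]   # sources read current mapping
        rat[dst] = f"p{next_phys}"             # fresh physical destination
        next_phys += 1
        renamed.append((rat[dst], srcs))
    return renamed

# The WAW example: both writes to r3 get distinct physical registers,
# so the two instructions may now execute in either order.
print(rename([("r3", ["r1", "r2"]), ("r3", ["r4", "3"])]))
# -> [('p0', ['r1', 'r2']), ('p1', ['r4', '3'])]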


Pentium-4 Die Photo

 EBL/BBL - Bus logic, Front, Back
 MOB - Memory Order Buffer
 Packed FPU - MMX Fl. Pt. (SSE)
 IEU - Integer Execution Unit
 FAU - Fl. Pt. Arithmetic Unit
 MIU - Memory Interface Unit
 DCU - Data Cache Unit
 PMH - Page Miss Handler
 DTLB - Data TLB
 BAC - Branch Address Calculator
 RAT - Register Alias Table
 SIMD - Packed Fl. Pt.
 RS - Reservation Station
 BTB - Branch Target Buffer
 IFU - Instruction Fetch Unit (+I$)
 ID - Instruction Decode
 ROB - Reorder Buffer
 MS - Micro-instruction Sequencer

1st Pentium 4: 9.5 M transistors
An Exciting Time for CE - CMPs

[Diagram contrasting a monolithic uniprocessor (one large core with its L2$) with a chip multi-processor (several processor cores sharing an L2$, backed by a shared L3$).]

Computer System Structure

[System diagram: the CPU (with cache) connects over the CPU bus to the North Bridge, whose memory controller drives two DDRII memory channels over the memory bus, a PCI Express ×16 link to an external graphics card, and the on-board graphics controller. The North Bridge links to the South Bridge (I/O controller), which provides PCI Express ×1 plus USB, IDE, SATA, and PCI controllers; attached peripherals include a sound card, LAN adapter, parallel and serial ports, floppy drive, mouse, keyboard, DVD drive, hard disk, and speakers.]