Lecture27connors PDF

Modern Computer Architecture
(Processor Design)
Prof. Dan Connors

dconnors@colostate.edu
Computer Architecture
Historic definition
Computer Architecture =
Instruction Set Architecture +
Computer Organization
Famous architects: Wright, Fuller, Herzog

Famous computer architects: Smith, Patt, Hwu, Hennessey,
Patterson
Instruction Set Architecture
Important acronym: ISA

The low-level software interface to the machine

Assembly language of the machine
Must translate any programming language into this language
Examples: IA-32 (Intel instruction set), MIPS, SPARC, Alpha, PA-
RISC, PowerPC,
Visible to programmer
Differences between ISAs
Much more is similar between ISAs than different.
Compare MIPS & x86:
Instructions:
same basic types
different names and variable-length encodings
x86 branches use condition codes (like MIPS floating point)
x86 supports (register + memory) -> (register) format
Registers:
Register-based architecture
different number and names, x86 allows partial reads/writes
Memory:
Byte addressable, 32-bit address space
x86 has additional addressing modes and many instruction
types can access memory
RISC vs. CISC
MIPS was one of the first RISC architectures. It was started about 20
years ago by John Hennessy, one of the authors of our textbook.
The architecture is similar to that of other RISC architectures, including
Suns SPARC, IBM and Motorolas PowerPC, and ARM-based
processors.
Older processors used complex instruction sets, or CISC architectures.
Many powerful instructions were supported, making the assembly language
programmers job much easier.
But this meant that the processor was more complex, which made the
hardware designers life harder.
Many new processors use reduced instruction sets, or RISC
architectures.
Only relatively simple instructions are available. But with high-level
languages and compilers, the impact on programmers is minimal.
On the other hand, the hardware is much easier to design, optimize, and
teach in classes.
Even most current CISC processors, such as Intel 8086-based chips,
are now implemented using a lot of RISC techniques.
Tracing an Instruction's Execution
Step 1
Step 2
to controller
Step 3
Application
Software
Operating
System
Compiler Firmware
Instruction Set
Architecture
Instr. Set Proc. I/O system
Datapath & Control
Digital Design Hardware
Circuit Design
Layout
Computer Organization
Computer organization is the implementation of the
machine, consisting of two components:
High-level organization
Memory system widths, bus structure, CPU internal organization,
Hardware
Precise logic and circuit design, mechanical packaging,
Many implementations are possible for an ISA !!!

Intel i386, i486, Pentium, Pentium II, Core2Duo, QuadCore
Performance, complexity, cost differences.
Invisible to the programmer (mostly)!

Moores Law: 2x transistors every 18 months
Technology => dramatic change
Processor
logic capacity: about 30% increase per year
clock rate: about 20% increase per year
Higher logic density gave room for instruction pipeline & cache
Memory
DRAM capacity: about 60% increase per year (4x every 3 years)
Memory speed: about 10% increase per year
Cost per bit: about 25% improvement per year
Performance optimization no longer implies smaller programs
Disk
Capacity: about 60% increase per year
Computers became lighter and more power efficient

Pentium Die Photo
3,100,000 transistors
296 mm2
60 MHz
Introduced in 1993
1st superscalar
implementation of IA32
Four Organization Concepts/Trends
Pipelining
Frequency
Cache memory
To keep processor running must overcome memory latency
Superscalar execution
Out of order execution
Multi-core
Pipelined Datapath
IF/ID ID/EX EX/MEM MEM/WB
Zero
Test
Adata
Instr datIn
15:0
Xtnd Data
20:0
Xtnd << 2 Mem.
25:21 datOut
regA datA addr
aluA
20:16 regB
P Instr. Reg.
Array
C Mem. datW ALU ALUout
regW datB aluB
20:13
4:0
Wdest
25:21
IncrPC
+4
Wdata
Pipe Registers
Inserted between stages
Labeled by preceding & following stage
Pipelining: Use Multiple Resources
Time (clock cycles)
ALU
Im Reg Dm Reg
n Inst 0
s
ALU
t Inst 1 Im Reg Dm Reg
r.
Inst 2
ALU
O Im Reg Dm Reg
r
d Inst 3
ALU
Im Reg Dm Reg
e
r
Inst 4
ALU
Im Reg Dm Reg
Conventional Pipelined Execution Representation
Time
IFetch Dcd Exec Mem WB

Program Flow
Improve performance by increasing instruction

throughput
Ideal speedup? N Stages and I instructions
Technology Trends and Performance
1000 1000000
Speed Capacity
Logic 100000 Logic
DRAM
100 10000 DRAM
CPU speed and
2 in 3 years Memory speed 1000 4 in 3 years
10
grow apart 100
1.1 in 3 years 2 in 3 years
10
1 1
80
83
86
89
92
95
98
01
04
07
80
83
86
89
92
95
98
01
04
07
19
19
19
19
19
19
19
20
20
20
19
19
19
19
19
19
19
20
20
20
Computing capacity: 4 per 3 years

If we could keep all the transistors busy all the time
Actual: 3.3 per 3 years
Moores Law: Performance is doubled every ~18
months
Trend is slowing: process scaling declines, power is up
Processor-DRAM Memory Gap (latency)
1000 Proc
CPU
60%/yr.
Moores Law (2X/1.5yr)
100 Processor-Memory
Performance
Performance Gap:
(grows 50% / year)
10
DRAM
DRAM 9%/yr.
(2X/10 yrs)
1
1989
1986
1988
1990
1992
1994
1996
1998
1980
1981
1982
1983
1984
1985
1987
1991
1993
1995
1997
1999
2000
Time
Cache Memory
Small, fast storage used to improve average access time to
slow memory.
Exploits spatial and temporal locality
In computer architecture, almost everything is a cache!
Registers a cache on variables
First-level cache a cache on second-level cache
Second-level cache a cache on memory
Memory a cache on disk (virtual memory)
Proc/Regs
L1-Cache
Bigger L2-Cache Faster
DRAM Memory
Harddrive
Modern Processor
This is an Floating-Point Unit

AMD Operton CPU 64 kB
Load/Store Data
Cache
Total Area: 193 mm2 1 MB Unified
Execution
Bus
Unit
Unit Instruction/Data
Look at the relative Level 2 Cache
Fetch Scan
sizes of each block: 64 kB
Align
Instr.
50% cache Micro-code
Cache
23% I/O
Memory Controller
20% CPU logic
+ extra stuff HyperTransport DDR Memory Interface
CPUs dedicate > 50% area for cache memory.

memory
CPI Cycles Per Instruction
CPUs work according to a clock signal
Clock cycle is measured in nsec (10-9 of a second)
Clock frequency (= 1/clock cycle) measured in GHz
(109cyc/sec)
Instruction Count (IC)

Total number of instructions executed in the program
#cycles required to execute the program
CPI =
IC
CPI Cycles Per Instruction
Average #cycles per Instruction (in a given program)
IPC (= 1/CPI) : Instructions per cycles
Relating time to system measures
Iron Law: Performance = 1/execution time
Suppose that for some program we have:
T seconds running time (the ultimate performance
measure)
C clock ticks, I instructions, P seconds/tick
(performance measures of interest to the system
designer)
T secs = C ticks x P secs/tick
= (I inst/I inst) x C ticks x P secs/tick
T secs = I inst x (C ticks/I inst) x P secs/tick
running
instruction avg clock
time clock period
count ticks per
instruction
(CPI)
Pentium 4 Block Diagram
Microprocessor Report
Out Of Order Execution
Look ahead in a window of instructions and find instructions
that are ready to execute
Dont depend on data from previous instructions still not
executed
Resources are available
Out-of-order execution
Start instruction execution before execution of a previous
instructions
Advantages:
Help exploit Instruction Level Parallelism (ILP)
Help cover latencies (e.g., L1 data cache miss, divide)
Data Flow Analysis
Example:
(1) r1 r4 / r7 ; assume divide takes 20 cycles
(2) r8 r1 + r2
(3) r5 r5 + 1
(4) r6 r6 - r3 In-order execution
(5) r4 r5 + r6 1 2
(6) r7 r8 * r4 3 5 6
4
Data Flow Graph
1 3 4
Out-of-order execution
r1 r5 r6 1
2 5 3 5 2 6
r8 r4 4
6
OOOE General Scheme
Fetch & Instruction Retire
Decode pool (commit)
In-order In-order
Execute
Out-of-order
Fetch & decode instructions in parallel but in order, to fill inst. pool
Execute ready instructions from the instructions pool
All the data required for the instruction is ready
Execution resources are available
Once an instruction is executed
signal all dependant instructions that data is ready
Commit instructions in parallel but in-order
Can commit an instruction only after all preceding instructions (in
program order) have committed
Out Of Order Execution Example
Assume that executing a divide operation takes 20 cycles

(1) r1 r5 / r4 1 5
(2) r3 r1 + r8
(3) r8 r5 + 1 2
(4) r3 r7 - 2
3 4
(5) r6 r6 + r7
Inst2 has a REAL (flow) dependency on r1 with Inst1

It cannot be executed in parallel with Inst1
Can successive instructions pass Inst2 ?

Inst3 cannot since Inst2 must read r8 before Inst3 writes to it
Inst4 cannot since it must write to r3 after Inst2
Inst5 can
Overcoming False Dependencies
OOOE creates new dependencies
WAR (write after read): write to a register which is read by an
earlier inst.
(1) r3 r2 + r1
(2) r2 r4 + 3
WAW (write after write): write to a register which is written by an

earlier inst.
(1) r3 r1 + r2
(2) r3 r4 + 3
These are false dependencies

There is no missing data
Still prevent executing instructions out-of-order
Solution: Register Renaming

Pentium-4 Die Photo
EBL/BBL - Bus logic, Front, Back
MOB - Memory Order Buffer
Packed FPU - MMX Fl. Pt. (SSE)
IEU - Integer Execution Unit
FAU - Fl. Pt. Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Fl. Pt.
RS - Reservation Station
BTB - Branch Target Buffer
IFU - Instruction Fetch Unit (+I$)
ID - Instruction Decode
ROB - Reorder Buffer
1st Pentium4: 9.5 M transistors MS - Micro-instruction Sequencer
An Exciting Time for CE - CMPs
Processor cores
L2$
Shared
L2$
L3$
Shared
L3$
Monolithic uniprocessor Chip multi-processor

Computer System Structure
External
Graphics
Card
PCI express 16
North Bridge
Cache DDRII
CPU BUS On-board Memory Channel 1
CPU Graphics controller Mem BUS
DDRII
Channel 2
South Bridge PCI express 1
IO Controller
USB IDE SATA PCI
controller controller controller
Sound Lan
Parallel Port
Card Adap
Serial Port
Floppy mouse DVD Hard

keybrd
Drive Drive Disk
speakers LAN

Lecture27connors PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture27connors PDF

Uploaded by

Copyright:

Available Formats

Modern Computer Architecture

Prof. Dan Connors

Instruction Set Architecture +

Famous architects: Wright, Fuller, Herzog

Important acronym: ISA

The low-level software interface to the machine

Many implementations are possible for an ISA !!!

Invisible to the programmer (mostly)!

Performance optimization no longer implies smaller programs

Computers became lighter and more power efficient

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

Improve performance by increasing instruction

Computing capacity: 4 per 3 years

This is an Floating-Point Unit

CPUs dedicate > 50% area for cache memory.

Instruction Count (IC)

Assume that executing a divide operation takes 20 cycles

Inst2 has a REAL (flow) dependency on r1 with Inst1

Can successive instructions pass Inst2 ?

WAW (write after write): write to a register which is written by an

These are false dependencies

Solution: Register Renaming

Monolithic uniprocessor Chip multi-processor

South Bridge PCI express 1

Floppy mouse DVD Hard

You might also like