
Computer Architecture 2012 Introduction (lec1)

Computer Architecture
(MAMAS, 234267)
Spring 2012

Lecturer: Dan Tsafrir
Reception: Mon 18:30, Taub 611

12/3/2012
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
General Info
Grade
20% Exercise (mandatory)
80% Final exam

Textbook
Computer Architecture: A Quantitative Approach (4th Edition)
by: Hennessy & Patterson

Other course information
Course web site:
http://webcourse.cs.technion.ac.il/234267/Spring2012
Lectures will be uploaded to the web a day before the class
Computer System Structure
Classical Motherboard Diagram
[Figure: block diagram of a classical motherboard. Labeled components: CPU, cache, CPU bus, North Bridge, memory controller, IOMMU, on-board graphics, memory bus, DDR2 or DDR3 channels 1 and 2, PCI express 2.0, external graphics card, South Bridge, IO controller, PCI, PCI express 1, USB controller, SATA controller, LAN adapter, LAN, sound card, speakers, hard disk, DVD drive, floppy drive, parallel port, serial port, keyboard, mouse.]
More to the north = closer to the CPU = faster
Intel Core 2
Northbridge = MCH = memory controller hub
Southbridge = ICH = I/O controller hub
Notice bandwidths
65 to 45 nm
Intel Nehalem Core i3 i5 i7
For high-end i-Series chips, Northbridge functionality moved onto the processor (=> made faster)
45 to 32 nm
Intel Sandy Bridge Core i3 i5 i7
The trend continues
32 to 22 nm
Course Focus
Start from CPU (=processor)
Instruction set, performance
Pipeline, hazards
Branch prediction
Out-of-order execution
Move on to Memory Hierarchy
Caching
Main memory
Virtual Memory
Move on to PC Architecture
Motherboard & chipset, DRAM, I/O, Disk, peripherals
End with some Advanced Topics
The Processor
Architecture vs. Microarchitecture
Architecture:
= The processor features as seen by its user
= Interface
Instruction set, number of registers, addressing modes, ...

Microarchitecture:
= Manner by which the processor is implemented
= Implementation details
Cache size and structure, number of execution units, ...

Note: different processors with different u-archs can support the same arch
Example: Intel Pentium-IV vs. Intel Core2 Duo

We will address both

Why Should We Care?

Abstractions enhance productivity, so:
If we know the arch (=interface),
Why should we care about the u-arch (=internals)?

Same goes for arch
Just details for a programmer of a high-level language

Abstractions only work so long as what's below works
The taxi story: http://vimeo.com/11478146 (4:50-6:00)

Recent Processor Trends
Source: http://www.scidacreview.org/0904/html/multicore.html
Well-Known Moore's Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
The Story in a Nutshell
[Figure: processor trends over time, with series for transistors (1000s), clock speed (MHz), power (W), and instructions/cycle (ILP)]
Took the Industry by Surprise
Dire Implications: Performance
Dire Implications: Sales
Dire Implications: Sales
Dire Implications: Programmers
Supercomputing: Top 500 list
Dire Implications: Supercomputing
Processor Performance
Metrics: IC, CPI, IPC
CPUs work according to a clock signal
Clock cycle: measured in nanoseconds ($10^{-9}$ of a second)
Clock frequency = 1/|clock cycle|: in GHz ($10^{9}$ cycles/sec)

Instruction Count (IC)
Total number of instructions executed in the program

Cycles Per Instruction (CPI)
Average #cycles per instruction (in a given program):
$$\mathrm{CPI} = \frac{\#\text{cycles required to execute the program}}{\mathrm{IC}}$$

IPC (= 1/CPI): instructions per cycle
Can be > 1; see "The Story in a Nutshell" slide
Minimizing Execution Time
CPU Time - time required to execute a program

CPU Time = IC × CPI × clock cycle


Our goal:
minimize CPU Time (by reducing any of the above components)
Minimize clock cycle: increase GHz (processor design)
Minimize CPI: u-arch (e.g.: more execution units)
Minimize IC: arch + u-arch (e.g.: SSE™)

SSE = Streaming SIMD Extensions (Intel)
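
As a rough illustration of the relation CPU Time = IC × CPI × clock cycle, the Python sketch below evaluates it for made-up numbers. The function names `cpu_time` and `ipc` and all input values are hypothetical, chosen only for this example.

```python
def cpu_time(ic, cpi, clock_hz):
    """CPU time in seconds: IC * CPI * clock cycle, with clock cycle = 1 / frequency."""
    clock_cycle = 1.0 / clock_hz
    return ic * cpi * clock_cycle


def ipc(cpi):
    """Instructions per cycle is the reciprocal of cycles per instruction."""
    return 1.0 / cpi


# Hypothetical program: 2 billion instructions, CPI of 1.5, 2 GHz clock.
baseline = cpu_time(ic=2e9, cpi=1.5, clock_hz=2e9)

# Same program after a hypothetical u-arch change lowers CPI to 1.2.
improved = cpu_time(ic=2e9, cpi=1.2, clock_hz=2e9)

print(f"baseline: {baseline:.2f} s, IPC = {ipc(1.5):.2f}")
print(f"improved: {improved:.2f} s, IPC = {ipc(1.2):.2f}")
```

Because the three factors multiply, reducing any one of them by some ratio reduces CPU time by the same ratio, provided the other two stay unchanged.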
Alternative Way to Calculate CPI
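$\mathrm{IC}_i$ = #times an instruction of type i is executed in the program
IC = #instructions executed in the program:
$$\mathrm{IC} = \sum_{i=1}^{n} \mathrm{IC}_i$$
$F_i$ = relative frequency of type-i instructions = $\mathrm{IC}_i / \mathrm{IC}$
$\mathrm{CPI}_i$ = #cycles to execute a type-i instruction
e.g.: $\mathrm{CPI}_{add} = 1$, $\mathrm{CPI}_{mul} = 3$
#cycles required to execute the program:
$$\#\mathrm{cyc} = \sum_{i=1}^{n} \mathrm{CPI}_i \cdot \mathrm{IC}_i$$
CPI:
$$\mathrm{CPI} = \frac{\#\mathrm{cyc}}{\mathrm{IC}} = \frac{\sum_{i=1}^{n} \mathrm{CPI}_i \cdot \mathrm{IC}_i}{\mathrm{IC}} = \sum_{i=1}^{n} \mathrm{CPI}_i \cdot \frac{\mathrm{IC}_i}{\mathrm{IC}} = \sum_{i=1}^{n} \mathrm{CPI}_i \cdot F_i$$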
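
The per-type calculation can be sketched in Python: total cycles are the sum of CPI_i × IC_i over all instruction types, and the overall CPI is either that total divided by IC or the frequency-weighted sum of the CPI_i. The instruction mix below is hypothetical; only CPI_add = 1 and CPI_mul = 3 come from the slide, and the function names are illustrative.

```python
def cpi_from_counts(mix):
    """mix maps instruction type -> (IC_i, CPI_i). Returns (#cycles, IC, CPI)."""
    ic = sum(count for count, _ in mix.values())
    cycles = sum(count * cpi_i for count, cpi_i in mix.values())
    return cycles, ic, cycles / ic


def cpi_from_frequencies(mix):
    """Same CPI, computed as the sum of CPI_i * F_i with F_i = IC_i / IC."""
    ic = sum(count for count, _ in mix.values())
    return sum(cpi_i * (count / ic) for count, cpi_i in mix.values())


# Hypothetical mix: counts and the load CPI are made up for illustration.
mix = {
    "add": (600, 1),
    "mul": (100, 3),
    "load": (300, 2),
}

cycles, ic, cpi = cpi_from_counts(mix)
print(cycles, ic, cpi)            # 1500 cycles over 1000 instructions
print(cpi_from_frequencies(mix))  # agrees with cpi up to floating-point rounding
```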
Performance Evaluation: How?

No simple answer

Performance depends on
Application
Input

Mathematical analysis
Typically impossible

What to do?
Benchmarks

Use benchmarks & measure how long it takes
Use real applications (=> no absolute answers)

Preferably standardized benchmarks (+input), e.g.,
SPEC INT: integer apps
Compression, C compiler, Perl, text processing, ...
SPEC FP: floating point apps (mostly scientific)
TPC benchmarks: measure transaction throughput (DB)
SPEC JBB: models wholesale company (Java server, DB)

Sometimes you see FLOPS (peak or sustained)
Used for supercomputers (Top500 list), measured against LINPACK
Evaluating Performance
Use a performance simulator to evaluate the performance of a new feature / algorithm
Models the u-arch in great detail
Run 100s of representative applications
Produce the performance s-curve
Sort the applications according to the IPC increase
Baseline (0%) is the processor without the new feature

[Figure: two example S-curves of per-application IPC change relative to the baseline. Bad S-curve: negative outliers and positive outliers. Good S-curve: small negative outliers and positive outliers.]
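
A minimal sketch of how an S-curve could be produced from simulator output: compute each application's IPC change relative to the baseline, then sort. The application names and IPC values are hypothetical, and `s_curve` is an illustrative helper, not part of any real simulator.

```python
def s_curve(baseline_ipc, feature_ipc):
    """Return [(app, % IPC change vs. baseline)] sorted from worst to best."""
    changes = []
    for app, base in baseline_ipc.items():
        delta_percent = (feature_ipc[app] - base) / base * 100.0
        changes.append((app, delta_percent))
    return sorted(changes, key=lambda pair: pair[1])


# Hypothetical IPC per application, without and with the new feature.
baseline_ipc = {"app_a": 1.00, "app_b": 2.00, "app_c": 0.50, "app_d": 1.25}
feature_ipc = {"app_a": 1.02, "app_b": 1.96, "app_c": 0.53, "app_d": 1.25}

curve = s_curve(baseline_ipc, feature_ipc)
for app, delta in curve:
    print(f"{app}: {delta:+.1f}%")

# Applications that lose performance form the negative tail of the curve.
negative_outliers = [app for app, delta in curve if delta < 0]
```

Plotting the sorted changes gives the S shape; a feature looks good when the negative tail is short and shallow.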
Amdahl's Law

Suppose we accelerate the computation such that
P = proportion of computation we make faster
S = speedup experienced by the proportion we improved

For example
If an improvement can speed up 40% of the computation
=> P = 0.4
If the improvement makes the portion run twice as fast
=> S = 2

Then overall speedup = $\dfrac{1}{(1 - P) + \frac{P}{S}}$
Amdahl's Law - Example

FP operations improved to run 2x faster
S = 2, but
P = 0.1 (only affects 10% of the program)
Speedup:
$$\frac{1}{(1 - P) + \frac{P}{S}} = \frac{1}{(1 - 0.1) + \frac{0.1}{2}} = \frac{1}{0.95} \approx 1.053$$

Conclusion
Better to make the common case fast
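
Amdahl's Law, overall speedup = 1 / ((1 - P) + P/S), can be checked numerically with a short Python sketch. The first call uses the slide's FP example (P = 0.1, S = 2); the second uses a hypothetical P = 0.9 for contrast. The function name `amdahl_speedup` is illustrative.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a proportion p of the computation runs s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Slide example: FP operations run 2x faster but are only 10% of the program.
fp_case = amdahl_speedup(p=0.1, s=2)

# Hypothetical contrast: the same 2x improvement applied to 90% of the program.
common_case = amdahl_speedup(p=0.9, s=2)

print(f"P=0.1, S=2 -> {fp_case:.3f}")
print(f"P=0.9, S=2 -> {common_case:.3f}")
```

The same local speedup pays off far more when it applies to the common case.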
Amdahl's Law - Parallelism

When parallelizing a program
P = proportion of program that can be made parallel
1 - P = inherently serial
N = number of processing elements (say, cores)
Speedup:
$$\frac{1}{(1 - P) + \frac{P}{N}}$$

Serial component imposes a hard limit:
$$\lim_{N \to \infty} \left( \frac{1}{(1 - P) + \frac{P}{N}} \right) = \frac{1}{1 - P}$$
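
To see the hard limit imposed by the serial component, the sketch below evaluates the parallel speedup 1 / ((1 - P) + P/N) for a growing number of cores and compares it with the bound 1 / (1 - P). The value P = 0.95 and the core counts are hypothetical, and the function names are illustrative.

```python
def parallel_speedup(p, n):
    """Speedup with n processing elements when a proportion p is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p):
    """Upper bound on speedup as n grows without bound (requires p < 1)."""
    return 1.0 / (1.0 - p)


p = 0.95  # hypothetical: 95% of the program can be made parallel
core_counts = (1, 2, 8, 64, 1024)
speedups = [parallel_speedup(p, n) for n in core_counts]

for n, s in zip(core_counts, speedups):
    print(f"N={n:5d}: speedup {s:6.2f}")
print(f"limit as N grows: {speedup_limit(p):.2f}")
```

Even with 1024 cores the speedup stays below 1 / (1 - P), which is about 20 for this choice of P.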
Instruction Set Design
The ISA is what the user & compiler see
The HW implements the ISA
[Figure: the instruction set drawn as the interface between software and hardware]
Considerations in ISA Design
Instruction size
Long instructions take more time to fetch from memory
Longer instructions require a larger memory
Important for small (embedded) devices, e.g., cell phones

Number of instructions (IC)
Reduce IC => reduce runtime (at a given CPI & frequency)

Virtues of instruction simplicity
Simpler HW allows for: higher frequency & lower power
Optimization can be applied better to simpler code
Cheaper HW
Basing Design Decisions on Workload
Immediate argument size in bits (histogram)
[Figure: histogram of immediate sizes. x-axis: immediate data bits (0 to 15); y-axis: 0% to 30%; series: Int. Avg. and FP Avg.]
1% of data values > 16 bits
Having 16 bits is likely good enough
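
A workload-driven decision like this starts from a measurement. The sketch below shows one way such a histogram could be gathered: count the bits each immediate needs and check what fraction fits in a candidate field width. It assumes signed two's-complement immediates (so the sign bit is counted), which the slide does not specify; the sample values and function names are hypothetical.

```python
from collections import Counter


def bits_needed(value):
    """Minimum two's-complement width, in bits, that can hold a signed value."""
    if value >= 0:
        return value.bit_length() + 1   # magnitude bits plus a sign bit
    return (~value).bit_length() + 1    # e.g. -128 fits in 8 bits


def fraction_fitting(immediates, width):
    """Fraction of immediates that fit in a field of the given width."""
    fitting = sum(1 for v in immediates if bits_needed(v) <= width)
    return fitting / len(immediates)


# Hypothetical immediates, standing in for values collected from real programs.
sample = [0, 1, -1, 4, 8, -16, 100, 255, -300, 4096, 32767, 70000]

histogram = Counter(bits_needed(v) for v in sample)
for width in sorted(histogram):
    print(f"{width:2d} bits: {histogram[width]} value(s)")

print(f"fit in 16 bits: {fraction_fitting(sample, 16):.0%}")
```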
CISC Processors
CISC - Complex Instruction Set Computer
Example: x86
The idea: a high level machine language
Back when people programmed in assembly, CISC was supposedly easier

Characteristics
Many instruction types, with many addressing modes
Some of the instructions are complex
Execute complex tasks
Require many cycles
ALU operations directly on memory (e.g., arr[j] = arr[i]+n)
Registers not used (and, accordingly, only a few registers exist)
Variable length instructions
Common instructions get short codes => save code length
But It Turns Out

Rank  Instruction              % of total executed
1     load                     22%
2     conditional branch       20%
3     compare                  16%
4     store                    12%
5     add                      8%
6     and                      6%
7     sub                      5%
8     move register-register   4%
9     call                     1%
10    return                   1%
      Total                    96%

Simple instructions dominate instruction frequency
CISC Drawbacks
Complex instructions and complex addressing modes
complicate the processor
slow down the simple, common instructions
contradict "Make the Common Case Fast"
Compilers don't use complex instructions / indexing methods
Variable length instructions are a real pain in the neck
Difficult to decode several instructions in parallel
As long as an instruction is not decoded, its length is unknown
It is unknown where the instruction ends
It is unknown where the next instruction starts
An instruction may span more than a single cache line
An instruction may span more than a single page
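
The decoding problem can be illustrated with a toy model. The sketch below assumes a made-up encoding in which the first byte of each instruction holds its total length in bytes; this is not the x86 encoding, only a stand-in that shares the relevant property. The 4-byte fixed width and 4-byte "cache line" are likewise hypothetical.

```python
def variable_length_starts(code):
    """Find instruction start offsets; each start depends on decoding the previous one."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]  # toy rule: first byte = total instruction length
        if length < 1:
            raise ValueError("invalid instruction length")
        offset += length
    return starts


def fixed_length_starts(code, width=4):
    """With fixed-length instructions, every start is known without decoding."""
    return list(range(0, len(code), width))


def crosses_line(start, length, line_size):
    """True if an instruction occupies bytes in more than one line."""
    return start // line_size != (start + length - 1) // line_size


# Toy byte stream: instructions of length 2, 1, 3, 4, 1, 1.
code = bytes([2, 0xAA, 1, 3, 0xBB, 0xCC, 4, 0x01, 0x02, 0x03, 1, 1])

starts = variable_length_starts(code)
crossing = [s for s in starts if crosses_line(s, code[s], line_size=4)]
print(starts)
print(fixed_length_starts(code))
print(crossing)
```

In the variable-length loop each iteration needs the result of the previous one, which is what makes decoding several instructions in parallel hard; the fixed-length starts are a simple arithmetic sequence.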
RISC Processors
RISC - Reduced Instruction Set Computer
The idea: simple instructions enable fast hardware
Characteristics
A small instruction set, with only a few instruction formats
Simple instructions
execute simple tasks
Most of them require a single cycle (with pipeline)
A few indexing methods
ALU operations on registers only
Memory is accessed using Load and Store instructions only
Many orthogonal registers
Three address machine: Add dst, src1, src2
Fixed length instructions

Examples: MIPS™, Sparc™, Alpha™, Power™
RISC Processors (Cont.)
Simple arch => simple u-arch
Room for larger on-die caches
Smaller => faster
Easier to design & validate (=> cheaper to manufacture)
Shorten time-to-market
More general-purpose registers (=> fewer memory refs)

Compiler can be smarter
Better pipeline usage
Better register allocation

Existing RISC processors are not "pure" RISC
e.g., support division which takes many cycles
Compilers and ISA
Ease of compilation
Orthogonality:
no special registers
few special cases
all operand modes available with any data type or instruction type
Regularity:
no overloading for the meanings of instruction fields
streamlined
resource needs easily determined

Register assignment is critical too
Easier if lots of registers

Still, CISC Is Dominant
x86 (CISC) dominates the processor market

Legacy
A vast amount of existing software
Intel, AMD, Microsoft benefit
But they invest a lot of money to compensate for the disadvantages

Internally, the CISC arch is implemented with RISC-like operations
Starting at Pentium II and K6, x86 processors translate
CISC instructions into RISC-like operations internally
The inside of the core looks much like that of a RISC processor
Software Specific Extensions
Extend arch to accelerate exec of specific apps

Example: SSE™ - Streaming SIMD Extensions
128-bit packed (vector) / scalar single-precision FP (4×32)
Introduced with the Pentium III in 1999
8 new 128-bit registers (XMM0 - XMM7)
Accelerates graphics, video, scientific calculations, ...

Packed (128 bits):
    x0      x1      x2      x3
  + y0      y1      y2      y3
  = x0+y0   x1+y1   x2+y2   x3+y3

Scalar (128 bits):
    x0      x1      x2      x3
  + y0      y1      y2      y3
  = x0+y0   y1      y2      y3
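
The difference between the packed and scalar forms can be modeled in plain Python, treating a 128-bit register as a list of four single-precision lanes. This only mimics the lane semantics drawn on the slide (it gains no SIMD speed), and the function names and values are hypothetical.

```python
def packed_add(x, y):
    """Packed form: all four lanes are added in one operation."""
    return [a + b for a, b in zip(x, y)]


def scalar_add(x, y):
    """Scalar form: only lane 0 is added; lanes 1..3 are taken from y, as on the slide."""
    return [x[0] + y[0]] + list(y[1:])


# Each list stands for one 128-bit register holding 4 x 32-bit FP values.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

print(packed_add(x, y))  # [11.0, 22.0, 33.0, 44.0]
print(scalar_add(x, y))  # [11.0, 20.0, 30.0, 40.0]
```

One packed instruction does the work of four scalar additions, which is how such extensions reduce IC for data-parallel code.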
BACKUP
Compatibility
Backward compatibility (HW responsibility)
When buying new hardware, it can run existing software:
i5 can run SW written for Core2 Duo, Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286

BTW:

Forward compatibility (SW responsibility)
For example: MS Word 2003 can open MS Word 2010 doc
Commonly supports one or two generations behind

Architecture-independent SW
Run SW on top of a VM that does JIT (just-in-time) compilation:
JVM for Java and CLR for .NET
Interpreted languages: Perl, Python