
Trends in High-Performance Computer Architecture

David J. Lilja
Department of Electrical Engineering
Center for Parallel Computing
University of Minnesota, Minneapolis
E-mail:    Phone: 625-5007    FAX: 625-4583

Lilja
University of Minnesota
April 1996

Trends and Predictions

Trend, n. 1: direction of movement: FLOW 2 a: a prevailing tendency or inclination: DRIFT 2 b: a general movement: SWING 2 c: a current style or preference: VOGUE 2 d: a line of development: APPROACH

Webster’s Dictionary

It is very difficult to make an accurate prediction, especially about the future.

Niels Bohr

Historical Trends and Perspective

pre-WW II: Mechanical calculating machines
WW II - 50's: Technology improvement
  - relays → vacuum tubes
  - high-level languages
60's: Miniaturization/packaging
  - transistors
  - integrated circuits
70's: Semantic gap
  - complex instruction sets
  - language support in hardware
  - microcoding
80's: Keep It Simple, Stupid
  - RISC vs CISC debate
  - shift complexity to software
90's: What to do with all of these transistors?
  - large on-chip caches
  - prefetching hardware
  - speculative execution
  - special-purpose instructions
  - multiple processors on a chip

What is Computer Architecture?

It has nothing to do with buildings.

Goals of a computer designer
- control complexity
- maximize performance
- minimize cost?

Use levels of abstraction:
silicon and metal → transistors → gates → flip-flops → registers → functional units → processors → systems

Architecture
- defines the interface between the hardware and the higher levels of software
- requires close interaction between
  * HW designer
  * SW designer

Performance Metrics

System throughput
- work per unit time → rate
- used by system managers

Execution time
- how long to execute your application
- used by system designers and users

Texec = n instrs * (cycles / instr) * (seconds / cycle) = n * CPI * Tclock

Example:
Texec = 900 M instrs * 1.8 cycles/instr * 10 ns/cycle = 16.2 sec
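The execution-time example above can be checked directly; this short sketch (plain Python, using only the slide's numbers) just evaluates the Texec formula:

```python
# Evaluate Texec = n * CPI * Tclock with the numbers from the example above.
n = 900e6        # instructions executed (900 M)
cpi = 1.8        # average cycles per instruction
t_clock = 10e-9  # 10 ns per cycle (a 100 MHz clock)

t_exec = n * cpi * t_clock
print(round(t_exec, 6))  # 16.2 seconds
```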

Improving Performance

Texec = Tclock * n * CPI

1) Improve clock rate, Tclock
2) Reduce total number of instructions executed, n
3) Reduce average number of cycles per instruction, CPI

1) Improving the Clock Rate

Use faster technology
- BiCMOS, ECL, etc.
- smaller features to reduce propagation delay

Pipelining
- reduce the amount of work per clock cycle:
  fetch instr → decode instr → generate effective op addr → fetch operand → execute → write operand
- reduces Tclock

Performance improvement
- overlaps execution of instructions → parallelism
- maximum speedup ≤ pipeline depth
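The speedup bound can be illustrated with a small sketch. The "k + (n - 1) cycles for n instructions on a k-stage pipeline" model is the standard idealization (no hazards), not a figure from the slide, and the instruction counts below are assumed for illustration:

```python
# Ideal k-stage pipeline: n instructions take k + (n - 1) cycles,
# versus n * k cycles without pipelining (no hazards assumed).
def pipeline_speedup(n_instrs, depth):
    unpipelined_cycles = n_instrs * depth
    pipelined_cycles = depth + (n_instrs - 1)
    return unpipelined_cycles / pipelined_cycles

# The speedup approaches the pipeline depth as the run gets long.
print(pipeline_speedup(10, 6))         # short run: well under 6
print(pipeline_speedup(1_000_000, 6))  # long run: approaches 6
```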

Cost of Pipelining

More hardware
- need registers between each pipe segment

Data hazards
- data needed by instr i+x from instr i has not yet been calculated

Branch hazards
- began executing instrs from wrong branch path

Branch Penalty

Instruction i+2 branches to instr j
- branch resolved in stage 5

[Pipeline timing diagram: instructions i, i+1, i+2 flow through the 6 pipeline segments after the start-up latency; the instructions fetched after the branch (i+3, i+4, i+5) are squashed when the branch resolves in stage 5, leaving branch-penalty bubbles before j, j+1, ... enter the pipe.]

Data hazards produce similar pipeline bubbles
- i+3 needs data generated by i+2
- i+3 stalled until i+2 in stage 5

Solutions to hazards
- data bypassing
- instruction reordering
- branch prediction
- delayed branch

2) Reduce Number of Instructions Executed

Texec = Tclock * n * CPI

CISC - Complex Instruction Set Computer
- powerful instrs to reduce instr count
  - complex addressing modes
  - complex loop, move instructions
- But may increase cycle time, Tclock

RISC - Reduced Instruction Set Computer
- small, simple instruction set
- simpler implementation → faster clock
- But must execute more instructions for same work

RISC vs CISC Debate

Motorola 68xxx, Pentium, Pentium-Pro vs MIPS (SGI), PowerPC, Cray

              RISC    CISC
Tclock         ↓       ↑
n (instrs)     ↑       ↓

RISC tends to win
- simple instructions → easier pipelining
- But trade-off is technology dependent
- Market considerations determine actual winner

Special purpose instructions
- HP PA-7100LC has special multimedia instructions
  - reduce total instruction count for MPEG encode/decode
  - exploit pixels < full word width

3) Reduce Average Cycles per Instruction

Texec = Tclock * n * CPI

Decreasing CPI ≡ increasing Instructions Per Cycle (IPC)

Texec = Tclock * n * (1 / IPC)

CPI < 1 → parallelism
- instruction-level
- processor-level

Superscalar Processors

Almost all microprocessors today are superscalar
- use hardware to check for instruction dependences
- issue multiple instructions simultaneously

Instruction window:
  add   r1, r2, r3
  sub   r3, r4, r5
  mult  r6, r7, r8
  cmp   r6, #5
  load  x, r6
  store r6, y

Example: DEC Alpha 21164
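As a rough illustration of the dependence check the issue hardware performs, here is a hypothetical sketch; the instruction encoding and the pairwise test are assumptions for illustration, not the 21164's actual issue logic:

```python
# Hypothetical sketch: two instructions may issue together only if neither
# writes a register the other reads or writes (no RAW/WAR/WAW hazard).
# Each entry is (opcode, destination, sources); the format is assumed.
window = [
    ("add",  "r3", ("r1", "r2")),
    ("sub",  "r5", ("r3", "r4")),  # reads r3, written by the add (RAW)
    ("mult", "r8", ("r6", "r7")),  # independent of the add
]

def can_dual_issue(i1, i2):
    _, d1, s1 = i1
    _, d2, s2 = i2
    return d1 not in s2 and d2 not in s1 and d1 != d2

print(can_dual_issue(window[0], window[1]))  # False: RAW hazard on r3
print(can_dual_issue(window[0], window[2]))  # True: may issue together
```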

VLIW - Very Long Instruction Word

Rely on compiler to detect parallel instructions
- pack independent instructions into one long instruction
- ~ microcode compaction

Simplifies hardware compared to superscalar

But
- compile-time information is incomplete
  - conservatively assume not parallel → execution stalls
- code explosion

[Diagram: one long instruction word whose fields (ALU, ALU, ALU, read, read, write, branch) drive the functional units and register file in parallel.]

Amdahl's Law

Limits maximum performance improvement:

New time = Part affected / Improvement factor + Part unaffected

Travel from Minneapolis to Chicago:

By car:
  420 miles / 60 mi/hr = 7 hr

By taxi + plane + taxi:
  30 mi / 20 mi/hr + 360 mi / 360 mi/hr + 30 mi / 20 mi/hr = 4 hr

⇒ Plane is 6× faster, but net improvement = 1.75×

Limited by slowest component
- Corollary: Make the most common case fast.
- Corollary: Focus on part that produces biggest bang per buck.
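The travel arithmetic works out as follows; these few lines of Python just redo the slide's example:

```python
# Amdahl's Law, travel version: the plane speeds up only the 360-mile leg.
car_hours = 420 / 60                    # 7.0 hr, all at 60 mi/hr
plane_hours = 30/20 + 360/360 + 30/20   # taxi + plane + taxi = 4.0 hr

print(plane_hours)              # 4.0
print(car_hours / plane_hours)  # 1.75: the plane is 6x faster, net 1.75x
```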

Processor-Memory Speed Gap

"But a processor doth not a system make."

- CPU: ~25-50% improvement per year
- DRAM: ~7% improvement per year

[Figure: relative performance improvement of CPU and DRAM by year of introduction, 1980-2000, log scale from 1 to 1000; the CPU curve pulls steadily away from DRAM.]

Memory Delay is the Killer

Speed ratio of memory to CPU → 100×

Texec = TCPU + Tmemory

Faster processors reduce only TCPU
Memory instructions ~ 20% of instructions executed

Amdahl's Law
- If TCPU → 0, system speedup ≤ 5×
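The 5× ceiling follows directly from Amdahl's Law. A quick sketch: the 20% memory fraction is the slide's figure, while the CPU speedup values are assumed for illustration:

```python
# T = T_cpu + T_mem; only T_cpu shrinks. With T_mem ~ 20% of the original
# time, the system speedup is capped at 1 / 0.20 = 5x.
def system_speedup(mem_fraction, cpu_speedup):
    return 1 / ((1 - mem_fraction) / cpu_speedup + mem_fraction)

print(system_speedup(0.20, 10))   # ~3.57x from a 10x faster CPU
print(system_speedup(0.20, 1e9))  # -> 5x as T_cpu -> 0
```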

Reducing Memory Delay

Amortize delay over many references
- vector operations ~ pipelining memory

Exploit locality of references
- caches

Hide the delay
- data prefetching
- context-switching with multiple independent threads

I/O is the Killer

Texec = TCPU + Tmemory + TI/O

I/O delay worse than memory delay
- multimedia computing
- I/O devices

Merging of intrasystem and intersystem communication
- FDDI, ATM, ISDN, Fibre Channel
- WAN: wide-area network
- LAN: local-area network
- PAN: processor-area network

Contemporary Microprocessors

                 DEC Alpha       Sun             SGI MIPS       HP
                 21164           UltraSPARC-1    R10000         PA8000
Avail            1Q95            1Q96            1Q96           4Q95
Tech             0.5µm           0.5µm           0.35µm         0.55µm
Clock            300 MHz         182 MHz         200 MHz        200 MHz
Trans            9.3 M           5.2 M           6.8 M
S'scalar         4-way           4-way           4-way          4-way
On-chip cache    8K I and D      16K I and D     32K I and D    NONE
                 + 96K 2nd-level
SPECint92        345             260             300            360
SPECfp92         505             410             600            550
Power            50W             25W             30W

Trends in Clock Cycle Times

Cray vs microprocessors

Increase IPC
- fine-grained parallelism

Increase number of processors
- coarse-grained parallelism

Data- vs Control-Parallelism

Data-parallel - Single Instruction, Multiple Data (SIMD)
[Diagram: one global control unit driving multiple CPUs through an interconnection network.]

Control-parallel - Multiple Instruction, Multiple Data (MIMD)
[Diagram: each CPU paired with its own control unit, all connected through an interconnection network.]

Multiprocessor Systems

Parallelism is commonplace
- superscalar
- desktop multiprocessors
- networks of workstations

Applications
- scientific/engineering: crash simulation, weather modeling, oceanography, radar, medical imaging
- relational database servers: transaction processing, decision support, data mining

Manufacturers
- Sun Microsystems, Silicon Graphics, Hewlett-Packard, IBM, Intel, Compaq, Cray, Convex, Tandem, Pyramid, ...

Multiprocessor Design Issues

Interconnection network
- topology
- latency and bandwidth

Memory delay
- network delay
- shared-memory: cache coherence problem
- message-passing

Task granularity
- small tasks → more parallelism, but more synch
- large tasks → less synch, but less parallelism

Programming complexity
- automatic compiler parallelization

Improving Computer Performance: Summary

Texec = Tclock * n * (1 / IPC) + Tmemory + TI/O

1) Improve the clock rate, Tclock
- faster technology
- pipelining

2) Reduce the total number of instructions executed, n
- CISC vs RISC debate
- specialized instructions, e.g. multimedia support

3) Increase the parallelism, IPC
- superscalar
- VLIW
- speculative execution
- multiple processors

→ But, memory delay is the killer!
→ But, I/O delay is the killer!

Parting Thoughts

"We haven't much money so we must use our brains." - Lord Rutherford, Cavendish Laboratory
- driven by low-cost, high-volume devices

"A distributed system is one in which I cannot get something done because a machine I've never heard of is down." - Leslie Lamport
- the processor is becoming secondary to the network

"There are 3 types of people: those who can count, and those who cannot." - Robert Arthur
- parallel software is hard

"Even if you are on the right track, you'll get run over if you just sit there." - Will Rogers
- the pace of technology is brutal

Parting Thoughts

"You know you have achieved perfection in design, not when you have nothing more to add, but when you have nothing more to take away." - Antoine de Saint-Exupery

"Everything should be made as simple as possible, but no simpler." - Albert Einstein

High performance requires elegant design