
CS 258

Parallel Computer Architecture

CS 258, Spring 99
David E. Culler
Computer Science Division
U.C. Berkeley
Today’s Goal:

• Introduce you to Parallel Computer Architecture


• Answer your questions about CS 258
• Provide you a sense of the trends that shape the
field



What will you get out of CS258?
• In-depth understanding of the design and
engineering of modern parallel computers
– technology forces
– fundamental architectural issues
» naming, replication, communication, synchronization
– basic design techniques
» cache coherence, protocols, networks, pipelining, …
– methods of evaluation
– underlying engineering trade-offs
• from moderate to very large scale
• across the hardware/software boundary



Will it be worthwhile?
• Absolutely!
– even though few of you will become parallel processor designers
• The fundamental issues and solutions translate
across a wide spectrum of systems.
– Crisp solutions in the context of parallel machines.
• Pioneered at the thin-end of the platform
pyramid on the most-demanding applications
– migrate downward with time
• Understand implications for software
[Diagram: platform pyramid, from most to least demanding: SuperServers, Departmental Servers, Workstations, Personal Computers]



Am I going to read my book to you?

• NO!
• Book provides a framework and complete
background, so lectures can be more interactive.
– You do the reading
– We’ll discuss it
• Projects will go “beyond”



What is Parallel Architecture?
• A parallel computer is a collection of processing
elements that cooperate to solve large problems fast
• Some broad issues:
– Resource Allocation:
» how large a collection?
» how powerful are the elements?
» how much memory?
– Data access, Communication and Synchronization
» how do the elements cooperate and communicate?
» how are data transmitted between processors?
» what are the abstractions and primitives for cooperation?
– Performance and Scalability
» how does it all translate into performance?
» how does it scale?



Why Study Parallel Architecture?

Role of a computer architect:


To design and engineer the various levels of a computer system
to maximize performance and programmability within limits of
technology and cost.

Parallelism:
• Provides alternative to faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing



Why Study it Today?
• History: diverse and innovative organizational
structures, often tied to novel programming
models
• Rapidly maturing under strong technological
constraints
– The “killer micro” is ubiquitous
– Laptops and supercomputers are fundamentally similar!
– Technological trends cause diverse approaches to converge
• Technological trends make parallel computing
inevitable
• Need to understand fundamental principles and
design tradeoffs, not just taxonomies
– Naming, Ordering, Replication, Communication performance



Is Parallel Computing Inevitable?
• Application demands: Our insatiable need for
computing cycles
• Technology Trends
• Architecture Trends
• Economics
• Current trends:
– Today’s microprocessors have multiprocessor support
– Servers and workstations becoming MP: Sun, SGI, DEC,
COMPAQ!...
– Tomorrow’s microprocessors are multiprocessors



Application Trends
• Application demand for performance fuels advances
in hardware, which enables new appl’ns, which...
– Cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder
» most demanding applications

[Diagram: cycle in which more performance enables new applications, which in turn demand more performance]
• Range of performance demands
– Need range of system performance with progressively increasing
cost



Speedup
• Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time, so
• Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
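As an illustration added to these notes (not part of the original slides), here is a minimal sketch of measuring fixed-problem speedup in practice; it assumes a C compiler with OpenMP, and the array-sum kernel is chosen only for brevity:

```c
/* Hypothetical sketch (not from the original slides): measuring
 * fixed-problem speedup with OpenMP.  Build with: cc -fopenmp speedup.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000L

/* Sum the array with the requested number of threads; return elapsed time. */
static double timed_sum(const double *a, int nthreads, double *result) {
    double t0 = omp_get_wtime();
    double sum = 0.0;
    #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];
    *result = sum;
    return omp_get_wtime() - t0;
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;

    double r;
    double t1 = timed_sum(a, 1, &r);              /* Time(1 processor)  */
    int p = omp_get_max_threads();
    double tp = timed_sum(a, p, &r);              /* Time(p processors) */

    /* Fixed-problem speedup, exactly as defined above. */
    printf("Speedup(%d) = %.2f\n", p, t1 / tp);
    free(a);
    return 0;
}
```

The printed ratio is Time(1 processor) / Time(p processors), i.e. the fixed-problem speedup defined on this slide.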



Commercial Computing
• Relies on parallelism for high end
– Computational power determines scale of business that can
be handled
• Databases, online-transaction processing,
decision support, data mining, data warehousing
...
• TPC benchmarks (TPC-C order entry, TPC-D
decision support)
– Explicit scaling criteria provided
– Size of enterprise scales with size of system
– Problem size not fixed as p increases.
– Throughput is performance measure (transactions per minute
or tpm)



TPC-C Results for March 1996
[Figure: TPC-C throughput (tpmC) versus number of processors, March 1996; systems include Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and others, ranging up to roughly 20,000 tpmC and just over 100 processors]
• Parallelism is pervasive
• Small to moderate scale parallelism very important
• Difficult to obtain snapshot to compare across vendor
platforms
Scientific Computing Demand



Engineering Computing Demand
• Large parallel machines a mainstay in many
industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion
efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
» in all of the above
» entertainment (films like Toy Story)
» architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.



Applications: Speech and Image
Processing

[Figure: processing requirements (1 MIPS to 10 GIPS, log scale) of speech and image applications, 1980-1995: sub-band speech coding, speaker verification, CELP speech coding, 200-word isolated speech recognition, telephone number recognition, 1,000-word continuous speech recognition, CIF video / ISDN-CD stereo receiver, 5,000-word continuous speech recognition, HDTV receiver]

• Also CAD, Databases, . . .


• 100 processors gets you 10 years, 1000 gets you 20 !
Is better parallel arch enough?

• AMBER molecular dynamics simulation program


• Starting point was vector code for Cray-1
• 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
Summary of Application Trends
• Transition to parallel computing has occurred for
scientific and engineering computing
• Rapid progress underway in commercial computing
– Database and transactions as well as financial
– Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs,
which are a lot like parallel programs
• Demand for improving throughput on sequential
workloads
– Greatest use of small-scale multiprocessors
• Solid application demand exists and will
increase



- - - Little break - - -



Technology Trends

[Figure: relative performance (log scale) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors; microprocessor performance improves fastest and catches the other classes by the mid-1990s]

• Today the natural building-block is also fastest!


Can’t we just wait for it to get faster?
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
[Figure: integer and floating-point benchmark performance of successive microprocessors, 1987-1992: Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC Alpha]



Technology: A Closer Look
• Basic advance is decreasing feature size (λ)
– Circuits become either faster or lower in power
• Die size is growing too
– Clock rate improves roughly in proportion to the improvement in λ
– Number of transistors improves like λ² (or faster)
• Performance > 100x per decade
– clock rate < 10x; the rest comes from transistor count (rough arithmetic below)
• How to use more transistors?
– Parallelism in processing
» multiple operations per cycle reduces CPI
– Locality in data access
» avoids latency and reduces CPI
» also improves processor utilization
[Diagram: processor (Proc) with cache ($), attached to memory through an interconnect]
– Both need resources, so there is a tradeoff
• Fundamental issue is resource distribution, as in uniprocessors
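As a rough, illustrative consistency check added to these notes (not from the original slides), the two figures above can be combined as

$$
\text{work-per-cycle gain} \;\approx\; \frac{\text{performance gain}}{\text{clock-rate gain}} \;\gtrsim\; \frac{100\times}{10\times} \;=\; 10\times \text{ per decade}
$$

so the extra transistors have to buy roughly another order of magnitude per decade through more operations per cycle and fewer stalled cycles, which is exactly the parallelism/locality tradeoff listed above.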



Growth Rates
[Figure: two log-scale plots, 1970-2005; left: clock rate (MHz), growing about 30% per year; right: transistor count, growing about 40% per year; data points run from the i4004, i8008, i8080, i8086, i80286, and i80386 to the R2000, R3000, R10000, and Pentium]


Architectural Trends
• Architecture translates technology’s gifts into
performance and capability
• Resolves the tradeoff between parallelism and
locality
– Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip
connect
– Tradeoffs may change with scale and technology advances

• Understanding microprocessor architectural trends
=> Helps build intuition about design issues in parallel machines
=> Shows fundamental role of parallelism even in “sequential” computers



Phases in “VLSI” Generation
Bit-level parallelism -> instruction-level parallelism -> thread-level parallelism (?)
[Figure: transistor count per chip, 1970-2005 (log scale, 1,000 to 100,000,000), from the i4004, i8008, i8080, i8086, and i80286 through the i80386, R2000, R3000, Pentium, and R10000, with the three phases marked along the curve]



Architectural Trends
• Greatest trend in VLSI generation is increase in
parallelism
– Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
» slows after 32-bit
» adoption of 64-bit now under way; 128-bit is far off (not a performance issue)
» great inflection point when 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction level parallelism
» pipelining and simple instruction sets, + compiler advances
(RISC)
» on-chip caches and functional units => superscalar execution
» greater sophistication: out of order execution, speculation,
prediction
• to deal with control transfer and latency problems
– Next step: thread level parallelism



How far will ILP go?
[Figure: results for an idealized machine; left: fraction of total cycles (%) versus number of instructions issued (0 to 6+); right: speedup (scale 0 to 3) versus instructions issued per cycle (0 to 15)]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
– real caches and non-zero miss latencies
(a source-level sketch of ILP follows)
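To make instruction-level parallelism concrete, here is a small source-level sketch added to these notes (not from the original slides); the function names and the four-way unrolling factor are only illustrative. The first loop serializes every addition on one accumulator, while the second carries four independent dependence chains that an out-of-order, superscalar core can issue in parallel:

```c
/* Hypothetical illustration (not from the original slides) of exposing ILP
 * by breaking a serial dependence chain into several independent ones.   */
#include <stddef.h>

double sum1(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                     /* each add waits for the previous one */
    return s;
}

double sum4(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {  /* four independent accumulator chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                 /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Compilers and the hardware's renaming and speculation machinery perform this kind of transformation automatically; the study above asks how much independent work is there to find.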



Thread-Level Parallelism “on board”

[Diagram: several processors (Proc) sharing a single memory (MEM) over a bus]

• Micro on a chip makes it natural to connect many to shared memory
– dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate the bus, then bus technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
(a minimal shared-memory sketch follows)
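As a minimal sketch added to these notes (not part of the original slides), assuming POSIX threads, this is the shared-memory style such bus-based machines support: the threads cooperate simply by loading and storing the same arrays, and pthread_join is the synchronization point.

```c
/* Hypothetical sketch (not from the original slides): threads cooperating
 * through shared memory.  Build with: cc -pthread smp_sum.c              */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double a[N];                   /* shared data: one memory image       */
static double partial[NTHREADS];      /* one slot per thread, no lock needed */

static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;
    for (long i = id; i < N; i += NTHREADS)   /* interleaved partition */
        s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) a[i] = 1.0;

    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);

    double sum = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(t[id], NULL);            /* wait, then read shared result */
        sum += partial[id];
    }
    printf("sum = %.0f\n", sum);
    return 0;
}
```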



What about Multiprocessor Trends?
[Figure: processor counts of shared-memory multiprocessor servers, 1984-1998: from the Sequent B8000/B2100 (Symmetry), SGI PowerSeries, and Sun SS690MP class through the SGI Challenge and PowerChallenge/XL, Sun SC2000/SS1000/E6000, AlphaServer 2100/8400, HP K400, and Pentium Pro systems, up to the 64-processor Cray CS6400 and Sun E10000]



Bus Bandwidth
[Figure: shared-bus bandwidth (MB/s, log scale) for the same systems, 1984-1998: from on the order of 10-100 MB/s (Sequent B8000/B2100, SGI PowerSeries, SS690MP) to roughly 1,000-2,000 MB/s (SGI Challenge and PowerChallenge XL, AlphaServer 8400, Sun SC2000E/SS1000E) and about 10,000 MB/s for the Sun E10000]
What about Storage Trends?
• Divergence between memory capacity and speed even more
pronounced
– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases effective size of each level of hierarchy,
without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within memory chip; follow with fast
pipelined transfer across narrower interface
– Buffer caches most recently accessed data
• Disks too: Parallel disks plus caching



Economics
• Commodity microprocessors not only fast but CHEAP
– Development costs tens of millions of dollars
– BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the commodity
building block
• Multiprocessors being pushed by software vendors (e.g.
database) as well as hardware vendors
• Standardization makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?
• Multiprocessor on a chip?



Can we see some hard evidence?



Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and
techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in 70s
– Microprocessors have made huge gains in floating-point performance
» high clock rates
» pipelined floating point units (e.g., multiply-add every cycle)
» instruction-level parallelism
» effective use of caches (e.g., automatic blocking)
– Plus economics

• Large-scale multiprocessors replace vector supercomputers



Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK uniprocessor performance (MFLOPS, log scale), 1975-2000, for matrix sizes n = 100 and n = 1,000; Cray vector machines (CRAY 1s, X-MP/14se, X-MP/416, Y-MP, C90, T94) versus microprocessor-based systems (Sun 4/260, MIPS M/120, M/2000, R4400, IBM RS6000/540, Power2/990, HP 9000/750, 9000/735, DEC Alpha, Alpha AXP, DEC 8200); the micros close most of the gap by the mid-1990s]



Raw Parallel Performance: LINPACK
[Figure: LINPACK parallel performance (GFLOPS, log scale), 1985-1996; CRAY peak (X-MP/416, Y-MP/832, C90(16), T932(32)) versus MPP peak (nCUBE/2 (1024), iPSC/860, CM-2, CM-200, Delta, CM-5, Paragon XP/S, Paragon XP/S MP with 1,024 and 6,768 processors, T3D, ASCI Red), with MPP peak climbing past 1,000 GFLOPS]

• Even vector Crays became parallel
– X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs too (T3D, T3E)
500 Fastest Computers
[Figure: number of systems among the 500 fastest computers by type (MPP, PVP, SMP), 11/93 through 11/96; MPP and SMP counts grow over the period while parallel vector processor (PVP) counts decline]



Summary: Why Parallel Architecture?
• Increasingly attractive
– Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Multiprocessor servers
– Large-scale multiprocessors (“MPPs”)
• Focus of this class: multiprocessor level of parallelism
• Same story from memory system perspective
– Increase bandwidth, reduce average latency with many local
memories
• Spectrum of parallel architectures make sense
– Different cost, performance and scalability



Where is Parallel Arch Going?
Old view: Divergent architectures, no predictable pattern of growth.

[Diagram: application software and system software layered over divergent architectures: systolic arrays, SIMD, message passing, dataflow, shared memory]

• Uncertainty of direction paralyzed parallel software development!


Today
• Extension of “computer architecture” to support communication and cooperation
– Instruction Set Architecture plus Communication Architecture

• Defines
– Critical abstractions, boundaries, and primitives (interfaces)
– Organizational structures that implement interfaces (hw or sw)

• Compilers, libraries and OS are important bridges today



Modern Layered Framework

[Diagram: layered framework]
– Parallel applications: CAD, database, scientific modeling, ...
– Programming models: multiprogramming, shared address, message passing, data parallel
– Communication abstraction (realized by compilation or library); user/system boundary below
– Operating systems support; hardware/software boundary below
– Communication hardware
– Physical communication medium
(a message-passing sketch follows)
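For contrast with the shared-address-space model, here is a minimal message-passing sketch added to these notes (not part of the original slides); it assumes an MPI installation, and the values are only illustrative. Each process owns its data privately, and cooperation goes through the communication abstraction, here an explicit MPI_Reduce:

```c
/* Hypothetical sketch (not from the original slides): cooperation via
 * message passing.  Build with: mpicc reduce.c                          */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process computes a private partial result ...                */
    double local = 1.0 * rank;

    /* ... and the library's reduction combines them with explicit
     * communication rather than loads/stores to a shared address space. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %g over %d processes\n", total, size);

    MPI_Finalize();
    return 0;
}
```

Run with, for example, mpirun -np 4 ./a.out; the same reduction written for the shared-address-space model appears in the pthreads sketch earlier.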



How will we spend our time?

http://www.cs.berkeley.edu/~culler/cs258-s99/schedule.html



How will grading work?
• 30% homeworks (6)
• 30% exam
• 30% project (teams of 2)
• 10% participation



Any other questions?

