
CS 258

Parallel Computer Architecture

CS 258, Spring 99
David E. Culler
Computer Science Division
U.C. Berkeley
Today’s Goal:

• Introduce you to Parallel Computer Architecture


• Answer your questions about CS 258
• Provide you a sense of the trends that shape the
field



What will you get out of CS258?
• In-depth understanding of the design and
engineering of modern parallel computers
– technology forces
– fundamental architectural issues
» naming, replication, communication, synchronization
– basic design techniques
» cache coherence, protocols, networks, pipelining, …
– methods of evaluation
– underlying engineering trade-offs
• from moderate to very large scale
• across the hardware/software boundary



Will it be worthwhile?
• Absolutely!
– even though few of you will become parallel processor designers
• The fundamental issues and solutions translate
across a wide spectrum of systems.
– Crisp solutions in the context of parallel machines.
• Pioneered at the thin-end of the platform
pyramid on the most-demanding applications
– migrate downward with time
• Understand implications for software
[Diagram: platform pyramid, from most to least demanding: SuperServers, Departmental Servers, Workstations, Personal Computers]



Am I going to read my book to you?

• NO!
• Book provides a framework and complete
background, so lectures can be more interactive.
– You do the reading
– We’ll discuss it
• Projects will go “beyond”



What is Parallel Architecture?
• A parallel computer is a collection of processing
elements that cooperate to solve large problems fast
• Some broad issues:
– Resource Allocation:
» how large a collection?
» how powerful are the elements?
» how much memory?
– Data access, Communication and Synchronization
» how do the elements cooperate and communicate?
» how are data transmitted between processors?
» what are the abstractions and primitives for cooperation?
– Performance and Scalability
» how does it all translate into performance?
» how does it scale?



Why Study Parallel Architecture?

Role of a computer architect:


To design and engineer the various levels of a computer system
to maximize performance and programmability within limits of
technology and cost.

Parallelism:
• Provides alternative to faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing



Why Study it Today?
• History: diverse and innovative organizational
structures, often tied to novel programming
models
• Rapidly maturing under strong technological
constraints
– The “killer micro” is ubiquitous
– Laptops and supercomputers are fundamentally similar!
– Technological trends cause diverse approaches to converge
• Technological trends make parallel computing
inevitable
• Need to understand fundamental principles and
design tradeoffs, not just taxonomies
– Naming, Ordering, Replication, Communication performance



Is Parallel Computing Inevitable?
• Application demands: Our insatiable need for
computing cycles
• Technology Trends
• Architecture Trends
• Economics
• Current trends:
– Today’s microprocessors have multiprocessor support
– Servers and workstations becoming MP: Sun, SGI, DEC,
COMPAQ!...
– Tomorrow’s microprocessors are multiprocessors



Application Trends
• Application demand for performance fuels advances
in hardware, which enables new appl’ns, which...
– Cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder
» most demanding applications

[Diagram: cycle in which more performance enables new applications, which in turn demand more performance]
• Range of performance demands
– Need range of system performance with progressively increasing
cost



Speedup
• Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time, so
• Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
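As an illustration added to these notes (not part of the original slides), here is a minimal sketch of measuring fixed-problem speedup in practice; it assumes a C compiler with OpenMP, and the array-sum kernel is chosen only for brevity:

```c
/* Hypothetical sketch (not from the original slides): measuring
 * fixed-problem speedup with OpenMP.  Build with: cc -fopenmp speedup.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000L

/* Sum the array with the requested number of threads; return elapsed time. */
static double timed_sum(const double *a, int nthreads, double *result) {
    double t0 = omp_get_wtime();
    double sum = 0.0;
    #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];
    *result = sum;
    return omp_get_wtime() - t0;
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;

    double r;
    double t1 = timed_sum(a, 1, &r);              /* Time(1 processor)  */
    int p = omp_get_max_threads();
    double tp = timed_sum(a, p, &r);              /* Time(p processors) */

    /* Fixed-problem speedup, exactly as defined above. */
    printf("Speedup(%d) = %.2f\n", p, t1 / tp);
    free(a);
    return 0;
}
```

The printed ratio is Time(1 processor) / Time(p processors), i.e. the fixed-problem speedup defined on this slide.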



Commercial Computing
• Relies on parallelism for high end
– Computational power determines scale of business that can
be handled
• Databases, online-transaction processing,
decision support, data mining, data warehousing
...
• TPC benchmarks (TPC-C order entry, TPC-D
decision support)
– Explicit scaling criteria provided
– Size of enterprise scales with size of system
– Problem size not fixed as p increases.
– Throughput is performance measure (transactions per minute
or tpm)



TPC-C Results for March 1996
[Figure: TPC-C throughput (tpmC) versus number of processors, March 1996; systems include Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and others, ranging up to roughly 20,000 tpmC and just over 100 processors]
• Parallelism is pervasive
• Small to moderate scale parallelism very important
• Difficult to obtain snapshot to compare across vendor
platforms
Scientific Computing Demand



Engineering Computing Demand
• Large parallel machines a mainstay in many
industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion
efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
» in all of the above
» entertainment (films like Toy Story)
» architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.



Applications: Speech and Image
Processing

[Figure: processing requirements (1 MIPS to 10 GIPS, log scale) of speech and image applications, 1980-1995: sub-band speech coding, speaker verification, CELP speech coding, 200-word isolated speech recognition, telephone number recognition, 1,000-word continuous speech recognition, CIF video / ISDN-CD stereo receiver, 5,000-word continuous speech recognition, HDTV receiver]

• Also CAD, Databases, . . .


• 100 processors gets you 10 years, 1000 gets you 20 !
Is better parallel arch enough?

• AMBER molecular dynamics simulation program


• Starting point was vector code for Cray-1
• 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
Summary of Application Trends
• Transition to parallel computing has occurred for
scientific and engineering computing
• Rapid progress underway in commercial computing
– Database and transactions as well as financial
– Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs,
which are a lot like parallel programs
• Demand for improving throughput on sequential
workloads
– Greatest use of small-scale multiprocessors
• Solid application demand exists and will
increase



- - - Little break - - -



Technology Trends

[Figure: relative performance (log scale) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors; microprocessor performance improves fastest and catches the other classes by the mid-1990s]

• Today the natural building-block is also fastest!


Can’t we just wait for it to get faster?
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
[Figure: integer and floating-point benchmark performance of successive microprocessors, 1987-1992: Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC Alpha]



Technology: A Closer Look
• Basic advance is decreasing feature size (λ)
– Circuits become either faster or lower in power
• Die size is growing too
– Clock rate improves roughly in proportion to the improvement in λ
– Number of transistors improves like λ² (or faster)
• Performance > 100x per decade
– clock rate < 10x; the rest comes from transistor count (rough arithmetic below)
• How to use more transistors?
– Parallelism in processing
» multiple operations per cycle reduces CPI
– Locality in data access
» avoids latency and reduces CPI
» also improves processor utilization
[Diagram: processor (Proc) with cache ($), attached to memory through an interconnect]
– Both need resources, so there is a tradeoff
• Fundamental issue is resource distribution, as in uniprocessors
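As a rough, illustrative consistency check added to these notes (not from the original slides), the two figures above can be combined as

$$
\text{work-per-cycle gain} \;\approx\; \frac{\text{performance gain}}{\text{clock-rate gain}} \;\gtrsim\; \frac{100\times}{10\times} \;=\; 10\times \text{ per decade}
$$

so the extra transistors have to buy roughly another order of magnitude per decade through more operations per cycle and fewer stalled cycles, which is exactly the parallelism/locality tradeoff listed above.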



Growth Rates
[Figure: two log-scale plots, 1970-2005; left: clock rate (MHz), growing about 30% per year; right: transistor count, growing about 40% per year; data points run from the i4004, i8008, i8080, i8086, i80286, and i80386 to the R2000, R3000, R10000, and Pentium]


Architectural Trends
• Architecture translates technology’s gifts into
performance and capability
• Resolves the tradeoff between parallelism and
locality
– Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip
connect
– Tradeoffs may change with scale and technology advances

• Understanding microprocessor architectural trends
=> Helps build intuition about design issues in parallel machines
=> Shows fundamental role of parallelism even in “sequential” computers



Phases in “VLSI” Generation
Bit-level parallelism -> instruction-level parallelism -> thread-level parallelism (?)
[Figure: transistor count per chip, 1970-2005 (log scale, 1,000 to 100,000,000), from the i4004, i8008, i8080, i8086, and i80286 through the i80386, R2000, R3000, Pentium, and R10000, with the three phases marked along the curve]



Architectural Trends
• Greatest trend in VLSI generation is increase in
parallelism
– Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
» slows after 32-bit
» adoption of 64-bit now under way; 128-bit is far off (not a performance issue)
» great inflection point when 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction level parallelism
» pipelining and simple instruction sets, + compiler advances
(RISC)
» on-chip caches and functional units => superscalar execution
» greater sophistication: out of order execution, speculation,
prediction
• to deal with control transfer and latency problems
– Next step: thread level parallelism



How far will ILP go?
[Figure: results for an idealized machine; left: fraction of total cycles (%) versus number of instructions issued (0 to 6+); right: speedup (scale 0 to 3) versus instructions issued per cycle (0 to 15)]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
– real caches and non-zero miss latencies
(a source-level sketch of ILP follows)
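To make instruction-level parallelism concrete, here is a small source-level sketch added to these notes (not from the original slides); the function names and the four-way unrolling factor are only illustrative. The first loop serializes every addition on one accumulator, while the second carries four independent dependence chains that an out-of-order, superscalar core can issue in parallel:

```c
/* Hypothetical illustration (not from the original slides) of exposing ILP
 * by breaking a serial dependence chain into several independent ones.   */
#include <stddef.h>

double sum1(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                     /* each add waits for the previous one */
    return s;
}

double sum4(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {  /* four independent accumulator chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                 /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Compilers and the hardware's renaming and speculation machinery perform this kind of transformation automatically; the study above asks how much independent work is there to find.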



Thread-Level Parallelism “on board”

[Diagram: several processors (Proc) sharing a single memory (MEM) over a bus]

• Micro on a chip makes it natural to connect many to shared memory
– dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate the bus, then bus technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
(a minimal shared-memory sketch follows)
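As a minimal sketch added to these notes (not part of the original slides), assuming POSIX threads, this is the shared-memory style such bus-based machines support: the threads cooperate simply by loading and storing the same arrays, and pthread_join is the synchronization point.

```c
/* Hypothetical sketch (not from the original slides): threads cooperating
 * through shared memory.  Build with: cc -pthread smp_sum.c              */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double a[N];                   /* shared data: one memory image       */
static double partial[NTHREADS];      /* one slot per thread, no lock needed */

static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;
    for (long i = id; i < N; i += NTHREADS)   /* interleaved partition */
        s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) a[i] = 1.0;

    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);

    double sum = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(t[id], NULL);            /* wait, then read shared result */
        sum += partial[id];
    }
    printf("sum = %.0f\n", sum);
    return 0;
}
```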



What about Multiprocessor Trends?
[Figure: processor counts of shared-memory multiprocessor servers, 1984-1998: from the Sequent B8000/B2100 (Symmetry), SGI PowerSeries, and Sun SS690MP class through the SGI Challenge and PowerChallenge/XL, Sun SC2000/SS1000/E6000, AlphaServer 2100/8400, HP K400, and Pentium Pro systems, up to the 64-processor Cray CS6400 and Sun E10000]



Bus Bandwidth
[Figure: shared-bus bandwidth (MB/s, log scale) for the same systems, 1984-1998: from on the order of 10-100 MB/s (Sequent B8000/B2100, SGI PowerSeries, SS690MP) to roughly 1,000-2,000 MB/s (SGI Challenge and PowerChallenge XL, AlphaServer 8400, Sun SC2000E/SS1000E) and about 10,000 MB/s for the Sun E10000]
What about Storage Trends?
• Divergence between memory capacity and speed even more
pronounced
– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases effective size of each level of hierarchy,
without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within memory chip; follow with fast
pipelined transfer across narrower interface
– Buffer caches most recently accessed data
• Disks too: Parallel disks plus caching



Economics
• Commodity microprocessors not only fast but CHEAP
– Development costs tens of millions of dollars
– BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the commodity
building block
• Multiprocessors being pushed by software vendors (e.g.
database) as well as hardware vendors
• Standardization makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?
• Multiprocessor on a chip?



Can we see some hard evidence?



Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and
techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in 70s
– Microprocessors have made huge gains in floating-point performance
» high clock rates
» pipelined floating point units (e.g., multiply-add every cycle)
» instruction-level parallelism
» effective use of caches (e.g., automatic blocking)
– Plus economics

• Large-scale multiprocessors replace vector supercomputers



Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK uniprocessor performance (MFLOPS, log scale), 1975-2000, for matrix sizes n = 100 and n = 1,000; Cray vector machines (CRAY 1s, X-MP/14se, X-MP/416, Y-MP, C90, T94) versus microprocessor-based systems (Sun 4/260, MIPS M/120, M/2000, R4400, IBM RS6000/540, Power2/990, HP 9000/750, 9000/735, DEC Alpha, Alpha AXP, DEC 8200); the micros close most of the gap by the mid-1990s]



Raw Parallel Performance: LINPACK
[Figure: LINPACK parallel performance (GFLOPS, log scale), 1985-1996; CRAY peak (X-MP/416, Y-MP/832, C90(16), T932(32)) versus MPP peak (nCUBE/2 (1024), iPSC/860, CM-2, CM-200, Delta, CM-5, Paragon XP/S, Paragon XP/S MP with 1,024 and 6,768 processors, T3D, ASCI Red), with MPP peak climbing past 1,000 GFLOPS]

• Even vector Crays became parallel
– X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs too (T3D, T3E)
500 Fastest Computers
[Figure: number of systems among the 500 fastest computers by type (MPP, PVP, SMP), 11/93 through 11/96; MPP and SMP counts grow over the period while parallel vector processor (PVP) counts decline]



Summary: Why Parallel Architecture?
• Increasingly attractive
– Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Multiprocessor servers
– Large-scale multiprocessors (“MPPs”)
• Focus of this class: multiprocessor level of parallelism
• Same story from memory system perspective
– Increase bandwidth, reduce average latency with many local
memories
• Spectrum of parallel architectures make sense
– Different cost, performance and scalability



Where is Parallel Arch Going?
Old view: Divergent architectures, no predictable pattern of growth.

[Diagram: application software and system software layered over divergent architectures: systolic arrays, SIMD, message passing, dataflow, shared memory]

• Uncertainty of direction paralyzed parallel software development!


Today
• Extension of “computer architecture” to support communication and cooperation
– Instruction Set Architecture plus Communication Architecture

• Defines
– Critical abstractions, boundaries, and primitives (interfaces)
– Organizational structures that implement interfaces (hw or sw)

• Compilers, libraries and OS are important bridges today



Modern Layered Framework

[Diagram: layered framework]
– Parallel applications: CAD, database, scientific modeling, ...
– Programming models: multiprogramming, shared address, message passing, data parallel
– Communication abstraction (realized by compilation or library); user/system boundary below
– Operating systems support; hardware/software boundary below
– Communication hardware
– Physical communication medium
(a message-passing sketch follows)
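For contrast with the shared-address-space model, here is a minimal message-passing sketch added to these notes (not part of the original slides); it assumes an MPI installation, and the values are only illustrative. Each process owns its data privately, and cooperation goes through the communication abstraction, here an explicit MPI_Reduce:

```c
/* Hypothetical sketch (not from the original slides): cooperation via
 * message passing.  Build with: mpicc reduce.c                          */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process computes a private partial result ...                */
    double local = 1.0 * rank;

    /* ... and the library's reduction combines them with explicit
     * communication rather than loads/stores to a shared address space. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %g over %d processes\n", total, size);

    MPI_Finalize();
    return 0;
}
```

Run with, for example, mpirun -np 4 ./a.out; the same reduction written for the shared-address-space model appears in the pthreads sketch earlier.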



How will we spend our time?

http://www.cs.berkeley.edu/~culler/cs258-s99/schedule.html



How will grading work?
• 30% homeworks (6)
• 30% exam
• 30% project (teams of 2)
• 10% participation



Any other questions?

