
Parallel Computing Explained

Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By
S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.3 Parallelism in Computers
1.1.3.4 Disk Parallelism
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
Parallel Computing Overview
Who should read this chapter?
New Users – to learn concepts and terminology.
Intermediate Users – for review or reference.
Management Staff – to understand the basic concepts –
even if you don’t plan to do any programming.
Note: Advanced users may opt to skip this chapter.
Introduction to Parallel Computing
High performance parallel computers
can solve large problems much faster than a desktop
computer
 fast CPUs, large memory, high speed interconnects, and high speed
input/output
 able to speed up computations
 by making the sequential components run faster
 by doing more operations in parallel
High performance parallel computers are in demand
need for tremendous computational capabilities in science,
engineering, and business.
 require gigabytes/terabytes of memory and gigaflops/teraflops of
performance
 scientists are striving for petascale performance
Introduction to Parallel Computing
High performance parallel computers are used in a wide variety of disciplines.
Meteorologists: prediction of tornadoes and thunderstorms
Computational biologists: analyze DNA sequences
Pharmaceutical companies: design of new drugs
Oil companies: seismic exploration
Wall Street: analysis of financial markets
NASA: aerospace vehicle design
Entertainment industry: special effects in movies and
commercials
These complex scientific and business applications all
need to perform computations on large datasets or large
equations.
Parallelism in our Daily Lives
There are two types of processes that occur in computers
and in our daily lives:
Sequential processes
 occur in a strict order
 it is not possible to do the next step until the current one is
completed.
 Examples
 The passage of time: the sun rises and the sun sets.
 Writing a term paper: pick the topic, research, and write the paper.
Parallel processes
 many events happen simultaneously
 Examples
 Plant growth in the springtime
 An orchestra
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.2.1 Data Parallelism
1.1.2.2 Task Parallelism
1.1.3 Parallelism in Computers
1.1.3.4 Disk Parallelism
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
Parallelism in Computer Programs
 Conventional wisdom:
Computer programs are sequential in nature
Only a small subset of them lend themselves to parallelism.
Algorithm: the "sequence of steps" necessary to do a computation.
For the first 30 years of computer use, programs were run sequentially.
 The 1980's saw great successes with parallel computers.
Dr. Geoffrey Fox published a book entitled Parallel Computing Works!
Many scientific accomplishments resulted from parallel computing.
 The new conventional wisdom:
Computer programs are parallel in nature.
Only a small subset of them need to be run sequentially.
Parallel Computing
 What a computer does when it carries out more than one
computation at a time using more than one processor.
 By using many processors at once, we can speed up the execution.
If one processor can perform the arithmetic in time t, then ideally
p processors can perform the arithmetic in time t/p.
What if I use 100 processors? What if I use 1000 processors?
 Almost every program has some form of parallelism.
You need to determine whether your data or your program can be
partitioned into independent pieces that can be run
simultaneously.
Decomposition is the name given to this partitioning process.
 Types of parallelism:
data parallelism
task parallelism.
Data Parallelism
The same code segment runs concurrently on each
processor, but each processor is assigned its own part
of the data to work on.
Do loops (in Fortran) define the parallelism.
The iterations must be independent of each other.
Data parallelism is called "fine grain parallelism"
because the computational work is spread into many
small subtasks.
Example
Dense linear algebra, such as matrix multiplication, is a
perfect candidate for data parallelism.
An example of data parallelism
Original Sequential Code:

      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO

Parallel Code:

!$OMP PARALLEL DO
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
!$OMP END PARALLEL DO
Quick Intro to OpenMP
OpenMP is a portable standard for parallel directives
covering both data and task parallelism.
More information about OpenMP is available on the
OpenMP website.
We will have a lecture on Introduction to OpenMP later.
With OpenMP, the loop that is performed in parallel is
the loop that immediately follows the Parallel Do
directive.
In our sample code, it's the K loop:
 DO K=1,N
OpenMP Loop Parallelism
Iteration-processor assignments (here N=20 on 4 processors):

Processor  Iterations of K  Data Elements
proc0      K=1:5            A(I, 1:5),   B(1:5, J)
proc1      K=6:10           A(I, 6:10),  B(6:10, J)
proc2      K=11:15          A(I, 11:15), B(11:15, J)
proc3      K=16:20          A(I, 16:20), B(16:20, J)

The code segment running on each processor:

DO J=1,N
  DO I=1,N
    C(I,J) = C(I,J) + A(I,K)*B(K,J)
  END DO
END DO
OpenMP Style of Parallelism
can be done incrementally as follows:
1. Parallelize the most computationally intensive loop.
2. Compute performance of the code.
3. If performance is not satisfactory, parallelize another
loop.
4. Repeat steps 2 and 3 as many times as needed.
The ability to perform incremental parallelism is
considered a positive feature of data parallelism.
It is contrasted with the MPI (Message Passing
Interface) style of parallelism, which is an "all or
nothing" approach.
Task Parallelism
 Task parallelism may be thought of as the opposite of data
parallelism.
 Instead of the same operations being performed on different
parts of the data, each process performs different operations.
 You can use task parallelism when your program can be split
into independent pieces, often subroutines, that can be
assigned to different processors and run concurrently.
 Task parallelism is called "coarse grain" parallelism because
the computational work is spread into just a few subtasks.
 More code is run in parallel because the parallelism is
implemented at a higher level than in data parallelism.
 Task parallelism is often easier to implement and has less
overhead than data parallelism.
Task Parallelism
The abstract code shown in the diagram is decomposed
into 4 independent code segments that are labeled A,
B, C, and D. The right hand side of the diagram
illustrates the 4 code segments running concurrently.
Task Parallelism
Original Code:

program main
  code segment labeled A
  code segment labeled B
  code segment labeled C
  code segment labeled D
end

Parallel Code:

program main
!$OMP PARALLEL
!$OMP SECTIONS
  code segment labeled A
!$OMP SECTION
  code segment labeled B
!$OMP SECTION
  code segment labeled C
!$OMP SECTION
  code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
end
OpenMP Task Parallelism
With OpenMP, the code that follows each
SECTION(S) directive is allocated to a different
processor. In our sample parallel code, the allocation
of code segments to processors is as follows.
Processor  Code
proc0      code segment labeled A
proc1      code segment labeled B
proc2      code segment labeled C
proc3      code segment labeled D
Parallelism in Computers
How parallelism is exploited and enhanced within the
operating system and hardware components of a
parallel computer:
operating system
arithmetic
memory
disk
Operating System Parallelism
 All of the commonly used parallel computers run a version
of the Unix operating system. In the table below each OS
listed is in fact Unix, but the name of the Unix OS varies
with each vendor.
Parallel Computer     OS
SGI Origin2000        IRIX
HP V-Class            HP-UX
Cray T3E              Unicos
IBM SP                AIX
Workstation Clusters  Linux

 For more information about Unix, a collection of Unix documents is available.
Two Unix Parallelism Features
background processing facility
With the Unix background processing facility you can run
the executable a.out in the background and simultaneously
view the man page for the etime function in the foreground.
There are two Unix commands that accomplish this:
a.out > results &
man etime
cron feature
With the Unix cron feature you can submit a job that will
run at a later time.
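For example, a crontab entry like the following (the path and time here are hypothetical, shown only for illustration) would run a.out every night at 2 a.m. and save its output:
0 2 * * * /home/user/a.out > /home/user/results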
Arithmetic Parallelism
 Multiple execution units
 facilitate arithmetic parallelism.
 The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are
each done in a separate execution unit. This allows several execution units to
be used simultaneously, because the execution units operate independently.
 Fused multiply and add
 is another parallel arithmetic feature.
 Parallel computers are able to overlap multiply and add. This arithmetic is
named MultiplyADD (MADD) on SGI computers, and Fused Multiply Add
(FMA) on HP computers. In either case, the two arithmetic operations are
overlapped and can complete in hardware in one computer cycle.
 Superscalar arithmetic
 is the ability to issue several arithmetic operations per computer cycle.
 It makes use of the multiple, independent execution units. On superscalar
computers there are multiple slots per cycle that can be filled with work. This
gives rise to the name n-way superscalar, where n is the number of slots per
cycle. The SGI Origin2000 is called a 4-way superscalar computer.
Memory Parallelism
 memory interleaving
 memory is divided into multiple banks, and consecutive data elements are
interleaved among them. For example if your computer has 2 memory
banks, then data elements with even memory addresses would fall into
one bank, and data elements with odd memory addresses into the other.
 multiple memory ports
 Port means a bi-directional memory pathway. When the data elements
that are interleaved across the memory banks are needed, the multiple
memory ports allow them to be accessed and fetched in parallel, which
increases the memory bandwidth (MB/s or GB/s).
 multiple levels of the memory hierarchy
 There is global memory that any processor can access. There is memory
that is local to a partition of the processors. Finally there is memory that is
local to a single processor, that is, the cache memory and the memory
elements held in registers.
 Cache memory
 Cache is a small memory that has fast access compared with the larger
main memory and serves to keep the faster processor filled with data.
Memory Parallelism
Diagrams: Memory Hierarchy; Cache Memory
Disk Parallelism
RAID (Redundant Array of Inexpensive Disks)
RAID disks are on most parallel computers.
The advantage of a RAID disk system is that it provides a
measure of fault tolerance.
If one of the disks goes down, it can be swapped out, and
the RAID disk system remains operational.
Disk Striping
When a data set is written to disk, it is striped across the
RAID disk system. That is, it is broken into pieces that are
written simultaneously to the different disks in the RAID
disk system. When the same data set is read back in, the
pieces are read in parallel, and the full data set is
reassembled in memory.
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.3 Parallelism in Computers
1.1.3.4 Disk Parallelism
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
Performance Measures
 Peak Performance
 is the top speed at which the computer can operate.
 It is a theoretical upper limit on the computer's performance.
 Sustained Performance
 is the highest consistently achieved speed.
 It is a more realistic measure of computer performance.
 Cost Performance
 is used to determine if the computer is cost effective.
 MHz
 is a measure of the processor speed.
 The processor speed is commonly measured in millions of cycles per
second, where a computer cycle is defined as the shortest time in which
some work can be done.
 MIPS
 is a measure of how quickly the computer can issue instructions.
 Millions of instructions per second is abbreviated as MIPS, where the
instructions are computer instructions such as: memory reads and writes,
logical operations , floating point operations, integer operations, and
branch instructions.
Performance Measures
 Mflops (Millions of floating point operations per second)
 measures how quickly a computer can perform floating-point
operations such as add, subtract, multiply, and divide.
 Speedup
 measures the benefit of parallelism.
 It shows how your program scales as you compute with more
processors, compared to the performance on one processor.
 Ideal speedup happens when the performance gain is linearly
proportional to the number of processors used (see the formula sketch at the end of this list).
 Benchmarks
 are used to rate the performance of parallel computers and parallel
programs.
 A well known benchmark that is used to compare parallel computers
is the Linpack benchmark.
 Based on the Linpack results, a list is produced of the Top 500
Supercomputer Sites. This list is maintained by the University of
Tennessee and the University of Mannheim.
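As a sketch of the speedup measure above (using the same notation as the Amdahl's Law discussion later in these slides): if T(1) is the runtime on one processor and T(p) is the runtime on p processors, then
Speedup S(p) = T(1) / T(p)
and ideal (linear) speedup means S(p) = p.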
More Parallelism Issues
 Load balancing
 is the technique of evenly dividing the workload among the processors.
 For data parallelism it involves how iterations of loops are allocated to
processors.
 Load balancing is important because the total time for the program to
complete is the time spent by the longest executing thread.
 The problem size
 must be large and must be able to grow as you compute with more processors.
 In order to get the performance you expect from a parallel computer you need
to run a large application with large data sizes, otherwise the overhead of
passing information between processors will dominate the calculation time.
 Good software tools
 are essential for users of high performance parallel computers.
 These tools include:
 parallel compilers
 parallel debuggers
 performance analysis tools
 parallel math software
 The availability of a broad set of application software is also important.
More Parallelism Issues
 The high performance computing market is risky and chaotic.
Many supercomputer vendors are no longer in business, making
the portability of your application very important.
 A workstation farm
is defined as a fast network connecting heterogeneous workstations.
The individual workstations serve as desktop systems for their
owners.
When they are idle, large problems can take advantage of the unused
cycles in the whole system.
An application of this concept is the SETI project. You can
participate in searching for extraterrestrial intelligence with your
home PC. More information about this project is available at the
SETI Institute.
Condor
 is software that provides resource management services for applications that
run on heterogeneous collections of workstations.
 Miron Livny at the University of Wisconsin at Madison is the director of the
Condor project, and has coined the phrase high throughput computing to
describe this process of harnessing idle workstation cycles. More
information is available at the Condor Home Page.
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.2 Comparison of Parallel Computers
1.2.1 Processors
1.2.2 Memory Organization
1.2.3 Flow of Control
1.2.4 Interconnection Networks
1.2.4.1 Bus Network
1.2.4.2 Cross-Bar Switch Network
1.2.4.3 Hypercube Network
1.2.4.4 Tree Network
1.2.4.5 Interconnection Networks Self-test
1.2.5 Summary of Parallel Computer Characteristics
1.3 Summary
Comparison of Parallel
Computers
Now you can explore the hardware components of
parallel computers:
kinds of processors
types of memory organization
flow of control
interconnection networks
You will see what is common to these parallel computers, and what makes each one of them unique.
Kinds of Processors
There are three types of parallel computers:
1. computers with a small number of powerful processors
 Typically have tens of processors.
 The cooling of these computers often requires very sophisticated and
expensive equipment, making these computers very expensive for
computing centers.
 They are general-purpose computers that perform especially well on
applications that have large vector lengths.
 The examples of this type of computer are the Cray SV1 and the
Fujitsu VPP5000.
Kinds of Processors
There are three types of parallel computers:
2. computers with a large number of less powerful processors
 Named a Massively Parallel Processor (MPP), typically have
thousands of processors.
 The processors are usually proprietary and air-cooled.
 Because of the large number of processors, the distance between the
furthest processors can be quite large requiring a sophisticated internal
network that allows distant processors to communicate with each other
quickly.
 These computers are suitable for applications with a high degree of
concurrency.
 The MPP type of computer was popular in the 1980s.
 Examples of this type of computer were the Thinking Machines CM-2
computer, and the computers made by the MassPar company.
Kinds of Processors
There are three types of parallel computers:
3. computers that are medium scale in between the two
extremes
 Typically have hundreds of processors.
 The processor chips are usually not proprietary; rather they are
commodity processors like the Pentium III.
 These are general-purpose computers that perform well on a wide
range of applications.
 The most common example of this class is the Linux Cluster.
Trends and Examples
 Processor trends:

Decade  Processor Type                   Computer Example
1970s   Pipelined, Proprietary           Cray-1
1980s   Massively Parallel, Proprietary  Thinking Machines CM2
1990s   Superscalar, RISC, Commodity     SGI Origin2000
2000s   CISC, Commodity                  Workstation Clusters

 The processors on today's commonly used parallel computers:

Computer              Processor
SGI Origin2000        MIPS RISC R12000
HP V-Class            HP PA 8200
Cray T3E              Compaq Alpha
IBM SP                IBM Power3
Workstation Clusters  Intel Pentium III, Intel Itanium
Memory Organization
The following paragraphs describe the three types of
memory organization found on parallel computers:
distributed memory
shared memory
distributed shared memory
Distributed Memory
 In distributed memory computers, the total memory is
partitioned into memory that is private to each processor.
There is a Non-Uniform Memory Access time (NUMA), which
is proportional to the distance between the two communicating
processors.
 On NUMA computers,
data is accessed the
quickest from a private
memory, while data
from the most distant
processor takes the
longest to access.
 Some examples are the
Cray T3E, the IBM SP,
and workstation clusters.
Distributed Memory
When programming distributed memory computers,
the code and the data should be structured such that
the bulk of a processor’s data accesses are to its own
private (local) memory.
This is called having
good data locality.
Today's distributed
memory computers use
message passing such
as MPI to
communicate between
processors as shown in
the following example:
Distributed Memory
One advantage of distributed memory computers is
that they are easy to scale. As the demand for
resources grows, computer centers can easily add more
memory and processors.
This is often called the LEGO block approach.
The drawback is that programming of distributed
memory computers can be quite complicated.
Shared Memory
 In shared memory computers, all processors have access to a
single pool of centralized memory with a uniform address space.
 Any processor can address any memory location at the same
speed so there is Uniform Memory Access time (UMA).
 Processors communicate with each other through the shared
memory.
 The advantages and
disadvantages of shared
memory machines are
roughly the opposite of
distributed memory
computers.
 They are easier to program
because they resemble the
programming of single
processor machines
 But they don't scale like
their distributed memory
counterparts
Distributed Shared Memory
 In Distributed Shared Memory (DSM) computers, a cluster or partition of
processors has access to a common shared memory.
 It accesses the memory of a different processor cluster in a NUMA
fashion.
 Memory is physically distributed but logically shared.
 Attention to data locality again is important.
 Distributed shared memory
computers combine the best
features of both distributed
memory computers and
shared memory computers.
 That is, DSM computers
have both the scalability of
distributed memory
computers and the ease of
programming of shared
memory computers.
 Some examples of DSM
computers are the SGI
Origin2000 and the HP V-
Class computers.
Trends and Examples
Memory organization trends:

Decade  Memory Organization        Example
1970s   Shared Memory              Cray-1
1980s   Distributed Memory         Thinking Machines CM-2
1990s   Distributed Shared Memory  SGI Origin2000
2000s   Distributed Memory         Workstation Clusters

The memory organization of today's commonly used parallel computers:

Computer              Memory Organization
SGI Origin2000        DSM
HP V-Class            DSM
Cray T3E              Distributed
IBM SP                Distributed
Workstation Clusters  Distributed
Flow of Control
When you look at the control of flow you will see
three types of parallel computers:
Single Instruction Multiple Data (SIMD)
Multiple Instruction Multiple Data (MIMD)
Single Program Multiple Data (SPMD)
Flynn’s Taxonomy
 Flynn’s Taxonomy, devised in 1972 by Michael Flynn of
Stanford University, describes computers by how streams of
instructions interact with streams of data.
 There can be single or multiple instruction streams, and there can
be single or multiple data streams. This gives rise to 4 types of
computers as shown in the diagram below:
 Flynn's taxonomy
names the 4 computer
types SISD, MISD,
SIMD and MIMD.
 Of these 4, only SIMD
and MIMD are
applicable to parallel
computers.
 Another computer
type, SPMD, is a
special case of MIMD.
SIMD Computers
 SIMD stands for Single Instruction Multiple Data.
 Each processor follows the same set of instructions.
 With different data elements being allocated to each processor.
 SIMD computers have distributed memory with typically thousands of simple
processors, and the processors run in lock step.
 SIMD computers, popular in the 1980s, are useful for fine grain data parallel
applications, such as neural networks.
 Some examples of SIMD
computers were the Thinking
Machines CM-2 computer and the
computers from the MassPar
company.
 The processors are commanded by
the global controller that sends
instructions to the processors.
 It says add, and they all add.
 It says shift to the right, and they
all shift to the right.
 The processors are like obedient
soldiers, marching in unison.
MIMD Computers
 MIMD stands for Multiple Instruction Multiple Data.
 There are multiple instruction streams with separate code segments
distributed among the processors.
 MIMD is actually a superset of SIMD, so that the processors can run the
same instruction stream or different instruction streams.
 In addition, there are multiple data streams; different data elements are
allocated to each processor.
 MIMD computers can have either distributed memory or shared memory.
 While the processors on SIMD
computers run in lock step, the
processors on MIMD computers
run independently of each other.
 MIMD computers can be used
for either data parallel or task
parallel applications.
 Some examples of MIMD
computers are the SGI
Origin2000 computer and the
HP V-Class computer.
SPMD Computers
 SPMD stands for Single Program Multiple Data.
 SPMD is a special case of MIMD.
 SPMD execution happens when a MIMD computer is programmed to
have the same set of instructions per processor.
 With SPMD computers, while the processors are running the same code
segment, each processor can run that code segment asynchronously.
 Unlike SIMD, the synchronous execution of instructions is relaxed.
 An example is the execution of an if statement on a SPMD computer.
 Because each processor computes with its own partition of the data
elements, it may evaluate the right hand side of the if statement differently
from another processor.
 One processor may take a certain branch of the if statement, and another
processor may take a different branch of the same if statement.
 Hence, even though each processor has the same set of instructions, those
instructions may be evaluated in a different order from one processor to the
next.
 The analogies we used for describing SIMD computers can be modified
for MIMD computers.
 Instead of the SIMD obedient soldiers, all marching in unison, in the
MIMD world the processors march to the beat of their own drummer.
Summary of SIMD versus MIMD
                     SIMD                      MIMD
Memory               distributed memory        distributed memory or shared memory
Code Segment         same per processor        same or different per processor
Processors Run In    lock step                 asynchronously
Data Elements        different per processor   different per processor
Applications         data parallel             data parallel or task parallel
Trends and Examples
Flow of control trends:
Decade  Flow of Control  Computer Example
1980s   SIMD             Thinking Machines CM-2
1990s   MIMD             SGI Origin2000
2000s   MIMD             Workstation Clusters

The flow of control on today's commonly used parallel computers:

Computer              Flow of Control
SGI Origin2000        MIMD
HP V-Class            MIMD
Cray T3E              MIMD
IBM SP                MIMD
Workstation Clusters  MIMD
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.2 Comparison of Parallel Computers
1.2.1 Processors
1.2.2 Memory Organization
1.2.3 Flow of Control
1.2.4 Interconnection Networks
1.2.4.1 Bus Network
1.2.4.2 Cross-Bar Switch Network
1.2.4.3 Hypercube Network
1.2.4.4 Tree Network
1.2.4.5 Interconnection Networks Self-test
1.2.5 Summary of Parallel Computer Characteristics
1.3 Summary
Interconnection Networks
 What exactly is the interconnection network?
 The interconnection network is made up of the wires and cables that define
how the multiple processors of a parallel computer are connected to each
other and to the memory units.
 The time required to transfer data is dependent upon the specific type of the
interconnection network.
 This transfer time is called the communication time.
 What network characteristics are important?
 Diameter: the maximum distance that data must travel for 2 processors to
communicate.
 Bandwidth: the amount of data that can be sent through a network
connection.
 Latency: the delay on a network while a data packet is being stored and
forwarded.
 Types of Interconnection Networks
The network topologies (geometric arrangements of the computer network
connections) are:
 Bus
 Cross-bar Switch
 Hypercube
 Tree
Interconnection Networks
 The aspects of network issues are:
 Cost
 Scalability
 Reliability
 Suitable Applications
 Data Rate
 Diameter
 Degree
 General Network Characteristics
 Some networks can be compared in terms of their degree and diameter.
 Degree: how many communicating wires are coming out of each
processor.
 A large degree is a benefit because it has multiple paths.
 Diameter: This is the distance between the two processors that are
farthest apart.
 A small diameter corresponds to low latency.
Bus Network
 Bus topology is the original coaxial cable-based Local Area
Network (LAN) topology in which the medium forms a single
bus to which all stations are attached.
 The positive aspects
 It is a mature technology that is well known and reliable.
 The cost is very low.
 It is simple to construct.
 The negative aspects
 limited data transmission rate.
 not scalable in terms of performance.
 Example: the SGI Power Challenge only scaled to 18 processors.
Cross-Bar Switch Network
 A cross-bar switch is a network that works through a switching
mechanism to access shared memory.
 it scales better than the bus network but it costs significantly more.
 The telephone system uses this type of network. An example of a
computer with this type of network is the HP V-Class.
 Here is a diagram of
a cross-bar switch
network which
shows the processors
talking through the
switchboxes to store
or retrieve data in
memory.
 There are multiple
paths for a processor
to communicate with
a certain memory.
 The switches
determine the
optimal route to take.
Hypercube Network
 In a hypercube network, the processors are connected as if
they were corners of a multidimensional cube. Each node
in an N dimensional cube is directly connected to N other
nodes.
 The fact that the number of
directly connected, "nearest
neighbor", nodes increases with
the total size of the network is
also highly desirable for a parallel
computer.
 The degree of a hypercube
network is log n and the diameter
is log n, where n is the number of
processors.
 Examples of computers with this
type of network are the CM-2,
NCUBE-2, and the Intel iPSC860.
Tree Network
 The processors are the bottom nodes of the tree. For a
processor to retrieve data, it must go up in the network and
then go back down.
 This is useful for decision making applications that can be
mapped as trees.
 The degree of a tree network is 1. The diameter of the network
is 2 log (n+1)-2 where n is the number of processors.
 The Thinking Machines CM-5
is an example of a parallel
computer with this type of
network.
 Tree networks are very suitable
for database applications
because it allows multiple
searches through the database
at a time.
Interconnection Networks
Torus Network: A mesh with wrap-around
connections in both the x and y directions.
Multistage Network: A network with more than one
networking unit.
Fully Connected Network: A network where every
processor is connected to every other processor.
Hypercube Network: Processors are connected as if
they were corners of a multidimensional cube.
Mesh Network: A network where each interior
processor is connected to its four nearest neighbors.
Interconnection Networks
Bus Based Network: Coaxial cable based LAN
topology in which the medium forms a single bus to
which all stations are attached.
Cross-bar Switch Network: A network that works
through a switching mechanism to access shared
memory.
Tree Network: The processors are the bottom nodes of
the tree.
Ring Network: Each processor is connected to two
others and the line of connections forms a circle.
Summary of Parallel Computer
Characteristics
How many processors does the computer have?
10s?
100s?
1000s?
How powerful are the processors?
what's the MHz rate
what's the MIPS rate
What's the instruction set architecture?
RISC
CISC
Summary of Parallel Computer
Characteristics
How much memory is available?
total memory
memory per processor
What kind of memory?
distributed memory
shared memory
distributed shared memory
What type of flow of control?
SIMD
MIMD
SPMD
Summary of Parallel Computer
Characteristics
What is the interconnection network?
Bus
Crossbar
Hypercube
Tree
Torus
Multistage
Fully Connected
Mesh
Ring
Hybrid
Design decisions made by some of the
major parallel computer vendors
Computer              Programming Style  OS      Processors         Memory       Flow of Control  Network
SGI Origin2000        OpenMP, MPI        IRIX    MIPS RISC R10000   DSM          MIMD             Crossbar, Hypercube
HP V-Class            OpenMP, MPI        HP-UX   HP PA 8200         DSM          MIMD             Crossbar, Ring
Cray T3E              SHMEM              Unicos  Compaq Alpha       Distributed  MIMD             Torus
IBM SP                MPI                AIX     IBM Power3         Distributed  MIMD             IBM Switch
Workstation Clusters  MPI                Linux   Intel Pentium III  Distributed  MIMD             Myrinet, Tree
Summary
 This completes our introduction to parallel computing.
 You have learned about parallelism in computer programs, and
also about parallelism in the hardware components of parallel
computers.
 In addition, you have learned about the commonly used parallel
computers, and how these computers compare to each other.
 There are many good texts which provide an introductory
treatment of parallel computing. Here are two useful references:
Highly Parallel Computing, Second Edition
George S. Almasi and Allan Gottlieb
Benjamin/Cummings Publishers, 1994
Parallel Computing Theory and Practice
Michael J. Quinn
McGraw-Hill, Inc., 1994
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
2.1 Automatic Compiler Parallelism
2.2 Data Parallelism by Hand
2.3 Mixing Automatic and Hand Parallelism
2.4 Task Parallelism
2.5 Parallelism Issues
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
How to Parallelize a Code
This chapter describes how to turn a single processor
program into a parallel one, focusing on shared
memory machines.
Both automatic compiler parallelization and
parallelization by hand are covered.
The details for accomplishing both data parallelism
and task parallelism are presented.
Automatic Compiler Parallelism
Automatic compiler parallelism enables you to use
a single compiler option and let the compiler do
the work.
The advantage of it is that it’s easy to use.
The disadvantages are:
The compiler only does loop level parallelism, not task
parallelism.
The compiler wants to parallelize every do loop in
your code. If you have hundreds of do loops this
creates way too much parallel overhead.
Automatic Compiler Parallelism
To use automatic compiler parallelism on a Linux system with the Intel compilers, specify the following:
ifort -parallel -O2 ... prog.f
The compiler creates conditional code that will run with any number of threads.
Specify the number of threads and make sure you still get the right answers with setenv:
setenv OMP_NUM_THREADS 4
a.out > results
Data Parallelism by Hand
 First identify the loops that use most of the CPU time (the
Profiling lecture describes how to do this).
 By hand, insert into the code OpenMP directive(s) just before the
loop(s) you want to make parallel.
 Some code modifications may be needed to remove data
dependencies and other inhibitors of parallelism.
 Use your knowledge of the code and data to assist the compiler.
 For the SGI Origin2000 computer, insert into the code an
OpenMP directive just before the loop that you want to make
parallel.
!$OMP PARALLEL DO
do i=1,n
  … lots of computation ...
end do
!$OMP END PARALLEL DO
Data Parallelism by Hand
 Compile with the mp compiler option.
f90 -mp ... prog.f
 As before, the compiler generates conditional code that will run with any number of threads.
 If you want to rerun your program with a different number of threads, you do not need to recompile, just re-specify the setenv command.
setenv OMP_NUM_THREADS 8
a.out > results2
 The setenv command can be placed anywhere before the a.out command.
 The setenv command must be typed exactly as indicated. If you have a typo, you will not receive a warning or error message. To make sure that the setenv command is specified correctly, type:
setenv
 It produces a listing of your environment variable settings.
Mixing Automatic and Hand
Parallelism
You can have one source file parallelized
automatically by the compiler, and another source file
parallelized by hand. Suppose you split your code into
two files named prog1.f and prog2.f.
f90 -c -apo … prog1.f   (automatic // for prog1.f)
f90 -c -mp … prog2.f    (by hand // for prog2.f)
f90 prog1.o prog2.o     (creates one executable)
a.out > results         (runs the executable)
Task Parallelism
 You can accomplish task parallelism as follows:
!$OMP PARALLEL
!$OMP SECTIONS
… lots of computation in part A …
!$OMP SECTION
… lots of computation in part B ...
!$OMP SECTION
… lots of computation in part C ...
!$OMP END SECTIONS
!$OMP END PARALLEL
 Compile with the mp compiler option.
f90 -mp … prog.f
 Use the setenv command to specify the number of threads.
setenv OMP_NUM_THREADS 3
a.out > results
Parallelism Issues
There are some issues to consider when
parallelizing a program.
Should data parallelism or task parallelism be
used?
Should automatic compiler parallelism or
parallelism by hand be used?
Which loop in a nested loop situation should be
the one that becomes parallel?
How many threads should be used?
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
3.1 Recompile
3.2 Word Length
3.3 Compiler Options for Debugging
3.4 Standards Violations
3.5 IEEE Arithmetic Differences
3.6 Math Library Differences
3.7 Compute Order Related Differences
3.8 Optimization Level Too High
3.9 Diagnostic Listings
3.10 Further Information
Porting Issues
 In order to run a computer program that presently runs on a
workstation, a mainframe, a vector computer, or another parallel
computer, on a new parallel computer you must first "port" the
code.
 After porting the code, it is important to have some benchmark
results you can use for comparison.
 To do this, run the original program on a well-defined dataset, and save
the results from the old or “baseline” computer.
 Then run the ported code on the new computer and compare the results.
 If the results are different, don't automatically assume that the new
results are wrong – they may actually be better. There are several
reasons why this might be true, including:
 Precision Differences - the new results may actually be more accurate than the
baseline results.
 Code Flaws - porting your code to a new computer may have uncovered a hidden
flaw in the code that was already there.
 Detection methods for finding code flaws, solutions, and
workarounds are provided in this lecture.
Recompile
 Some codes just need to be recompiled to get accurate results.
 The compilers available on the NCSA computer platforms are
shown in the following table:
Language                  SGI Origin2000  IA-32 Linux                     IA-64 Linux
                          MIPSpro         Intel   GNU   Portland Group    Intel   GNU   Portland Group
Fortran 77                f77             ifort   g77   pgf77             ifort   g77
Fortran 90                f90             ifort         pgf90             ifort
Fortran 95                f95             ifort                           ifort
High Performance Fortran                                pghpf                           pghpf
C                         cc              icc     gcc   pgcc              icc     gcc
C++                       CC              icpc    g++   pgCC              icpc    g++
Word Length
 Code flaws can occur when you are porting your code to a
different word length computer.
 For C, the size of an integer variable differs depending on the
machine and how the variable is generated. On the IA32 and
IA64 Linux clusters, the size of an integer variable is 4 and 8
bytes, respectively. On the SGI Origin2000, the corresponding
value is 4 bytes if the code is compiled with the –n32 flag, and 8
bytes if compiled without any flags or explicitly with the –64
flag.
 For Fortran, the SGI MIPSpro and Intel compilers contain the
following flags to set default variable size.
-in where n is a number: set the default INTEGER to INTEGER*n.
The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux
clusters.
-rn where n is a number: set the default REAL to REAL*n. The
value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux
clusters.
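As an illustration of these flags (a hedged example; combine them as your port requires), compiling with
f90 -i8 -r8 ... prog.f
promotes the default INTEGER to INTEGER*8 and the default REAL to REAL*8, which can help match the word length of the baseline computer.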
Compiler Options for Debugging
On the SGI Origin2000, the MIPSpro compilers
include debugging options via the –DEBUG:group.
The syntax is as follows:
-DEBUG:option1[=value1]:option2[=value2]...
Two examples are:
Array-bound checking: check for subscripts out of range
at runtime.
-DEBUG:subscript_check=ON
Force all un-initialized stack, automatic and dynamically
allocated variables to be initialized.
-DEBUG:trap_uninitialized=ON
Compiler Options for Debugging
On the IA32 Linux cluster, the Fortran compiler is
equipped with the following –C flags for runtime
diagnostics:
-CA: pointers and allocatable references
-CB: array and subscript bounds
-CS: consistent shape of intrinsic procedure
-CU: use of uninitialized variables
-CV: correspondence between dummy and
actual arguments
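For instance, a minimal sketch of turning on two of these checks with the Intel Fortran compiler (flag combination shown only for illustration) is:
ifort -CB -CU -o prog prog.f
which adds array-bounds checking and detection of uninitialized variables to the executable.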
Standards Violations
Code flaws can occur when the program has non-
ANSI standard Fortran coding.
ANSI standard Fortran is a set of rules for compiler
writers that specify, for example, the value of the do loop
index upon exit from the do loop.
Standards Violations Detection
To detect standards violations on the SGI Origin2000
computer use the -ansi flag.
This option generates a listing of warning messages for
the use of non-ANSI standard coding.
On the Linux clusters, the -ansi[-] flag enables/disables
assumption of ANSI conformance.
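A minimal example on the SGI Origin2000 (assuming the f90 command used elsewhere in these slides):
f90 -ansi ... prog.f
This produces warning messages wherever non-ANSI standard coding is used.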
IEEE Arithmetic Differences
 Code flaws occur when the baseline computer conforms to
the IEEE arithmetic standard and the new computer does not.
The IEEE Arithmetic Standard is a set of rules governing
arithmetic roundoff and overflow behavior.
For example, it prohibits the compiler writer from replacing x/y
with x*recip(y) since the two results may differ slightly for
some operands. You can make your program strictly conform to
the IEEE standard.
 To make your program conform to the IEEE Arithmetic
Standards on the SGI Origin2000 computer use:
f90 -OPT:IEEE_arithmetic=n ... prog.f
where n is 1, 2, or 3.
 This option specifies the level of conformance to the IEEE
standard where 1 is the most stringent and 3 is the most
liberal.
 On the Linux clusters, the Intel compilers can achieve
conformance to IEEE standard at a stringent level with the –
mp flag, or a slightly relaxed level with the –mp1 flag.
Math Library Differences
 Most high-performance parallel computers are equipped
with vendor-supplied math libraries.
 On the SGI Origin2000 platform, there are SGI/Cray
Scientific Library (SCSL) and Complib.sgimath.
SCSL contains Level 1, 2, and 3 Basic Linear Algebra
Subprograms (BLAS), LAPACK and Fast Fourier Transform
(FFT) routines.
SCSL can be linked with –lscs for the serial version, or –mp –
lscs_mp for the parallel version.
The complib library can be linked with –lcomplib.sgimath for
the serial version, or –mp –lcomplib.sgimath_mp for the
parallel version.
 The Intel Math Kernel Library (MKL) contains the
complete set of functions from BLAS, the extended BLAS
(sparse), the complete set of LAPACK routines, and Fast
Fourier Transform (FFT) routines.
Math Library Differences
On the IA32 Linux cluster, the libraries to link to are:
For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide –lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/32 –lmkl_lapack -lmkl
-lguide –lpthread
When calling MKL routines from C/C++ programs, you
also need to link with –lF90.
On the IA64 Linux cluster, the corresponding libraries
are:
For BLAS: -L/usr/local/intel/mkl/lib/64 –lmkl_itp –lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/64 –lmkl_lapack –lmkl_itp
–lpthread
When calling MKL routines from C/C++ programs, you
also need to link with -lPEPCF90 –lCEPCF90 –lF90 -lintrins
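Putting the IA32 pieces above together, a complete compile-and-link line for a Fortran program that calls LAPACK routines from MKL might look like the following (paths and library names are taken from the lists above; adjust for your installation):
ifort prog.f -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread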
Compute Order Related Differences
 Code flaws can occur because of the non-deterministic
computation of data elements on a parallel computer. The compute
order in which the threads will run cannot be guaranteed.
 For example, in a data parallel program, the 50th index of a do loop
may be computed before the 10th index of the loop. Furthermore, the
threads may run in one order on the first run, and in another order on
the next run of the program.
 Note: If your algorithm depends on data being compared in a specific
order, your code is inappropriate for a parallel computer.
 Use the following method to detect compute order related
differences:
 If your loop looks like
DO I = 1, N
change it to
DO I = N, 1, -1
The results should not change if the iterations are independent.
Optimization Level Too High
 Code flaws can occur when the optimization level has been set
too high thus trading speed for accuracy.
The compiler reorders and optimizes your code based on
assumptions it makes about your program. This can sometimes
cause answers to change at higher optimization level.
 Setting the Optimization Level
Both the SGI Origin2000 computer and the IBM Linux clusters provide
Level 0 (no optimization) to Level 3 (most aggressive)
optimization, using the –O{0,1,2, or 3} flag. One should bear in
mind that Level 3 optimization may carry out loop
mind that Level 3 optimization may carry out loop
transformations that affect the correctness of calculations.
Checking correctness and precision of calculation is highly
recommended when –O3 is used.
For example on the Origin 2000
 f90 -O0 … prog.f turns off all optimizations.
Optimization Level Too High
Isolating Optimization Level Problems
You can sometimes isolate optimization level problems
using the method of binary chop.
 To do this, divide your program prog.f into halves. Name them
prog1.f and prog2.f.
 Compile the first half with -O0 and the second half with -O3:
f90 -c -O0 prog1.f
f90 -c -O3 prog2.f
f90 prog1.o prog2.o
a.out > results
 If the results are correct, the optimization problem lies in prog1.f
 Next divide prog1.f into halves. Name them prog1a.f and
prog1b.f
 Compile prog1a.f with -O0 and prog1b.f with -O3:
f90 -c -O0 prog1a.f
f90 -c -O3 prog1b.f
f90 prog1a.o prog1b.o prog2.o
a.out > results
 Continue in this manner until you have isolated the section of
code that is producing incorrect results.
Diagnostic Listings
The SGI Origin 2000 compiler will generate
all kinds of diagnostic warnings and
messages, but not always by default. Some
useful listing options are:
f90 -listing ...
f90 -fullwarn ...
f90 -showdefaults ...
f90 -version ...
f90 -help ...
Further Information
SGI
man f77/f90/cc
man debug_group
man math
man complib.sgimath
MIPSpro 64-Bit Porting and Transition Guide
Online Manuals
Linux clusters pages
ifort/icc/icpc –help (IA32, IA64, Intel64)
Intel Fortran Compiler for Linux
Intel C/C++ Compiler for Linux
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
4.1 Aggressive Compiler Options
4.2 Compiler Optimizations
4.3 Vendor Tuned Code
4.4 Further Information
Scalar Tuning
If you are not satisfied with the performance of your
program on the new computer, you can tune the scalar
code to decrease its runtime.
This chapter describes many of these techniques:
The use of the most aggressive compiler options
The improvement of loop unrolling
The use of subroutine inlining
The use of vendor supplied tuned code
The detection of cache problems, and their solution are
presented in the Cache Tuning chapter.
Aggressive Compiler Options
For the SGI Origin2000 and the Linux clusters, the main
optimization switch is
-On where n ranges from 0 to 3.
-O0 turns off all optimizations.
-O1 and -O2 do beneficial optimizations that will
not affect the accuracy of results.
-O3 specifies the most aggressive optimizations. It
takes the most compile time, may produce
changes in accuracy, and turns on software
pipelining.
Aggressive Compiler Options
It should be noted that –O3 might carry out loop
transformations that produce incorrect results in some
codes.
It is recommended that one compare the answer obtained
from Level 3 optimization with one obtained from a lower-
level optimization.
On the SGI Origin2000 and the Linux clusters, –O3 can be
used together with –OPT:IEEE_arithmetic=n (n=1,2, or 3)
and –mp (or –mp1), respectively, to enforce operation
conformance to IEEE standard at different levels.
On the SGI Origin2000, the option
-Ofast=ip27
is also available. This option specifies the most aggressive
optimizations that are specifically tuned for the Origin2000
computer.
Agenda
 1 Parallel Computing Overview
 2 How to Parallelize a Code
 3 Porting Issues
 4 Scalar Tuning
4.1 Aggressive Compiler Options
4.2 Compiler Optimizations
 4.2.1 Statement Level
 4.2.2 Block Level
 4.2.3 Routine Level
 4.2.4 Software Pipelining
 4.2.5 Loop Unrolling
 4.2.6 Subroutine Inlining
 4.2.7 Optimization Report
 4.2.8 Profile-guided Optimization (PGO)
4.3 Vendor Tuned Code
4.4 Further Information
Compiler Optimizations
The various compiler optimizations can be classified
as follows:
Statement Level Optimizations
Block Level Optimizations
Routine Level Optimizations
Software Pipelining
Loop Unrolling
Subroutine Inlining
Each of these is described in the following sections.
Statement Level
Constant Folding
Replace simple arithmetic operations on constants with
the pre-computed result.
      y = 5+7 becomes y = 12
Short Circuiting
Avoid executing parts of conditional tests that are not
necessary.
      if (I.eq.J .or. I.eq.K) expression
      when I=J immediately compute the expression
Register Assignment
Put frequently used variables in registers.
Block Level
Dead Code Elimination
Remove unreachable code and code that is never
executed or used.
Instruction Scheduling
Reorder the instructions to improve memory pipelining.
Routine Level
 Strength Reduction
Replace expressions in a loop with an expression that takes
fewer cycles.
 Common Subexpression Elimination
Expressions that appear more than once, are computed once,
and the result is substituted for each occurrence of the
expression.
 Constant Propagation
Compile time replacement of variables with constants.
 Loop Invariant Elimination
Expressions inside a loop that don't change with the do loop
index are moved outside the loop.
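As a small illustration of loop invariant elimination (the variable names here are made up for illustration, not taken from any particular code):

Before:
do i = 1, n
  x(i) = a(i) * (c/d)
end do

After (what the compiler effectively produces, since c/d does not change with i):
t = c/d
do i = 1, n
  x(i) = a(i) * t
end do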
Software Pipelining
Software pipelining allows the mixing of operations
from different loop iterations in each iteration of the
hardware loop. It is used to get the maximum work
done per clock cycle.
Note: On the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.
Loop Unrolling
 The loops stride (or step) value is increased, and the body of the
loop is replicated. It is used to improve the scheduling of the
loop by giving a longer sequence of straight line code. An
example of loop unrolling follows:
Original Loop:

do I = 1, 99
  c(I) = a(I) + b(I)
enddo

Unrolled Loop:

do I = 1, 99, 3
  c(I) = a(I) + b(I)
  c(I+1) = a(I+1) + b(I+1)
  c(I+2) = a(I+2) + b(I+2)
enddo

There is a limit to the amount of unrolling that can take place because there are a limited number of registers.
 On the SGI Origin2000, loops are unrolled to a level of 8 by
default. You can unroll to a level of 12 by specifying:
f90 -O3 -OPT:unroll_times_max=12 ... prog.f
 On the IA32 Linux cluster, the corresponding flag is –unroll and
-unroll0 for unrolling and no unrolling, respectively.
Subroutine Inlining
Subroutine inlining replaces a call to a subroutine
with the body of the subroutine itself.
One reason for using subroutine inlining is that
when a subroutine is called inside a do loop that
has a huge iteration count, subroutine inlining
may be more efficient because it cuts down on
loop overhead.
However, the chief reason for using it is that do
loops that contain subroutine calls may not
parallelize.
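As a small sketch of the idea (the subroutine name and loop body are hypothetical, shown only for illustration):

Before inlining:
do i = 1, n
  call update(y(i), a, x(i))
end do

After inlining, the call is replaced by the body of update:
do i = 1, n
  y(i) = y(i) + a*x(i)
end do

The loop no longer contains a subroutine call, so the compiler is free to parallelize it.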
Subroutine Inlining
 On the SGI Origin2000 computer, there are several options to
invoke inlining:
Inline all routines except those specified to -INLINE:never:
f90 -O3 -INLINE:all … prog.f
Inline no routines except those specified to -INLINE:must:
f90 -O3 -INLINE:none … prog.f
Specify a list of routines to inline at every call:
f90 -O3 -INLINE:must=subrname … prog.f
Specify a list of routines never to inline:
f90 -O3 -INLINE:never=subrname … prog.f
 On the Linux clusters, the following flags can invoke function inlining:
-ip: inline function expansion for calls defined within the current source file
-ipo: inline function expansion for calls defined in separate files
Optimization Report
 Intel 9.x and later compilers can generate reports that
provide useful information on optimization done on
different parts of your code.
To generate such optimization reports in a file filename, add
the flag -opt-report-file filename.
If you have a lot of source files to process simultaneously, and
you use a makefile to compile, you can also use make's
"suffix" rules to have optimization reports produced
automatically, each with a unique name. For example,
.f.o:
ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f
creates optimization reports that are named identically to the
original Fortran source but with the suffix ".f" replaced by
".opt".
Optimization Report
 To help developers and performance analysts navigate through the
usually lengthy optimization reports, the NCSA program OptView is
designed to provide an easy-to-use and intuitive interface that allows
the user to browse through their own source code, cross-referenced
with the optimization reports.
 OptView is installed on NCSA's IA64 Linux cluster under the
directory /usr/apps/tools/bin. You can either add that directory to your
UNIX PATH or you can invoke optview using an absolute path
name. You'll need to be using the X-Window system and to have set
your DISPLAY environment variable correctly for OptView to work.
 Optview can provide a quick overview of which loops in a source
code or source codes among multiple files are highly optimized and
which might need further work. For a detailed description of the use of
OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/
Profile-guided Optimization
(PGO)
Profile-guided optimization allows Intel compilers to
use valuable runtime information to make better
decisions about function inlining and interprocedural
optimizations to generate faster codes. Its methodology
is illustrated as follows:
Profile-guided Optimization
(PGO)
 First, you do an instrumented compilation by adding the -prof-
gen flag in the compile process:
icc -prof-gen -c a1.c a2.c a3.c
icc a1.o a2.o a3.o -lirc
 Then, you run the program with a representative set of data to
generate the dynamic information files given by the .dyn suffix.
 These files contain valuable runtime information for the
compiler to do better function inlining and other optimizations.
 Finally, the code is recompiled again with the -prof-use flag to
use the runtime information.
icc -prof-use -ipo -c a1.c a2.c a3.c
 A profile-guided optimized executable is generated.
Vendor Tuned Code
Vendor math libraries have codes that are optimized
for their specific machine.
On the SGI Origin2000 platform, Complib.sgimath and
SCSL are available.
On the Linux clusters, Intel MKL is available. Ways to
link to these libraries are described in Section 3 - Porting
Issues.
Further Information
 SGI IRIX man and www pages
 man opt
 man lno
 man inline
 man ipa
 man perfex
 Performance Tuning for the Origin2000 at
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
 Linux clusters help and www pages
 ifort/icc/icpc –help (Intel)
 http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
 http://perfsuite.ncsa.uiuc.edu/OptView/
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
5.1 Sequential Code Limitation
5.2 Parallel Overhead
5.3 Load Balance
5.3.1 Loop Schedule Types
5.3.2 Chunk Size
Parallel Code Tuning
This chapter describes several of the most common
techniques for parallel tuning, the type of programs
that benefit, and the details for implementing them.
The majority of this chapter deals with improving load
balancing.
Sequential Code Limitation
 Sequential code is a part of the program that cannot be run
with multiple processors. Some reasons why it cannot be
made data parallel are:
The code is not in a do loop.
The do loop contains a read or write.
The do loop contains a dependency.
The do loop has an ambiguous subscript.
The do loop has a call to a subroutine or a reference to a
function subprogram.
 Sequential Code Fraction
As shown by Amdahl’s Law, if the sequential fraction is too
large, there is a limitation on speedup. If you think too much
sequential code is a problem, you can calculate the sequential
fraction of code using the Amdahl’s Law formula.
Sequential Code Limitation
 Measuring the Sequential Code Fraction
Decide how many processors to use, this is p.
Run and time the program with 1 processor to give T(1).
Run and time the program with p processors to give T(p).
Form a ratio of the 2 timings T(1)/T(p), this is SP.
Substitute SP and p into the Amdahl’s Law formula:
 f=(1/SP-1/p)/(1-1/p), where f is the fraction of sequential code.
Solve for f, this is the fraction of sequential code.
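For example, suppose you run with p = 4 processors and measure a speedup of SP = 3. Substituting into the formula gives f = (1/3 - 1/4) / (1 - 1/4) = (1/12) / (3/4) = 1/9, so roughly 11% of the runtime is spent in sequential code.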
 Decreasing the Sequential Code Fraction
The compilation optimization reports list which loops could
not be parallelized and why. You can use this report as a guide
to improve performance on do loops by:
 Removing dependencies
 Removing I/O
 Removing calls to subroutines and function subprograms
Parallel Overhead
 Parallel overhead is the processing time spent
creating threads
spin/blocking threads
starting and ending parallel regions
synchronizing at the end of parallel regions
When the computational work done by the parallel processes
is too small, the overhead time needed to create and control
the parallel processes can be disproportionately large limiting
the savings due to parallelism.
 Measuring Parallel Overhead
To get a rough under-estimate of parallel overhead:
 Run and time the code using 1 processor.
 Parallelize the code.
 Run and time the parallel code using only 1 processor.
 Subtract the 2 timings.
Parallel Overhead
 Reducing Parallel Overhead
To reduce parallel overhead:
 Don't parallelize all the loops.
 Don't parallelize small loops.
To benefit from parallelization, a loop needs about 1000 floating
point operations or 500 statements in the loop. You can use the IF
modifier in the OpenMP directive to control when loops are
parallelized.
!$OMP PARALLEL DO IF(n > 500)
do i=1,n
... body of loop ...
end do
!$OMP END PARALLEL DO
Use task parallelism instead of data parallelism. It doesn't generate
as much parallel overhead and often more code runs in parallel.
Don't use more threads than you need.
Parallelize at the highest level possible.
Load Balance
 Load balance
is the even assignment of subtasks to processors so as to keep
each processor busy doing useful work for as long as possible.
Load balance is important for speedup because the end of a do
loop is a synchronization point where threads need to catch up
with each other.
If processors have different work loads, some of the
processors will idle while others are still working.
 Measuring Load Balance
On the SGI Origin, to measure load balance, use the perfex
tool which is a command line interface to the R10000
hardware counters. The command
perfex -e16 -mp a.out > results
reports per thread cycle counts. Compare the cycle counts to
determine load balance problems. The master thread (thread 0)
always uses more cycles than the slave threads. If the counts
are vastly different, it indicates load imbalance.
Load Balance
For Linux systems, the thread CPU times can be
compared with ps. A thread with unusually high or low
time compared to the others may not be working
efficiently [high cputime could be the result of a thread
spinning while waiting for other threads to catch up].
ps uH
Improving Load Balance
To improve load balance, try changing the way that loop
iterations are allocated to threads by
 changing the loop schedule type
 changing the chunk size
 These methods are discussed in the following sections.
Loop Schedule Types
On the SGI Origin2000 computer, 4 different loop schedule
types can be specified by an OpenMP directive. They are:
Static
Dynamic
Guided
Runtime
If you don't specify a schedule type, the default will be
used.
Default Schedule Type
The default schedule type allocates 20 iterations on 4 threads
as:
Loop Schedule Types
 Static Schedule Type
The static schedule type is used when some of the iterations do
more work than others. With the static schedule type, iterations
are allocated in a round-robin fashion to the threads.

 An Example
Suppose you are computing on the
upper triangle of a 100 x 100
matrix, and you use 2 threads,
named t0 and t1. With default
scheduling, workloads are uneven.
Loop Schedule Types
Whereas with static scheduling, the columns of the
matrix are given to the threads in a round robin
fashion, resulting in better load balance.
Loop Schedule Types
 Dynamic Schedule Type
The iterations are dynamically allocated to threads at runtime.
Each thread is given a chunk of iterations. When a thread
finishes its work, it goes into a critical section where it’s given
another chunk of iterations to work on.
This type is useful when you don’t know the iteration count or
work pattern ahead of time. Dynamic gives good load balance,
but at a high overhead cost.
 Guided Schedule Type
The guided schedule type is dynamic scheduling that starts
with large chunks of iterations and ends with small chunks of
iterations. That is, the number of iterations given to each
thread depends on the number of iterations remaining. The
guided schedule type reduces the number of entries into the
critical section, compared to the dynamic schedule type.
Guided gives good load balancing at a low overhead cost.
Chunk Size
 The word chunk refers to a grouping of iterations. Chunk size
means how many iterations are in the grouping. The static
and dynamic schedule types can be used with a chunk size. If
a chunk size is not specified, then the chunk size is 1.
 Suppose you specify a chunk size of 2 with the static
schedule type. Then 20 iterations are allocated on 4 threads:
 The schedule type and chunk size are specified as follows:
!$OMP PARALLEL DO SCHEDULE(type, chunk)
... body of loop ...
!$OMP END PARALLEL DO
 Where type is STATIC, or DYNAMIC, or GUIDED and
chunk is any positive integer.
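A minimal C/OpenMP sketch (not from the original slides) of the same controls, using a
static schedule with a chunk size of 2; the array and loop body are made up for
illustration:

#include <stdio.h>
#include <omp.h>

#define N 20

int main(void)
{
    double work[N];

    /* Static schedule with chunk size 2: iterations are handed out to
       the threads in round-robin groups of two, as in the text above. */
    #pragma omp parallel for schedule(static, 2)
    for (int i = 0; i < N; i++) {
        work[i] = 2.0 * i;                                   /* body of loop */
        printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}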
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information
Timing and Profiling
Now that your program has been ported to the new
computer, you will want to know how fast it runs.
This chapter describes how to measure the speed of a
program using various timing routines.
The chapter also covers how to determine which parts
of the program account for the bulk of the
computational load so that you can concentrate your
tuning efforts on those computationally intensive parts
of the program.
Timing
 In the following sections, we’ll discuss timers and review the
profiling tools ssrun and prof on the Origin and vprof and
gprof on the Linux Clusters. The specific timing functions
described are:
Timing a section of code
FORTRAN
 etime, dtime, cpu_time for CPU time
 time and f_time for wallclock time
C
 clock for CPU time
 gettimeofday for wallclock time
Timing an executable
 time a.out
Timing a batch run
 busage
 qstat
 qhist
CPU Time
etime
A section of code can be timed using etime.
It returns the elapsed CPU time in seconds since the
program started.

real*4 tarray(2),time1,time2,timeres
… beginning of program
time1=etime(tarray)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=etime(tarray)
timeres=time2-time1
CPU Time
dtime
A section of code can also be timed using dtime.
It returns the elapsed CPU time in seconds since the last
call to dtime.

real*4 tarray(2),timeres
… beginning of program
timeres=dtime(tarray)
… start of section of code to be timed
… lots of computation …
end of section of code to be timed
timeres=dtime(tarray)
… rest of program
CPU Time
The etime and dtime Functions
 User time.
This is returned as the first element of tarray.
It’s the CPU time spent executing user code.
 System time.
This is returned as the second element of tarray.
It’s the time spent executing system calls on behalf of your
program.
 Sum of user and system time.
This is the function value that is returned.
It’s the time that is usually reported.
 Metric.
Timings are reported in seconds.
Timings are accurate to 1/100th of a second.
CPU Time
Timing Comparison Warnings
 For the SGI computers:
The etime and dtime functions return the MAX time over all
threads for a parallel program.
This is the time of the longest thread, which is usually the
master thread.
 For the Linux Clusters:
The etime and dtime functions are contained in the VAX
compatibility library of the Intel FORTRAN Compiler.
To use this library include the compiler flag -Vaxlib.

 Another warning: Do not put calls to etime and dtime


inside a do loop. The overhead is too large.
CPU Time
cpu_time
 The cpu_time routine is available only on the Linux clusters
as it is a component of the Intel FORTRAN compiler library.
 It provides substantially higher resolution and has
substantially lower overhead than the older etime and dtime
routines.
 It can be used as an elapsed timer.

real*8 time1, time2, timeres


… beginning of program
call cpu_time (time1)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
call cpu_time(time2)
timeres=time2-time1
… rest of program
CPU Time
clock
 For C programmers, one can call the cpu_time routine
through a FORTRAN wrapper, or use the standard C library
function clock to determine elapsed CPU time.

#include <time.h>
static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
double time1, time2, timeres;

time1 = clock()*iCPS;

/* do some work */

time2 = clock()*iCPS;
timeres = time2 - time1;
Wall clock Time
time
For the Origin, the function time returns the time since
00:00:00 GMT, Jan. 1, 1970.
It is a means of getting the elapsed wall clock time.
The wall clock time is reported in integer seconds.
external time integer*4 time1,time2,timeres

… beginning of program
time1=time( )
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=time( )
timeres=time2 - time1
Wall clock Time
f_time
 For the Linux clusters, the appropriate FORTRAN function for
elapsed time is f_time.
integer*8 f_time
external f_time
integer*8 time1,time2,timeres
… beginning of program
time1=f_time()
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=f_time()
timeres=time2 - time1

 As above for etime and dtime, the f_time function is in the VAX
compatibility library of the Intel FORTRAN Compiler. To use
this library include the compiler flag -Vaxlib.
Wall clock Time
gettimeofday
 For C programmers, wallclock time can be obtained by using the
very portable routine gettimeofday.
#include <stddef.h> /* definition of NULL */
#include <sys/time.h> /* definition of timeval struct and
prototyping of gettimeofday */
double t1,t2,elapsed;
struct timeval tp;
int rtn;
....
....
rtn=gettimeofday(&tp, NULL);
t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
....
/* do some work */
....
rtn=gettimeofday(&tp, NULL);
t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
elapsed=t2-t1;
Timing an Executable
To time an executable (if using a csh or tcsh shell,
explicitly call /usr/bin/time):
 time …options… a.out
 where options can be '-p' for a simple output or '-f format',
which allows the user to display more than just
time-related information.
Consult the man pages on the time command for
format options.
Timing a Batch Job
Time of a batch job running or completed.

Origin
busage jobid

Linux clusters
qstat jobid # for a running job
qhist jobid # for a completed job
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information
Profiling
Profiling determines where a program spends its time.
It detects the computationally intensive parts of the code.
 Use profiling when you want to focus attention and
optimization efforts on those loops that are responsible
for the bulk of the computational load.
 Most codes follow the 90-10 Rule.
That is, 90% of the computation is done in 10% of the
code.
Profiling Tools
Profiling Tools on the Origin
On the SGI Origin2000 computer there are profiling tools named
ssrun and prof.
Used together they do profiling, or what is called hot spot
analysis.
They are useful for generating timing profiles.
 ssrun
The ssrun utility collects performance data for an executable that
you specify.
The performance data is written to a file named
"executablename.exptype.id".
 prof
The prof utility analyzes the data file created by ssrun and
produces a report.
 Example
ssrun -fpcsamp a.out
prof -h a.out.fpcsamp.m12345 > prof.list
Profiling Tools
Profiling Tools on the Linux Clusters
 On the Linux clusters the profiling tools are still maturing.
There are currently several efforts to produce tools comparable
to the ssrun, prof and perfex tools.
 gprof
Basic profiling information can be generated using the OS utility
gprof.
First, compile the code with the compiler flags -qp -g for the Intel
compiler (-g on the Intel compiler does not change the
optimization level) or -pg for the GNU compiler.
Second, run the program.
Finally, analyze the resulting gmon.out file using the gprof utility:
gprof executable gmon.out
 efc -O -qp -g -o foo foo.f
 ./foo
 gprof foo gmon.out
Profiling Tools
Profiling Tools on the Linux Clusters
vprof
On the IA32 platform there is a utility called vprof that
provides performance information using the PAPI
instrumentation library.
To instrument the whole application requires
recompiling and linking to vprof and PAPI libraries.
 setenv VMON PAPI_TOT_CYC
 ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o \
    -L/usr/apps/tools/lib -lvmon -lpapi
./md
/usr/apps/tools/vprof/bin/cprof -e md vmon.out
Profile Listings
Profile Listings on the Origin
Prof Output First Listing
Cycles % Cum% Secs Proc
-------- ----- ----- ---- ----
42630984 58.47 58.47 0.57 VSUB
6498294 8.91 67.38 0.09 PFSOR
6141611 8.42 75.81 0.08 PBSOR
3654120 5.01 80.82 0.05 PFSOR1
2615860 3.59 84.41 0.03 VADD
1580424 2.17 86.57 0.02 ITSRCG
1144036 1.57 88.14 0.02 ITSRSI
886044 1.22 89.36 0.01 ITJSI
861136 1.18 90.54 0.01 ITJCG
 The first listing gives the number of cycles executed in
each procedure (or subroutine). The procedures are
listed in descending order of cycle count.
Profile Listings
Profile Listings on the Origin
Prof Output Second Listing
Cycles % Cum% Line Proc
-------- ----- ----- ---- ----
36556944 50.14 50.14 8106 VSUB
5313198 7.29 57.43 6974 PFSOR
4968804 6.81 64.24 6671 PBSOR
2989882 4.10 68.34 8107 VSUB
2564544 3.52 71.86 7097 PFSOR1
1988420 2.73 74.59 8103 VSUB
1629776 2.24 76.82 8045 VADD
994210 1.36 78.19 8108 VSUB
969056 1.33 79.52 8049 VADD
483018 0.66 80.18 6972 PFSOR
The second listing gives the number of cycles per
source code line.
The lines are listed in descending order of cycle count.
Profile Listings
Profile Listings on the Linux Clusters
 gprof Output First Listing
Flat profile:

Each sample counts as 0.000976562 seconds.


% cumulative self self total
time seconds seconds calls us/call us/call name
----- ---------- ------- ----- ------- ------- -----------
38.07 5.67 5.67 101 56157.18 107450.88 compute_
34.72 10.84 5.17 25199500 0.21 0.21 dist_
25.48 14.64 3.80 SIND_SINCOS
1.25 14.83 0.19 sin
0.37 14.88 0.06 cos
0.05 14.89 0.01 50500 0.15 0.15 dotr8_
0.05 14.90 0.01 100 68.36 68.36 update_
0.01 14.90 0.00 f_fioinit
0.01 14.90 0.00 f_intorange
0.01 14.90 0.00 mov
0.00 14.90 0.00 1 0.00 0.00 initialize_
 The listing gives a 'flat' profile of functions and routines
encountered, sorted by 'self seconds', which is the number
of seconds accounted for by this function alone.
Profile Listings
Profile Listings on the Linux Clusters
 gprof Output Second Listing
Call graph:

index % time self children called name


----- ------ ---- -------- ---------------- ----------------
[1] 72.9 0.00 10.86 main [1]
5.67 5.18 101/101 compute_ [2]
0.01 0.00 100/100 update_ [8]
0.00 0.00 1/1 initialize_ [12]
---------------------------------------------------------------------
5.67 5.18 101/101 main [1]
[2] 72.8 5.67 5.18 101 compute_ [2]
5.17 0.00 25199500/25199500 dist_ [3]
0.01 0.00 50500/50500 dotr8_ [7]
---------------------------------------------------------------------
5.17 0.00 25199500/25199500 compute_ [2]
[3] 34.7 5.17 0.00 25199500 dist_ [3]
---------------------------------------------------------------------
<spontaneous>
[4] 25.5 3.80 0.00 SIND_SINCOS [4]
 The second listing gives a 'call-graph' profile of functions and routines
encountered. The definitions of the columns are specific to the line in question.
Detailed information is contained in the full output from gprof.
Profile Listings
Profile Listings on the Linux Clusters
 vprof Listing
 Columns correspond to the following events:
 PAPI_TOT_CYC - Total cycles (1956 events)
File Summary:
100.0% /u/ncsa/gbauer/temp/md.f

Function Summary:
84.4% compute
15.6% dist

Line Summary:
67.3% /u/ncsa/gbauer/temp/md.f:106
13.6% /u/ncsa/gbauer/temp/md.f:104
9.3% /u/ncsa/gbauer/temp/md.f:166
2.5% /u/ncsa/gbauer/temp/md.f:165
1.5% /u/ncsa/gbauer/temp/md.f:102
1.2% /u/ncsa/gbauer/temp/md.f:164
0.9% /u/ncsa/gbauer/temp/md.f:107
0.8% /u/ncsa/gbauer/temp/md.f:169
0.8% /u/ncsa/gbauer/temp/md.f:162
0.8% /u/ncsa/gbauer/temp/md.f:105
 The above listing (produced using the -e option to cprof) displays not only cycles consumed
by functions (a flat profile) but also the lines in the code that contribute to those
functions.
Profile Listings
Profile Listings on the Linux Clusters
 vprof Listing (cont.)

0.7% /u/ncsa/gbauer/temp/md.f:149
0.5% /u/ncsa/gbauer/temp/md.f:163
0.2% /u/ncsa/gbauer/temp/md.f:109
0.1% /u/ncsa/gbauer/temp/md.f:100



100 0.1% do j=1,np
101 if (i .ne. j) then
102 1.5% call dist(nd,box,pos(1,i),pos(1,j),rij,d)
103 ! attribute half of the potential energy to particle 'j'
104 13.6% pot = pot + 0.5*v(d)
105 0.8% do k=1,nd
106 67.3% f(k,i) = f(k,i) - rij(k)*dv(d)/d
107 0.9% enddo
108 endif
109 0.2% enddo
Profiling Analysis
 The program being analyzed in the previous Origin example has
approximately 10000 source code lines, and consists of many
subroutines.
 The first profile listing shows that over 50% of the computation is
done inside the VSUB subroutine.
 The second profile listing shows that line 8106 in subroutine VSUB
accounted for 50% of the total computation.
 Going back to the source code, line 8106 is a line inside a do loop.
 Putting an OpenMP compiler directive in front of that do loop you can
get 50% of the program to run in parallel with almost no work on your
part.
 Since the compiler has rearranged the source lines, the line numbers
given by ssrun/prof give you an area of the code to inspect.
 To view the rearranged source use the option
f90 … -FLIST:=ON
cc … -CLIST:=ON
 For the Intel compilers, the appropriate options are
ifort … –E …
icc … -E …
Further Information
 SGI Irix
man etime
man 3 time
man 1 time
man busage
man timers
man ssrun
man prof
Origin2000 Performance Tuning and Optimization Guide
 Linux Clusters
man 3 clock
man 2 gettimeofday
man 1 time
man 1 gprof
man 1B qstat
Intel Compilers Vprof on NCSA Linux Cluster
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.1.1 Memory Hierarchy
7.1.2 Cache Mapping
7.1.3 Cache Thrashing
7.1.4 Cache Coherence
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache Concepts
 The CPU time required to perform an operation is the sum
of the clock cycles executing instructions and the clock
cycles waiting for memory.
 The CPU cannot be performing useful work if it is waiting
for data to arrive from memory.
 Clearly then, the memory system is a major factor in
determining the performance of your program and a large
part is your use of the cache.
 The following sections will discuss the key concepts of
cache including:
Memory subsystem hierarchy
Cache mapping
Cache thrashing
Cache coherence
Memory Hierarchy
 The different subsystems in the memory hierarchy have
different speeds, sizes, and costs.
 Smaller memory is faster
 Slower memory is cheaper
 The hierarchy is set up so that the fastest memory is closest
to the CPU, and the slower memories are further away from
the CPU.
Memory Hierarchy
 It's a hierarchy because every level is a subset of a level further away.
 All data in one level is found in the level below.
 The purpose of cache is to improve the memory access time to the processor.
 There is an overhead associated with it, but the benefits outweigh the cost.
 Registers
Registers are the sources and destinations of CPU data operations.
They hold one data element each and are 32 bits or 64 bits wide.
They are on-chip and built from SRAM.
Computers usually have 32 or 64 registers.
The Origin MIPS R10000 has 64 physical 64-bit registers of which 32
are available for floating-point operations.
The Intel IA64 has 328 registers for general-purpose (64 bit), floating-
point (80 bit), predicate (1 bit), branch and other functions.
Register access speeds are comparable to processor speeds.
Memory Hierarchy
 Main Memory Improvements
 A hardware improvement called interleaving reduces main memory
access time.
 In interleaving, memory is divided into partitions or segments called
memory banks.
 Consecutive data elements are spread across the banks.
 Each bank supplies one data element per bank cycle.
 Multiple data elements are read in parallel, one from each bank.
 The problem with interleaving is that the memory interleaving
improvement assumes that memory is accessed sequentially.
 If there is 2-way memory interleaving, but the code accesses every
other location, there is no benefit.
 The bank cycle time is 4-8 times the CPU clock cycle time so the main
memory can’t keep up with the fast CPU and keep it busy with data.
 Large main memory with a cycle time comparable to the processor is
not affordable.
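As a hedged sketch (not part of the original slides, with illustrative bank and element
sizes), the mapping of consecutive elements to interleaved memory banks looks like this:

#include <stdio.h>

/* Minimal sketch: with interleaved memory, consecutive data elements land
   in different banks, so several can be fetched in parallel.             */
#define NUM_BANKS 8
#define ELEM_SIZE 8                     /* 8-byte (64-bit) elements */

int main(void)
{
    for (long addr = 0; addr < 12 * ELEM_SIZE; addr += ELEM_SIZE) {
        long bank = (addr / ELEM_SIZE) % NUM_BANKS;   /* element index mod banks */
        printf("element at byte %3ld -> bank %ld\n", addr, bank);
    }
    /* A stride-2 access pattern would touch only every other bank, which is
       exactly the case described above where interleaving gives no benefit. */
    return 0;
}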
Memory Hierarchy
 Principle of Locality
 The way your program operates follows the Principle of Locality.
 Temporal Locality: When an item is referenced, it will be referenced again soon.
 Spatial Locality: When an item is referenced, items whose addresses are
nearby will tend to be referenced soon.
 Cache Line
 The overhead of the cache can be reduced by fetching a chunk or block
of data elements.
 When a main memory access is made, a cache line of data is brought
into the cache instead of a single data element.
 A cache line is defined in terms of a number of bytes.
 For example, a cache line is typically 32 or 128 bytes.
 This takes advantage of spatial locality.
 The additional elements in the cache line will most likely be needed
soon.
 The cache miss rate falls as the size of the cache line increases, but there
is a point of negative returns on the cache line size.
 When the cache line size becomes too large, the transfer time increases.
Memory Hierarchy
 Cache Hit
A cache hit occurs when the data element requested by the
processor is in the cache.
You want to maximize hits.
The Cache Hit Rate is defined as the fraction of cache hits.
It is the fraction of the requested data that is found in the cache.
 Cache Miss
A cache miss occurs when the data element requested by the
processor is NOT in the cache.
You want to minimize cache misses. Cache Miss Rate is defined as
1.0 - Hit Rate
Cache Miss Penalty, or miss time, is the time needed to retrieve the
data from a lower level (downstream) of the memory hierarchy.
(Recall that the lower levels of the hierarchy have a slower access
time.)
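As a hedged illustration (hypothetical numbers, using the standard textbook formula from
the references in Further Information rather than one stated on these slides), the
average memory access time can be estimated as hit time + miss rate * miss penalty:

#include <stdio.h>

int main(void)
{
    /* Hypothetical values, for illustration only */
    double hit_time     = 2.0;     /* cycles to get data from cache          */
    double miss_rate    = 0.05;    /* 5% of accesses miss                    */
    double miss_penalty = 60.0;    /* cycles to fetch from downstream memory */

    double avg = hit_time + miss_rate * miss_penalty;
    printf("average memory access time = %.1f cycles\n", avg);
    return 0;
}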
Memory Hierarchy
Levels of Cache
It used to be that there were two levels of cache: on-chip and
off-chip.
 L1/L2 is still true for the Origin MIPS and the Intel IA-32 processors.
 Caches closer to the CPU are called Upstream. Caches further from the CPU are
called Downstream.
 The on-chip cache is called First level, L1, or primary cache.
 An on-chip cache performs the fastest but the computer designer makes a trade-off
between die size and cache size. Hence, on-chip cache has a small size. When the on-
chip cache has a cache miss the time to access the slower main memory is very large.
 The off-chip cache is called Second Level, L2, or secondary cache.
 A cache miss is very costly. To solve this problem, computer designers have
implemented a larger, slower off-chip cache. This chip speeds up the on-chip cache
miss time. L1 cache misses are handled quickly. L2 cache misses have a larger
performance penalty.
 The cache external to the chip is called Third Level, L3.
 The newer Intel IA-64 processor has 3 levels of cache
Memory Hierarchy
 Split or Unified Cache
In unified cache, typically L2, the cache is a combined
instruction-data cache.
 A disadvantage of a unified cache is that when the data access and
instruction access conflict with each other, the cache may be thrashed, e.g. a
high cache miss rate.
In split cache, typically L1, the cache is split into 2 parts:
 one for the instructions, called the instruction cache
 another for the data, called the data cache.
 The 2 caches are independent of each other, and they can have independent
properties.
 Memory Hierarchy Sizes
Memory hierarchy sizes are specified in the following units:
 Cache Line: bytes
 L1 Cache: Kbytes
 L2 Cache: Mbytes
 Main Memory: Gbytes
Cache Mapping
 Cache mapping determines which cache location should be
used to store a copy of a data element from main memory.
There are 3 mapping strategies:
Direct mapped cache
Set associative cache
Fully associative cache
 Direct Mapped Cache
In direct mapped cache, a line of main memory is mapped to
only a single line of cache.
Consequently, a particular cache line can be filled from (size of
main memory divided by size of cache) different lines of main
memory.
Direct mapped cache is inexpensive but also inefficient and
very susceptible to cache thrashing.
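As a hedged sketch (not part of the original slides, with illustrative cache and line
sizes), the cache line that a byte address maps to in a direct mapped cache can be
computed as the block number modulo the number of lines:

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE (4 * 1024 * 1024)   /* illustrative 4 MB direct mapped cache */
#define LINE_SIZE  128                 /* illustrative 128-byte cache lines     */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

static unsigned long line_index(uintptr_t addr)
{
    return (unsigned long)((addr / LINE_SIZE) % NUM_LINES);  /* block mod lines */
}

int main(void)
{
    uintptr_t a = 0x10000000;
    uintptr_t b = a + CACHE_SIZE;      /* exactly one cache size apart */

    /* Addresses a multiple of the cache size apart map to the same line and
       evict each other -- the root of the cache thrashing described later. */
    printf("line(a) = %lu, line(b) = %lu\n", line_index(a), line_index(b));
    return 0;
}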
Cache Mapping
Direct Mapped Cache
http://larc.ee.nthu.edu.tw/~cthuang/courses/ee3450/lectures/07_memory.html
Cache Mapping
 Fully Associative Cache
 For fully associative cache, any line of cache can be loaded with any
line from main memory.
 This technology is very fast but also very expensive.

http://www.xbitlabs.com/images/video/radeon-
Cache Mapping
 Set Associative Cache
 For N-way set associative cache, you can think of cache as being divided
into N sets (usually N is 2 or 4).
 A line from main memory can then be written to its cache line in any of the
N sets.
 This is a trade-off between direct mapped and fully associative cache.

http://www.alasir.com/articles/cache_principles/cache_way.
Cache Mapping
 Cache Block Replacement
With direct mapped cache, a cache line can only be mapped to one
unique place in the cache. The new cache line replaces the cache block at
that address. With set associative cache there is a choice of 3 strategies:
1. Random
 There is a uniform random replacement within the set of cache blocks. The
advantage of random replacement is that it’s simple and inexpensive to
implement.
2. LRU (Least Recently Used)
 The block that gets replaced is the one that hasn’t been used for the longest time.
The principle of temporal locality tells us that recently used data blocks are likely
to be used again soon. An advantage of LRU is that it preserves temporal locality.
A disadvantage of LRU is that it’s expensive to keep track of cache access
patterns. In empirical studies, there was little performance difference between
LRU and Random.
3. FIFO (First In First Out)
 Replace the block that was brought in N accesses ago, regardless of the usage
pattern. In empirical studies, Random replacement generally outperformed FIFO.
Cache Thrashing
 Cache thrashing is a problem that happens when a frequently
used cache line gets displaced by another frequently used
cache line.
Cache thrashing can happen for both instruction and data caches.
The CPU can’t find the data element it wants in the cache and
must make another main memory cache line access.
The same data elements are repeatedly fetched into and
displaced from the cache.
 Cache thrashing happens because the computational code
statements have too many variables and arrays for the needed
data elements to fit in cache.
Cache lines are discarded and later retrieved.
The arrays are dimensioned too large to fit in cache. The arrays
are accessed with indirect addressing, e.g. a(k(j)).
Cache Coherence
Cache coherence
is maintained by an agreement between data stored in
cache, other caches, and main memory.
When the same data is being manipulated by different
processors, they must inform each other of their
modification of data.
The term Protocol is used to describe how caches and
main memory communicate with each other.
It is the means by which all the memory subsystems
maintain data coherence.
Cache Coherence
Snoop Protocol
All processors monitor the bus traffic to determine cache
line status.
Directory Based Protocol
Cache lines contain extra bits that indicate which other
processor has a copy of that cache line, and the status of
the cache line – clean (cache line does not need to be sent
back to main memory) or dirty (cache line needs to
update main memory with content of cache line).
Hardware Cache Coherence
Cache coherence on the Origin computer is maintained
in the hardware, transparent to the programmer.
Cache Coherence
False sharing
happens in a multiprocessor system as a result of
maintaining cache coherence.
Both processor A and processor B have the same cache
line.
A modifies the first word of the cache line.
B wants to modify the eighth word of the cache line.
But A has sent a signal to B that B’s cache line is invalid.
B must fetch the cache line again before writing to it.
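A minimal C/OpenMP sketch (not from the original slides) of this pattern: the counters in
shared_line sit on one cache line and suffer false sharing, while the padded copies give
each thread its own line. The padding of 16 doubles assumes a 128-byte cache line; in a
real test you would time the two updates separately.

#include <stdio.h>
#include <omp.h>

#define NTHREADS 4
#define PAD      16                       /* 16 doubles = 128 bytes, one line */

int main(void)
{
    double shared_line[NTHREADS] = {0.0};        /* all counters on one line  */
    double padded[NTHREADS][PAD] = {{0.0}};      /* one counter per line      */

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++) {
            shared_line[id] += 1.0;   /* every update invalidates the line in
                                         the other threads' caches            */
            padded[id][0]   += 1.0;   /* no false sharing                     */
        }
    }

    printf("%f %f\n", shared_line[0], padded[0][0]);
    return 0;
}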
Cache Coherence
A cache miss creates a processor stall.
The processor is stalled until the data is retrieved from
the memory.
The stall is minimized by continuing to load and execute
instructions, until the data that is stalling is retrieved.
These techniques are called:
 Prefetching
 Out of order execution
 Software pipelining
Typically, the compiler will do these at -O3
optimization.
Cache Coherence
 The following is an example of software pipelining:
 Suppose you compute
Do I=1,N
y(I)=y(I) + a*x(I)
End Do
 In pseudo-assembly language, this is what the Origin compiler will
do:
cycle t+0 ld y(I+3)
cycle t+1 ld x(I+3)
cycle t+2 st y(I-4)
cycle t+3 st y(I-3)
cycle t+4 st y(I-2)
cycle t+5 st y(I-1)
cycle t+6 ld y(I+4)
cycle t+7 ld x(I+4)
cycle t+8 ld y(I+5) madd I
cycle t+9 ld x(I+5) madd I+1
cycle t+10 ld y(I+6) madd I+2
cycle t+11 ld x(I+6) madd I+3
Cache Coherence
Since the Origin processor can only execute 1 load or
1 store at a time, the compiler places loads in the
instruction pipeline well before the data is needed.
It is then able to continue loading while
simultaneously performing a fused multiply-add
(a+b*c).
The code above gets 8 flops in 12 clock cycles.
The peak is 24 flops in 12 clock cycles for the Origin.
The Intel Pentium III (IA-32) and the Itanium (IA-64)
will have differing versions of the code above but the
same concepts apply.
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.2.1 Cache on the SGI Origin2000
7.2.2 Cache on the Intel Pentium III
7.2.3 Cache on the Intel Itanium
7.2.4 Cache Summary
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache on the SGI Origin2000
L1 Cache (on-chip primary cache)
Cache size: 32KB floating point data
32KB integer data and instruction
Cache line size: 32 bytes
Associativity: 2-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 4MB per processor
Cache line size: 128 bytes
Associativity: 2-way set associative
Replacement: LRU
Coherence: Directory based 2-way interleaved (2 banks)
Cache on the SGI Origin2000
Bandwidth L1 cache-to-processor
1.6 GB/s/bank
3.2 GB/sec overall possible
Latency: 1 cycle
Bandwidth between L1 and L2 cache
1GB/s
Latency: 11 cycles
Bandwidth between L2 cache and local memory
.5 GB/s
Latency: 61 cycles
Average 32 processor remote memory
Latency: 150 cycles
Cache on the Intel Pentium III
L1 Cache (on-chip primary cache)
Cache size: 16KB floating point data
16KB integer data and instruction
Cache line size: 16 bytes
Associativity: 4-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 256 KB per processor
Cache line size: 32 bytes
Associativity: 8-way set associative
Replacement: pseudo-LRU
Coherence: interleaved (8 banks)
Cache on the Intel Pentium III
Bandwidth L1 cache-to-processor
16 GB/s
Latency: 2 cycles
Bandwidth between L1 and L2 cache
11.7 GB/s
Latency: 4-10 cycles
Bandwidth between L2 cache and local memory
1.0 GB/s
Latency: 15-21 cycles
Cache on the Intel Itanium
 L1 Cache (on-chip primary cache)
Cache size: 16KB floating point data
16KB integer data and instruction
Cache line size: 32 bytes
Associativity: 4-way set associative
 L2 Cache (off-chip secondary cache)
Cache size: 96KB unified data and instruction
Cache line size: 64 bytes
Associativity: 6-way set associative
Replacement: LRU
 L3 Cache (off-chip tertiary cache)
Cache size: 4MB per processor
Cache line size: 64 bytes
Associativity: 4-way set associative
Replacement: LRU
Cache on the Intel Itanium
Bandwidth L1 cache-to-processor
25.6 GB/s
Latency: 1 - 2 cycle
Bandwidth between L1 and L2 cache
25.6 GB/sec
Latency: 6 - 9 cycles
Bandwidth between L2 and L3 cache
11.7 GB/sec
Latency: 21 - 24 cycles
Bandwidth between L3 cache and main memory
2.1 GB/sec
Latency: 50 cycles
Cache Summary
Chip           MIPS R10000    Pentium III    Itanium
#Caches        2              2              3
Associativity  2/2            4/8            4/6/4
Replacement    LRU            Pseudo-LRU     LRU
CPU MHz        195/250        1000           800
Peak Mflops    390/500        1000           3200
LD,ST/cycle    1 LD or 1 ST   1 LD and 1 ST  2 LD or 2 ST
 Only one load or store may be performed each CPU cycle on the
R10000.
 This indicates that loads and stores may be a bottleneck.
 Efficient use of cache is extremely important.
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.4.1 Measuring Cache Performance on the SGI Origin2000
7.4.2 Measuring Cache Performance on the Linux Clusters
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Code Optimization
Gather statistics to find out where the bottlenecks are in your
code so you can identify what you need to optimize.
The following questions can be useful to ask:
How much time does the program take to execute?
 Use /usr/bin/time a.out for CPU time
Which subroutines use the most time?
 Use ssrun and prof on the Origin or gprof and vprof on the Linux clusters.
Which loop uses the most time?
 Put etime/dtime or other recommended timer calls around loops for CPU
time.
 For more information on timers see Timing and Profiling section.
What is contributing to the cpu time?
 Use the Perfex utility on the Origin or perfex or hpmcount on the Linux
clusters.
Code Optimization
Some useful optimizing and profiling tools are
etime/dtime/time
perfex
ssusage
ssrun/prof
gprof
cvpav, cvd
See the NCSA web pages on Compiler, Performance, and
Productivity Tools at
http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Tools/
for information on which tools are available on NCSA
platforms.
Measuring Cache Performance on the
SGI Origin2000
The R10000 processors of NCSA’s Origin2000
computers have hardware performance counters.
There are 32 events that are measured and each event is
numbered.
0 = cycles
1 = Instructions issued
...
26 = Secondary data cache misses
...
View man perfex for more information.
The Perfex Utility
The hardware performance counters can be measured
using the perfex utility.
perfex [options] command [arguments]
Measuring Cache Performance on the
SGI Origin2000
 where the options are:
-e counter1 -e counter2
This specifies which events are to be counted. You enter the
number of the event you want counted. (Remember to have a
space in between the "e" and the event number.)
-a
sample ALL the events
-mp
Report all results on a per thread basis.
-y
Report the results in seconds, not cycles.
-x
Gives extra summary info including Mflops.
command
Specify the name of the executable file.
arguments
Specify the input and output arguments to the executable file.
Measuring Cache Performance on the
SGI Origin2000
Examples
perfex -e 25 -e 26 a.out
- outputs the L1 and L2 cache misses
- the output is reported in cycles
perfex -a -y a.out > results
- outputs ALL the hardware performance counters
- the output is reported in seconds
Measuring Cache Performance on the
Linux Clusters
The Intel Pentium III and Itanium processors
provide hardware event counters that can be
accessed from several tools.
 perfex for the Pentium III and pfmon for the Itanium
To view usage and options for perfex and
pfmon:
perfex -h
pfmon --help
To measure L2 cache misses:
perfex -e P6_L2_LINES_IN a.out
pfmon --events=L2_MISSES a.out
Measuring Cache Performance on the
Linux Clusters
psrun [soft add +perfsuite]
Another tool that provides access to the
hardware event counter and also provides
derived statistics is perfsuite.
To add perfsuite's psrun to the current shell
environment:
soft add +perfsuite
To measure cache misses:
psrun a.out
psprocess a.out*.xml
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Locating the Cache Problem
For the Origin, the perfex output is a first-pass detection
of a cache problem.
If you then use the CaseVision tools, you can locate the
cache problem in your code.
The CaseVision tools are
cvpav for performance analysis
cvd for debugging
CaseVision is not available on the Linux clusters.
Tools like vprof and libhpm provide routines for users to
instrument their code.
Using vprof with the PAPI cache events can provide
detailed information about where poor cache utilization is
occurring.
Cache Tuning Strategy
The strategy for performing cache tuning on your code
is based on data reuse.
Temporal Reuse
 Use the same data elements on more than one iteration of the
loop.
Spatial Reuse
 Use data that is encached as a result of fetching nearby data
elements from downstream memory.
 Strategies that take advantage of the Principle of
Locality will improve performance.
Preserve Spatial Locality
 Check loop nesting to ensure stride-one memory access.
 The following code does not preserve spatial locality:
do I=1,n
do K=1,n
do J=1,n
C(I,J)=C(I,J) + A(I,K) * B(K,J)
end do …
 It is not wrong but runs much slower than it could.
 To ensure stride-one access modify the code using loop
interchange.
do J=1,n
do K=1,n
do I=1,n
C(I,J)=C(I,J) + A(I,K) * B(K,J)
end do …
 For Fortran the innermost loop index should be the leftmost index
of the arrays. The code has been modified for spatial reuse.
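For C programmers, the same idea applies with the opposite index order, since C stores
arrays row-major. A minimal sketch (not from the original slides; the array size is
illustrative):

#include <stdio.h>

#define N 512

/* In C the RIGHTMOST index varies fastest in memory, so keeping j in the
   innermost loop gives stride-one access to c and b.                     */
double a[N][N], b[N][N], c[N][N];

void matmul(void)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}

int main(void)
{
    matmul();                          /* globals are zero-initialized */
    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}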
Locality Problem
Suppose your code looks like:
DO J=1,N
DO I=1,N
A(I,J)=B(J,I)
ENDDO
ENDDO
The loop as it is typed above does not have unit-stride
access on loads.
If you interchange the loops, the code doesn’t have
unit-stride access on stores.
Use the optimized, intrinsic-function transpose from
the FORTRAN compiler instead of hand-coding it.
Grouping Data Together
 Consider the following code segment:
d=0.0
do I=1,n
j=index(I)
d = d + sqrt(x(j)*x(j) + y(j)*y(j) + z(j)*z(j))
 Since the arrays are accessed with indirect accessing, it is
likely that 3 new cache lines need to be brought into the
cache for each iteration of the loop. Modify the code by
grouping together x, y, and z into a 2-dimensional array
named r.
d=0.0
do I=1,n
j=index(I)
d = d + sqrt(r(1,j)*r(1,j) + r(2,j)*r(2,j) +
r(3,j)*r(3,j))
 Since r(1,j), r(2,j), and r(3,j) are contiguous in memory, it is
likely they will be in one cache line. Hence, 1 cache line,
rather than 3, is brought in for each iteration of I. The code
has been modified for cache reuse.
Cache Thrashing Example
 This example thrashes a 4MB direct mapped cache.
parameter (max = 1024*1024)
common /xyz/ a(max), b(max)
do I=1,max
something = a(I) + b(I)
enddo
 The cache lines for both a and b have the same cache address.
 To avoid cache thrashing in this example, pad common with
the size of a cache line.
parameter (max = 1024*1024)
common /xyz/ a(max),extra(32),b(max)
do I=1,max
something=a(I) + b(I)
enddo
 Improving cache utilization is often the key to getting good
performance.
Not Enough Cache
Ideally you want the inner loop’s arrays and variables
to fit into cache.
If a scalar program won’t fit in cache, its parallel
version may fit in cache with a large enough number
of processors.
This often results in super-linear speedup.
Loop Blocking
This technique is useful when the arrays are too large to fit
into the cache.
Loop blocking uses strip mining of loops and loop
interchange.
A blocked loop accesses array elements in sections that
optimally fit in the cache.
It allows for spatial and temporal reuse of data, thus
minimizing cache misses.
The following example (next slide) illustrates loop
blocking of matrix multiplication.
The code in the PRE column depicts the original code, the
POST column depicts the code when it is blocked.
Loop Blocking
PRE:
do k=1,n
  do j=1,n
    do i=1,n
      c(i,j)=c(i,j)+a(i,k)*b(k,j)
    enddo
  enddo
enddo

POST:
do kk=1,n,iblk
  do jj=1,n,iblk
    do ii=1,n,iblk
      do j=jj,jj+iblk-1
        do k=kk,kk+iblk-1
          do i=ii,ii+iblk-1
            c(i,j)=c(i,j)+a(i,k)*b(k,j)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
Further Information
 Computer Organization and Design
The Hardware/Software Interface, David A. Patterson and
John L. Hennessy, Morgan Kaufmann Publishers, Inc.
 Computer Architecture
A Quantitative Approach, John L. Hennessy and David A.
Patterson, Morgan Kaufmann Publishers, Inc.
The Cache Memory Book, Jim Handy, Academic Press
High Performance Computing, Charles Severance, O’Reilly
and Associates, Inc.
A Practitioner’s Guide to RISC Microprocessor Architecture,
Patrick H. Stakem, John Wiley & Sons, Inc.
Tutorial on Optimization of Fortran, John Levesque,
Applied Parallel Research
Intel® Architecture Optimization Reference Manual
Intel® Itanium® Processor Manuals
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
8.1 Speedup
8.2 Speedup Extremes
8.3 Efficiency
8.4 Amdahl's Law
8.5 Speedup Limitations
8.6 Benchmarks
8.7 Summary
9 About the IBM Regatta P690
Parallel Performance Analysis
Now that you have parallelized your code, and have
run it on a parallel computer using multiple processors
you may want to know the performance gain that
parallelization has achieved.
This chapter describes how to compute parallel code
performance.
Often the performance gain is not perfect, and this
chapter also explains some of the reasons for
limitations on parallel performance.
Finally, this chapter covers the kinds of information
you should provide in a benchmark, and some sample
benchmarks are given.
Speedup
 The speedup of your code tells you how much performance
gain is achieved by running your program in parallel on
multiple processors.
A simple definition is that it is the length of time it takes a
program to run on a single processor, divided by the time it
takes to run on a multiple processors.
Speedup generally ranges between 0 and p, where p is the
number of processors.
 Scalability
When you compute with multiple processors in a parallel
environment, you will also want to know how your code scales.
The scalability of a parallel code is defined as its ability to
achieve performance proportional to the number of processors
used.
As you run your code with more and more processors, you
want to see the performance of the code continue to improve.
Computing speedup is a good way to measure how a program
scales as more processors are used.
Speedup
Linear Speedup
If it takes one processor an amount of time t to do a task
and if p processors can do the task in time t / p, then you
have perfect or linear speedup (Sp= p).
 That is, running with 4 processors improves the time by a factor
of 4, running with 8 processors improves the time by a factor of
8, and so on.
 This is shown in the following illustration.
Speedup Extremes
 The extremes of speedup happen when speedup is
greater than p, called super-linear speedup, or
less than 1.
 Super-Linear Speedup
You might wonder how super-linear speedup can occur. How
can speedup be greater than the number of processors used?
 The answer usually lies with the program's memory use. When using
multiple processors, each processor only gets part of the problem
compared to the single processor case. It is possible that the smaller
problem can make better use of the memory hierarchy, that is, the
cache and the registers. For example, the smaller problem may fit in
cache when the entire problem would not.
 When super-linear speedup is achieved, it is often an indication that
the sequential code, run on one processor, had serious cache miss
problems.
 The most common programs that achieve super-linear
speedup are those that solve dense linear algebra problems.
Speedup Extremes
Parallel Code Slower than Sequential Code
When speedup is less than one, it means that the parallel
code runs slower than the sequential code.
This happens when there isn't enough computation to be
done by each processor.
The overhead of creating and controlling the parallel
threads outweighs the benefits of parallel computation,
and it causes the code to run slower.
To eliminate this problem you can try to increase the
problem size or run with fewer processors.
Efficiency
Efficiency is a measure of parallel performance that is
closely related to speedup and is often also presented in
a description of the performance of a parallel program.
Efficiency with p processors is defined as the ratio of
speedup with p processors to p.
 Efficiency, Ep = Sp / p, is a fraction that usually ranges between 0
and 1.
Ep=1 corresponds to perfect speedup of Sp= p.
You can think of efficiency as describing the average
speedup per processor.
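A quick worked example (hypothetical numbers, not from the original slides): with p = 8
processors and a measured speedup of Sp = 6, the efficiency is Ep = 6/8 = 0.75, so on
average each processor delivers 75% of its ideal share of the speedup.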
Amdahl's Law
 An alternative formula for speedup is named Amdahl's Law
attributed to Gene Amdahl, one of America's great computer
scientists.
 This formula, introduced in 1967, states that no matter how many
processors are used in a parallel run, a program's speedup will be limited by
its fraction of sequential code.
 That is, almost every program has a fraction of the code that doesn't lend
itself to parallelism.
 This is the fraction of code that will have to be run with just one processor,
even in a parallel run.
 Amdahl's Law defines speedup with p processors as follows:
Sp = 1 / (f + (1 - f)/p)
 where the term f stands for the fraction of operations done
sequentially with just one processor, and the term (1 - f) stands for the
fraction of operations done in perfect parallelism with p processors.
Amdahl's Law
The sequential fraction of code, f, is a unitless
measure ranging between 0 and 1.
When f is 0, meaning there is no sequential code, then
speedup is p, or perfect parallelism. This can be seen by
substituting f = 0 in the formula above, which results in
Sp = p.
When f is 1, meaning there is no parallel code, then
speedup is 1, or there is no benefit from parallelism. This
can be seen by substituting f = 1 in the formula above,
which results in Sp = 1.
This shows that Amdahl's speedup ranges between
1 and p, where p is the number of processors used
in a parallel processing run.
Amdahl's Law
The interpretation of Amdahl's Law is that speedup is
limited by the fact that not all parts of a code can be
run in parallel.
Substituting in the formula, when the number of processors
goes to infinity, your code's speedup is still limited by 1 / f.
Amdahl's Law shows that the sequential fraction of
code has a strong effect on speedup.
This helps to explain the need for large problem sizes when
using parallel computers.
It is well known in the parallel computing community, that
you cannot take a small application and expect it to show good
performance on a parallel computer.
To get good performance, you need to run large applications,
with large data array sizes, and lots of computation.
The reason for this is that as the problem size increases the
opportunity for parallelism grows, and the sequential fraction
shrinks, and it shrinks in its importance for speedup.
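A minimal C sketch (not from the original slides, with a hypothetical sequential
fraction f = 0.1) showing how Amdahl's Law speedup approaches the 1/f limit as
processors are added:

#include <stdio.h>

int main(void)
{
    double f = 0.1;                               /* assumed 10% sequential code */
    int procs[] = {1, 2, 4, 8, 16, 64, 1024};
    int n = sizeof(procs) / sizeof(procs[0]);

    for (int i = 0; i < n; i++) {
        double p  = (double)procs[i];
        double sp = 1.0 / (f + (1.0 - f) / p);    /* Amdahl's Law */
        printf("p = %4d   speedup = %7.2f\n", procs[i], sp);
    }
    printf("limit as p -> infinity: %.2f\n", 1.0 / f);
    return 0;
}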
Agenda
8 Parallel Performance Analysis
8.1 Speedup
8.2 Speedup Extremes
8.3 Efficiency
8.4 Amdahl's Law
8.5 Speedup Limitations
8.5.1 Memory Contention Limitation
8.5.2 Problem Size Limitation
8.6 Benchmarks
8.7 Summary
Speedup Limitations
This section covers some of the reasons why a
program doesn't get perfect Speedup. Some of the
reasons for limitations on speedup are:
Too much I/O
 Speedup is limited when the code is I/O bound.
 That is, when there is too much input or output compared to the
amount of computation.
Wrong algorithm
 Speedup is limited when the numerical algorithm is not suitable
for a parallel computer.
 You need to replace it with a parallel algorithm.
Too much memory contention
 Speedup is limited when there is too much memory contention.
 You need to redesign the code with attention to data locality.
 Cache reutilization techniques will help here.
Speedup Limitations
Wrong problem size
 Speedup is limited when the problem size is too small to take best
advantage of a parallel computer.
 In addition, speedup is limited when the problem size is fixed.
 That is, when the problem size doesn't grow as you compute with
more processors.
Too much sequential code
 Speedup is limited when there's too much sequential code.
 This is shown by Amdahl's Law.
Too much parallel overhead
 Speedup is limited when there is too much parallel overhead
compared to the amount of computation.
 These are the additional CPU cycles accumulated in creating parallel
regions, creating threads, synchronizing threads, spin/blocking
threads, and ending parallel regions.
Load imbalance
 Speedup is limited when the processors have different workloads.
 The processors that finish early will be idle while they are waiting for
the other processors to catch up.
Memory Contention Limitation
 Gene Golub, a professor of Computer Science at Stanford
University, writes in his book on parallel computing that the best
way to define memory contention is with the word delay.
 When different processors all want to read or write into the main
memory, there is a delay until the memory is free.
 On the SGI Origin2000 computer, you can determine whether your
code has memory contention problems by using SGI's perfex utility.
 The perfex utility is covered in the Cache Tuning lecture in this course.
 You can also refer to SGI's manual page, man perfex, for more details.
 On the Linux clusters, you can use the hardware performance
counter tools to get information on memory performance.
 On the IA32 platform, use perfex, vprof, hpmcount, psrun/perfsuite.
 On the IA64 platform, use vprof, pfmon, psrun/perfsuite.
Memory Contention Limitation
 Many of these tools can be used with the PAPI performance
counter interface.
 Be sure to refer to the man pages and webpages on the NCSA
website for more information.
 If the output of the utility shows that memory contention is a
problem, you will want to use some programming techniques for
reducing memory contention.
 A good way to reduce memory contention is to access elements
from the processor's cache memory instead of the main memory.
 Some programming techniques for doing this are:
 Access arrays with unit stride.
 Order nested do loops (in Fortran) so that the innermost loop index is the
leftmost index of the arrays in the loop. For the C language, the order is the
opposite of Fortran.
 Avoid specific array sizes that are the same as the size of the data cache or
that are exact fractions or exact multiples of the size of the data cache.
 Pad common blocks.
 These techniques are called cache tuning optimizations. The
details for performing these code modifications are covered in
the section on Cache Optimization of this lecture.
Problem Size Limitation
Small Problem Size
Speedup is almost always an increasing function of
problem size.
If there's not enough work to be done by the available
processors, the code will show limited speedup.
The effect of small problem size on speedup is shown in
the following illustration.
Problem Size Limitation
Fixed Problem Size
When the problem size is fixed, you can reach a point of
negative returns when using additional processors.
As you compute with more and more processors, each
processor has less and less amount of computation to
perform.
The additional parallel overhead, compared to the
amount of computation, causes the speedup curve to start
turning downward as shown in the following figure.
Benchmarks
It will finally be time to report the parallel
performance of your application code.
You will want to show a speedup graph with the
number of processors on the x axis, and speedup
on the y axis.
Some other things you should report and record
are:
the date you obtained the results
the problem size
the computer model
the compiler and the version number of the compiler
any special compiler options you used
Benchmarks
When doing computational science, it is often helpful to
find out what kind of performance your colleagues are
obtaining.
In this regard, NCSA has a compilation of parallel
performance benchmarks online at
http://www.ncsa.uiuc.edu/UserInfo/Perf/NCSAbench/.
You might be interested in looking at these benchmarks
to see how other people report their parallel performance.
In particular, the NAMD benchmark is a report about the
performance of the NAMD program that does molecular
dynamics simulations.
Summary
There are many good texts on parallel computing
which treat the subject of parallel performance
analysis. Here are two useful references:
Scientific Computing An Introduction with Parallel
Computing, Gene Golub and James Ortega, Academic
Press, Inc.
Parallel Computing Theory and Practice, Michael J.
Quinn, McGraw-Hill, Inc.
Agenda
 1 Parallel Computing Overview
 2 How to Parallelize a Code
 3 Porting Issues
 4 Scalar Tuning
 5 Parallel Code Tuning
 6 Timing and Profiling
 7 Cache Tuning
 8 Parallel Performance Analysis
 9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
About the IBM Regatta P690
To obtain your program’s top performance, it is
important to understand the architecture of the
computer system on which the code runs.
This chapter describes the architecture of NCSA's IBM
p690.
Technical details on the size and design of the
processors, memory, cache, and the interconnect
network are covered along with technical
specifications for the compute rate, memory size and
speed, and interconnect bandwidth.
IBM p690 General Overview
The p690 is IBM's latest Symmetric Multi-Processor
(SMP) machine with Distributed Shared Memory (DSM).
This means that memory is physically distributed and
logically shared.
It is based on the Power4 architecture and is a successor to
the Power3-II based RS/6000 SP system.
IBM p690 Scalability
The IBM p690 is a flexible, modular, and scalable
architecture.
It scales in these terms:
 Number of processors
 Memory size
 I/O and memory bandwidth and the Interconnect bandwidth
Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.2.1 Power4 Core
9.2.2 Multi-Chip Modules
9.2.3 The Processor
9.2.4 Cache Architecture
9.2.5 Memory Subsystem
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
IBM p690 Building Blocks
An IBM p690 system is built from a number of
fundamental building blocks.
The first of these building blocks is the Power4 Core,
which includes the processors and L1 and L2 caches.
At NCSA, four of these Power4 Cores are linked to form
a Multi-Chip Module.
This module includes the L3 cache and four Multi-Chip
Modules are linked to form a 32 processor system (see
figure on the next slide).
Each of these components will be described in the
following sections.
32-processor IBM p690 configuration
(Image courtesy of IBM)
Power4 Core
The Power4 Chip contains:
Two processors
Local caches (L1)
External cache for each processor (L2)
I/O and Interconnect interfaces
The POWER4 chip
(Image courtesy of IBM)
Multi-Chip Modules
Four Power4 Chips are assembled to form a Multi-Chip
Module (MCM) that contains 8 processors.
Each MCM also supports the L3 cache for each Power4
chip.
Multiple MCM interconnection (Image courtesy of IBM)
The Processor
The processors at the heart of the Power4 Core are
speculative, superscalar, out-of-order execution chips.
The Power4 is a 4-way superscalar RISC architecture
running instructions on its 8 pipelined execution units.
Speed of the Processor
The NCSA IBM p690 has CPUs running at 1.3 GHz.
64-Bit Processor Execution Units
There are 8 independent fully pipelined execution units.
 2 load/store units for memory access
 2 identical floating point execution units capable of fused multiply/add
 2 fixed point execution units
 1 branch execution unit
 1 logic operation unit
The Processor
Together, the units can perform 4 floating point operations,
fetch 8 instructions, and complete 5 instructions per cycle.
The processor can handle up to 200 in-flight instructions.
Performance Numbers
Peak Performance:
 4 floating point instructions per cycle
 1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS
MIPS Rating:
 5 instructions per cycle
 1.3 Gcycles/sec * 5 instructions/cycle yields 6500 MIPS
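As a quick sanity check on these peak figures, the short sketch below
recomputes them from the 1.3 GHz clock rate and the per-cycle
capabilities quoted above; it is illustrative arithmetic only.

#include <stdio.h>

int main(void)
{
    double clock_ghz       = 1.3;  /* p690 clock rate, Gcycles/sec        */
    double flops_per_cycle = 4.0;  /* 2 FPUs, each doing a fused mul/add  */
    double insts_per_cycle = 5.0;  /* instruction completion rate         */

    printf("Peak floating point rate: %.1f GFLOPS\n", clock_ghz * flops_per_cycle);
    printf("Peak instruction rate:    %.0f MIPS\n", clock_ghz * insts_per_cycle * 1000.0);
    return 0;
}

Running it prints 5.2 GFLOPS and 6500 MIPS, matching the figures above.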
Instruction Set
The instruction set architecture (ISA) on the IBM p690 is
the PowerPC AS instruction set.
Cache Architecture
 Each Power4 Core has both a primary (L1) cache associated with each
processor and a secondary (L2) cache shared between the two processors.
In addition, each Multi-Chip Module has a L3 cache.
 Level 1 Cache
 The Level 1 cache is in the processor core. It has split instruction and data
caches.
 L1 Instruction Cache
 The properties of the Instruction Cache are:
 64KB in size
 direct mapped
 cache line size is 128 bytes
 L1 Data Cache
 The properties of the L1 Data Cache are:
 32KB in size
 2-way set associative
 FIFO replacement policy
 2-way interleaved
 cache line size is 128 bytes
 Peak speed is achieved when the data accessed in a loop is entirely contained
in the L1 data cache.
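As an illustration of that point (a sketch, not code from the course),
the loop below walks a large array in blocks chosen so that the two
arrays it touches occupy about 16 KB, comfortably within the 32 KB L1
data cache, so the repeated passes over each block are served from L1.

#include <stdio.h>

#define N     1048576
#define BLOCK 1024              /* 1024 doubles x 8 bytes x 2 arrays = 16 KB < 32 KB L1 */

static double a[N], b[N];

int main(void)
{
    for (long start = 0; start < N; start += BLOCK) {
        long end = start + BLOCK;
        /* Repeated passes over the same block reuse data already resident in L1. */
        for (int pass = 0; pass < 4; pass++)
            for (long i = start; i < end; i++)
                a[i] = a[i] * 1.0001 + b[i];
    }
    printf("%f\n", a[0]);       /* keep the compiler from optimizing the loops away */
    return 0;
}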
Cache Architecture
Level 2 Cache on the Power4 Chip
When the processor can't find a data element in the
L1 cache, it looks in the L2 cache. The properties of
the L2 Cache are:
external from the processor
unified instruction and data cache
1.41MB per Power4 chip (2 processors)
8-way set associative
split between 3 controllers
cache line size is 128 bytes
pseudo-LRU replacement policy (the L2 also serves as the
coherency point for the chip)
124.8 GB/s peak bandwidth from L2
Cache Architecture
Level 3 Cache on the Multi-Chip Module
When the processor can't find a data element in the
L2 cache, it looks in the L3 cache. The properties of
the L3 Cache are:
external from the Power4 Core
unified instruction and data cache
128MB per Multi-Chip Module (8 processors)
8-way set associative
cache line size is 512 bytes
55.5 GB/s peak bandwidth from L3
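The line sizes above are why access patterns matter: with 128-byte
L1/L2 lines, sixteen consecutive doubles share one line, so a
unit-stride loop amortizes each line fetch over sixteen elements while
a stride-16 loop pays for a fresh line on every access. The sketch
below is an illustrative contrast of the two patterns, not code from
the course; time the two calls with any wall-clock timer to see the
difference.

#include <stdio.h>

#define N (1 << 22)                 /* 4M doubles = 32 MB, larger than L2 */

static double x[N];

static double sum_stride(int stride)
{
    double s = 0.0;
    for (int i = 0; i < N; i += stride)
        s += x[i];                  /* stride 1 reuses each 128-byte line; stride 16 does not */
    return s;
}

int main(void)
{
    for (int i = 0; i < N; i++) x[i] = 1.0;
    /* Both calls touch the same number of cache lines, but the stride-16
       call does only 1/16th of the arithmetic, so its time per element is
       dominated by line fetches rather than useful work. */
    printf("%f %f\n", sum_stride(1), sum_stride(16));
    return 0;
}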
Memory Subsystem
The total memory is physically distributed among
the Multi-Chip Modules of the p690 system (see
the diagram in the next slide).
Memory Latencies
The latency penalties for each of the levels of
the memory hierarchy are:
L1 Cache - 4 cycles
L2 Cache - 14 cycles
L3 Cache - 102 cycles
Main Memory - 400 cycles
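These latencies can be observed with a pointer-chasing probe such as
the sketch below (illustrative only, and assuming a POSIX
clock_gettime); because each load depends on the previous one, the
average time per access approaches the latency of whichever level of
the hierarchy the working set spills into. At 1.3 GHz, one cycle is
about 0.77 ns, so the cycle counts above convert directly to
nanoseconds.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = 1 << 22;                        /* 4M pointers (~32 MB): well beyond L2 */
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL) return 1;

    /* Sattolo's algorithm builds a single random cycle, which defeats
       hardware prefetching and forces dependent loads deep into the hierarchy. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t r = ((size_t)rand() << 15) ^ (size_t)rand();
        size_t j = r % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < n; i++) p = next[p]; /* chain of dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("average %.1f ns per access (checksum %zu)\n", secs / (double)n * 1e9, p);
    free(next);
    return 0;
}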
Memory distribution within an MCM
Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
Features Performed by the Hardware
The following is done completely by the hardware,
transparent to the user:
Global memory addressing (makes the system memory
shared)
Address resolution
Maintaining cache coherency
Automatic page migration from remote to local memory
(to reduce interconnect memory transactions)
The Operating System
The operating system is AIX. NCSA's p690 system is
currently running version 5.1 of AIX, which provides a
full 64-bit file system.
Compatibility
AIX 5.1 is highly compatible with both BSD and System V
Unix.
Further Information
Computer Architecture: A Quantitative Approach,
John Hennessy, et al., Morgan Kaufmann Publishers, 2nd
Edition, 1996
Computer Organization and Design: The
Hardware/Software Interface,
David A. Patterson, et al., Morgan Kaufmann Publishers,
2nd Edition, 1997
IBM P Series [595] at the URL:
 http://www-03.ibm.com/systems/p/hardware/highend/590/index.html
IBM p690 Documentation at NCSA at the URL:
 http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/