Data Parallelism
The K loop of a matrix multiply is divided among 4 processors. Each processor is assigned its own iterations of K and the corresponding data elements:

Processor   Iterations of K   Data Elements
proc0       K=1:5             A(I, 1:5)    B(1:5, J)
proc1       K=6:10            A(I, 6:10)   B(6:10, J)
proc2       K=11:15           A(I, 11:15)  B(11:15, J)
proc3       K=16:20           A(I, 16:20)  B(16:20, J)

Each processor executes, for its own values of K:

DO I=1,N
   DO J=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
   END DO
END DO
OpenMP Style of Parallelism
Parallelization in the OpenMP style can be done incrementally as follows:
1. Parallelize the most computationally intensive loop.
2. Compute performance of the code.
3. If performance is not satisfactory, parallelize another
loop.
4. Repeat steps 2 and 3 as many times as needed.
The ability to perform incremental parallelism is
considered a positive feature of data parallelism.
It is contrasted with the MPI (Message Passing
Interface) style of parallelism, which is an "all or
nothing" approach.
Task Parallelism
Task parallelism may be thought of as the opposite of data
parallelism.
Instead of the same operations being performed on different
parts of the data, each process performs different operations.
You can use task parallelism when your program can be split
into independent pieces, often subroutines, that can be
assigned to different processors and run concurrently.
Task parallelism is called "coarse grain" parallelism because
the computational work is spread into just a few subtasks.
More code is run in parallel because the parallelism is
implemented at a higher level than in data parallelism.
Task parallelism is often easier to implement and has less
overhead than data parallelism.
Task Parallelism
The abstract code shown in the diagram is decomposed
into 4 independent code segments that are labeled A,
B, C, and D. The right hand side of the diagram
illustrates the 4 code segments running concurrently.
Task Parallelism
Original Code:

program main
  code segment labeled A
  code segment labeled B
  code segment labeled C
  code segment labeled D
end

Parallel Code:

program main
!$OMP PARALLEL
!$OMP SECTIONS
  code segment labeled A
!$OMP SECTION
  code segment labeled B
!$OMP SECTION
  code segment labeled C
!$OMP SECTION
  code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
end
OpenMP Task Parallelism
With OpenMP, the code that follows each
SECTION directive is allocated to a different
processor. In our sample parallel code, the allocation
of code segments to processors is as follows.
Processor   Code
proc0       code segment labeled A
proc1       code segment labeled B
proc2       code segment labeled C
proc3       code segment labeled D
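A minimal, runnable sketch of the same structure (the print statements here are stand-ins for the four code segments; compile with an OpenMP-aware compiler, e.g. f90 -mp as shown later in this course):

program sections_demo
   implicit none
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
   print *, 'code segment A'   ! each section runs on a different thread
!$OMP SECTION
   print *, 'code segment B'
!$OMP SECTION
   print *, 'code segment C'
!$OMP SECTION
   print *, 'code segment D'
!$OMP END SECTIONS
!$OMP END PARALLEL
end program sections_demo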
Parallelism in Computers
How parallelism is exploited and enhanced within the
operating system and hardware components of a
parallel computer:
operating system
arithmetic
memory
disk
Operating System Parallelism
All of the commonly used parallel computers run a version
of the Unix operating system. In the table below each OS
listed is in fact Unix, but the name of the Unix OS varies
with each vendor.
Parallel Computer      OS
SGI Origin2000         IRIX
HP V-Class             HP-UX
Cray T3E               Unicos
IBM SP                 AIX
Workstation Clusters   Linux
cron feature
With the Unix cron feature you can submit a job that will
run at a later time.
Arithmetic Parallelism
Multiple execution units
facilitate arithmetic parallelism.
The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are
each done in a separate execution unit. This allows several execution units to
be used simultaneously, because the execution units operate independently.
Fused multiply and add
is another parallel arithmetic feature.
Parallel computers are able to overlap multiply and add. This arithmetic is
named MultiplyADD (MADD) on SGI computers, and Fused Multiply Add
(FMA) on HP computers. In either case, the two arithmetic operations are
overlapped and can complete in hardware in one computer cycle.
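As an illustration, a loop of the following form performs one multiply and one add per iteration; on hardware with MADD/FMA support the compiler can fuse each pair into a single instruction. This is a sketch only; actual code generation depends on the compiler and optimization flags.

program fma_demo
   implicit none
   integer :: i
   real :: x(1000), y(1000), a
   a = 2.0
   x = 1.0
   y = 0.0
   do i = 1, 1000
      y(i) = y(i) + a*x(i)   ! one multiply + one add per iteration:
   end do                    ! a candidate for a fused MADD/FMA instruction
   print *, y(1)
end program fma_demo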
Superscalar arithmetic
is the ability to issue several arithmetic operations per computer cycle.
It makes use of the multiple, independent execution units. On superscalar
computers there are multiple slots per cycle that can be filled with work. This
gives rise to the name n-way superscalar, where n is the number of slots per
cycle. The SGI Origin2000 is called a 4-way superscalar computer.
Memory Parallelism
memory interleaving
memory is divided into multiple banks, and consecutive data elements are
interleaved among them. For example if your computer has 2 memory
banks, then data elements with even memory addresses would fall into
one bank, and data elements with odd memory addresses into the other.
multiple memory ports
A port is a bi-directional memory pathway. When the data elements
that are interleaved across the memory banks are needed, the multiple
memory ports allow them to be accessed and fetched in parallel, which
increases the memory bandwidth (MB/s or GB/s).
multiple levels of the memory hierarchy
There is global memory that any processor can access. There is memory
that is local to a partition of the processors. Finally there is memory that is
local to a single processor, that is, the cache memory and the memory
elements held in registers.
Cache memory
Cache is a small memory that has fast access compared with the larger
main memory and serves to keep the faster processor filled with data.
Memory Parallelism
[Figures: Memory Hierarchy; Cache Memory]
Disk Parallelism
RAID (Redundant Array of Inexpensive Disks)
RAID disks are on most parallel computers.
The advantage of a RAID disk system is that it provides a
measure of fault tolerance.
If one of the disks goes down, it can be swapped out, and
the RAID disk system remains operational.
Disk Striping
When a data set is written to disk, it is striped across the
RAID disk system. That is, it is broken into pieces that are
written simultaneously to the different disks in the RAID
disk system. When the same data set is read back in, the
pieces are read in parallel, and the full data set is
reassembled in memory.
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.3 Parallelism in Computers
1.1.3.4 Disk Parallelism
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
Performance Measures
Peak Performance
is the top speed at which the computer can operate.
It is a theoretical upper limit on the computer's performance.
Sustained Performance
is the highest consistently achieved speed.
It is a more realistic measure of computer performance.
Cost Performance
is used to determine if the computer is cost effective.
MHz
is a measure of the processor speed.
The processor speed is commonly measured in millions of cycles per
second, where a computer cycle is defined as the shortest time in which
some work can be done.
MIPS
is a measure of how quickly the computer can issue instructions.
Millions of instructions per second is abbreviated as MIPS, where the
instructions are computer instructions such as: memory reads and writes,
logical operations, floating point operations, integer operations, and
branch instructions.
Performance Measures
Mflops (Millions of floating point operations per second)
measures how quickly a computer can perform floating-point
operations such as add, subtract, multiply, and divide.
Speedup
measures the benefit of parallelism.
It shows how your program scales as you compute with more
processors, compared to the performance on one processor.
Ideal speedup happens when the performance gain is linearly
proportional to the number of processors used.
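Written as a formula (the standard definition), with T(1) the runtime on one processor and T(n) the runtime on n processors:

Speedup S(n) = T(1) / T(n)

Ideal (linear) speedup is S(n) = n.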
Benchmarks
are used to rate the performance of parallel computers and parallel
programs.
A well known benchmark that is used to compare parallel computers
is the Linpack benchmark.
Based on the Linpack results, a list is produced of the Top 500
Supercomputer Sites. This list is maintained by the University of
Tennessee and the University of Mannheim.
More Parallelism Issues
Load balancing
is the technique of evenly dividing the workload among the processors.
For data parallelism it involves how iterations of loops are allocated to
processors.
Load balancing is important because the total time for the program to
complete is the time spent by the longest executing thread.
The problem size
must be large and must be able to grow as you compute with more processors.
In order to get the performance you expect from a parallel computer you need
to run a large application with large data sizes, otherwise the overhead of
passing information between processors will dominate the calculation time.
Good software tools
are essential for users of high performance parallel computers.
These tools include:
parallel compilers
parallel debuggers
performance analysis tools
parallel math software
The availability of a broad set of application software is also important.
More Parallelism Issues
The high performance computing market is risky and chaotic.
Many supercomputer vendors are no longer in business, making
the portability of your application very important.
A workstation farm
is defined as a fast network connecting heterogeneous workstations.
The individual workstations serve as desktop systems for their
owners.
When they are idle, large problems can take advantage of the unused
cycles in the whole system.
An application of this concept is the SETI project. You can
participate in searching for extraterrestrial intelligence with your
home PC. More information about this project is available at the
SETI Institute.
Condor
is software that provides resource management services for applications that
run on heterogeneous collections of workstations.
Miron Livny at the University of Wisconsin at Madison is the director of the
Condor project, and has coined the phrase high throughput computing to
describe this process of harnessing idle workstation cycles. More
information is available at the Condor Home Page.
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.2 Comparison of Parallel Computers
1.2.1 Processors
1.2.2 Memory Organization
1.2.3 Flow of Control
1.2.4 Interconnection Networks
1.2.4.1 Bus Network
1.2.4.2 Cross-Bar Switch Network
1.2.4.3 Hypercube Network
1.2.4.4 Tree Network
1.2.4.5 Interconnection Networks Self-test
1.2.5 Summary of Parallel Computer Characteristics
1.3 Summary
Comparison of Parallel Computers
Now you can explore the hardware components of
parallel computers:
kinds of processors
types of memory organization
flow of control
interconnection networks
Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet Tree
Summary
This completes our introduction to parallel computing.
You have learned about parallelism in computer programs, and
also about parallelism in the hardware components of parallel
computers.
In addition, you have learned about the commonly used parallel
computers, and how these computers compare to each other.
There are many good texts which provide an introductory
treatment of parallel computing. Here are two useful references:
Computer Architecture: A Quantitative Approach, John Hennessy et al., Morgan Kaufmann Publishers, 2nd Edition, 1996
Computer Organization and Design: The Hardware/Software Interface, David A. Patterson et al., Morgan Kaufmann Publishers, 2nd Edition, 1997
!$OMP PARALLEL DO
do i=1,n
   … lots of computation ...
end do
!$OMP END PARALLEL DO
Data Parallelism by Hand
Compile with the mp compiler option.
f90 -mp ... prog.f
As before, the compiler generates conditional code that will run with
any number of threads.
If you want to rerun your program with a different number of threads,
you do not need to recompile, just re-specify the setenv command.
setenv OMP_NUM_THREADS 8
a.out > results2
An Example
Suppose you are computing on the
upper triangle of a 100 x 100
matrix, and you use 2 threads,
named t0 and t1. With default
scheduling, workloads are uneven.
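A minimal sketch of the pattern described above (the array contents are placeholders): column j of the upper triangle touches j elements, so with the default block schedule thread t0 gets columns 1:50 (1275 elements) while t1 gets columns 51:100 (3775 elements).

program triangle_demo
   implicit none
   integer :: i, j
   real :: a(100,100)
   a = 0.0
!$OMP PARALLEL DO PRIVATE(i)
   do j = 1, 100
      do i = 1, j             ! upper triangle: work grows with j
         a(i,j) = real(i+j)   ! placeholder computation
      end do
   end do
!$OMP END PARALLEL DO
   print *, a(1,100)
end program triangle_demo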
Loop Schedule Types
With static scheduling, by contrast, the columns of the
matrix are given to the threads in a round robin
fashion, resulting in better load balance.
Loop Schedule Types
Dynamic Schedule Type
The iterations are dynamically allocated to threads at runtime.
Each thread is given a chunk of iterations. When a thread
finishes its work, it goes into a critical section where it’s given
another chunk of iterations to work on.
This type is useful when you don’t know the iteration count or
work pattern ahead of time. Dynamic gives good load balance,
but at a high overhead cost.
Guided Schedule Type
The guided schedule type is dynamic scheduling that starts
with large chunks of iterations and ends with small chunks of
iterations. That is, the number of iterations given to each
thread depends on the number of iterations remaining. The
guided schedule type reduces the number of entries into the
critical section, compared to the dynamic schedule type.
Guided gives good load balancing at a low overhead cost.
Chunk Size
The word chunk refers to a grouping of iterations. Chunk size
means how many iterations are in the grouping. The static
and dynamic schedule types can be used with a chunk size. If
a chunk size is not specified, then the chunk size is 1.
Suppose you specify a chunk size of 2 with the static
schedule type. Then the 20 iterations are allocated to 4 threads
in round robin chunks of 2:
t0: iterations 1-2, 9-10, 17-18
t1: iterations 3-4, 11-12, 19-20
t2: iterations 5-6, 13-14
t3: iterations 7-8, 15-16
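In OpenMP the schedule type and chunk size are given with the SCHEDULE clause. A minimal sketch (the loop body is a placeholder for the iteration's work):

program schedule_demo
   implicit none
   integer :: i
   real :: v(20)
   ! alternatives: SCHEDULE(DYNAMIC,2) or SCHEDULE(GUIDED)
!$OMP PARALLEL DO SCHEDULE(STATIC,2)
   do i = 1, 20
      v(i) = real(i)*real(i)   ! placeholder for the iteration's work
   end do
!$OMP END PARALLEL DO
   print *, v(20)
end program schedule_demo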
real*4 tarray(2),time1,time2,timeres
… beginning of program
time1=etime(tarray)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=etime(tarray)
timeres=time2-time1
CPU Time
dtime
A section of code can also be timed using dtime.
It returns the elapsed CPU time in seconds since the last
call to dtime.
real*4 tarray(2),timeres
… beginning of program
timeres=dtime(tarray)
… start of section of code to be timed
… lots of computation …
… end of section of code to be timed
timeres=dtime(tarray)
… rest of program
CPU Time
The etime and dtime Functions
User time.
This is returned as the first element of tarray.
It’s the CPU time spent executing user code.
System time.
This is returned as the second element of tarray.
It’s the time spent executing system calls on behalf of your
program.
Sum of user and system time.
This is the function value that is returned.
It’s the time that is usually reported.
Metric.
Timings are reported in seconds.
Timings are accurate to 1/100th of a second.
CPU Time
Timing Comparison Warnings
For the SGI computers:
The etime and dtime functions return the MAX time over all
threads for a parallel program.
This is the time of the longest thread, which is usually the
master thread.
For the Linux Clusters:
The etime and dtime functions are contained in the VAX
compatibility library of the Intel FORTRAN Compiler.
To use this library include the compiler flag -Vaxlib.
#include <time.h>
static const double iCPS =
1.0/(double)CLOCKS_PER_SEC;
double time1, time2, timeres;
…
time1=(clock()*iCPS);
…
/* do some work */
…
time2=(clock()*iCPS);
timeres=time2-time1;
Wall clock Time
time
For the Origin, the function time returns the time since
00:00:00 GMT, Jan. 1, 1970.
It is a means of getting the elapsed wall clock time.
The wall clock time is reported in integer seconds.
integer*4 time
external time
integer*4 time1,time2,timeres
… beginning of program
time1=time( )
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=time( )
timeres=time2 - time1
Wall clock Time
f_time
For the Linux clusters, the appropriate FORTRAN function for
elapsed time is f_time.
integer*8 f_time
external f_time
integer*8 time1,time2,timeres
… beginning of program
time1=f_time()
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=f_time()
timeres=time2 - time1
As above for etime and dtime, the f_time function is in the VAX
compatibility library of the Intel FORTRAN Compiler. To use
this library include the compiler flag -Vaxlib.
Wall clock Time
gettimeofday
For C programmers, wallclock time can be obtained by using the
very portable routine gettimeofday.
#include <stddef.h> /* definition of NULL */
#include <sys/time.h> /* definition of timeval struct and
prototyping of gettimeofday */
double t1,t2,elapsed;
struct timeval tp;
int rtn;
....
....
rtn=gettimeofday(&tp, NULL);
t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
....
/* do some work */
....
rtn=gettimeofday(&tp, NULL);
t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
elapsed=t2-t1;
Timing an Executable
To time an executable (if using a csh or tcsh shell,
explicitly call /usr/bin/time)
Timing a Batch Job
Origin
busage jobid
Linux clusters
qstat jobid # for a running job
qhist jobid # for a completed job
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information
Profiling
Profiling determines where a program spends its time.
It detects the computationally intensive parts of the code.
File Summary:
100.0% /u/ncsa/gbauer/temp/md.f
Function Summary:
84.4% compute
15.6% dist
Line Summary:
67.3% /u/ncsa/gbauer/temp/md.f:106
13.6% /u/ncsa/gbauer/temp/md.f:104
9.3% /u/ncsa/gbauer/temp/md.f:166
2.5% /u/ncsa/gbauer/temp/md.f:165
1.5% /u/ncsa/gbauer/temp/md.f:102
1.2% /u/ncsa/gbauer/temp/md.f:164
0.9% /u/ncsa/gbauer/temp/md.f:107
0.8% /u/ncsa/gbauer/temp/md.f:169
0.8% /u/ncsa/gbauer/temp/md.f:162
0.8% /u/ncsa/gbauer/temp/md.f:105
The above listing (obtained using the -e option to cprof) displays not only the cycles
consumed by functions (a flat profile) but also the lines in the code that contribute to those
functions.
Profile Listings
Profile Listings on the Linux Clusters
vprof Listing (cont.)
0.7% /u/ncsa/gbauer/temp/md.f:149
0.5% /u/ncsa/gbauer/temp/md.f:163
0.2% /u/ncsa/gbauer/temp/md.f:109
0.1% /u/ncsa/gbauer/temp/md.f:100
…
…
100 0.1% do j=1,np
101 if (i .ne. j) then
102 1.5% call dist(nd,box,pos(1,i),pos(1,j),rij,d)
103 ! attribute half of the potential energy to particle 'j'
104 13.6% pot = pot + 0.5*v(d)
105 0.8% do k=1,nd
106 67.3% f(k,i) = f(k,i) - rij(k)*dv(d)/d
107 0.9% enddo
108 endif
109 0.2% enddo
Profiling Analysis
The program being analyzed in the previous Origin example has
approximately 10000 source code lines, and consists of many
subroutines.
The first profile listing shows that over 50% of the computation is
done inside the VSUB subroutine.
The second profile listing shows that line 8106 in subroutine VSUB
accounted for 50% of the total computation.
Going back to the source code, line 8106 is a line inside a do loop.
Putting an OpenMP compiler directive in front of that do loop you can
get 50% of the program to run in parallel with almost no work on your
part.
Since the compiler has rearranged the source lines, the line numbers
given by ssrun/prof give you an area of the code to inspect.
To view the rearranged source, use the options:
f90 … -FLIST:=ON
cc … -CLIST:=ON
For the Intel compilers, the appropriate options are
ifort … -E …
icc … -E …
Further Information
SGI IRIX
man etime
man 3 time
man 1 time
man busage
man timers
man ssrun
man prof
Origin2000 Performance Tuning and Optimization Guide
Linux Clusters
man 3 clock
man 2 gettimeofday
man 1 time
man 1 gprof
man 1B qstat
Intel Compilers Vprof on NCSA Linux Cluster
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.1.1 Memory Hierarchy
7.1.2 Cache Mapping
7.1.3 Cache Thrashing
7.1.4 Cache Coherence
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache Concepts
The CPU time required to perform an operation is the sum
of the clock cycles executing instructions and the clock
cycles waiting for memory.
The CPU cannot be performing useful work if it is waiting
for data to arrive from memory.
Clearly then, the memory system is a major factor in
determining the performance of your program, and a large
part of that is your use of the cache.
The following sections will discuss the key concepts of
cache including:
Memory subsystem hierarchy
Cache mapping
Cache thrashing
Cache coherence
Memory Hierarchy
The different subsystems in the memory hierarchy have
different speeds, sizes, and costs.
Smaller memory is faster
Slower memory is cheaper
The hierarchy is set up so that the fastest memory is closest
to the CPU, and the slower memories are further away from
the CPU.
Memory Hierarchy
It's a hierarchy because every level is a subset of a level further away.
All data in one level is found in the level below.
The purpose of cache is to improve the memory access time to the processor.
There is an overhead associated with it, but the benefits outweigh the cost.
Registers
Registers are the sources and destinations of CPU data operations.
They hold one data element each and are 32 bits or 64 bits wide.
They are on-chip and built from SRAM.
Computers usually have 32 or 64 registers.
The Origin MIPS R10000 has 64 physical 64-bit registers of which 32
are available for floating-point operations.
The Intel IA64 has 328 registers for general-purpose (64 bit), floating-
point (80 bit), predicate (1 bit), branch and other functions.
Register access speeds are comparable to processor speeds.
Memory Hierarchy
Main Memory Improvements
A hardware improvement called interleaving reduces main memory
access time.
In interleaving, memory is divided into partitions or segments called
memory banks.
Consecutive data elements are spread across the banks.
Each bank supplies one data element per bank cycle.
Multiple data elements are read in parallel, one from each bank.
The problem with interleaving is that the memory interleaving
improvement assumes that memory is accessed sequentially.
If there is 2-way memory interleaving, but the code accesses every
other location, there is no benefit.
The bank cycle time is 4-8 times the CPU clock cycle time so the main
memory can’t keep up with the fast CPU and keep it busy with data.
Large main memory with a cycle time comparable to the processor is
not affordable.
Memory Hierarchy
Principle of Locality
The way your program operates follows the Principle of Locality.
Temporal Locality: When an item is referenced, it will be referenced again soon.
Spatial Locality: When an item is referenced, items whose addresses are
nearby will tend to be referenced soon.
Cache Line
The overhead of the cache can be reduced by fetching a chunk or block
of data elements.
When a main memory access is made, a cache line of data is brought
into the cache instead of a single data element.
A cache line is defined in terms of a number of bytes.
For example, a cache line is typically 32 or 128 bytes.
This takes advantage of spatial locality.
The additional elements in the cache line will most likely be needed
soon.
The cache miss rate falls as the size of the cache line increases, but there
is a point of negative returns on the cache line size.
When the cache line size becomes too large, the transfer time increases.
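A minimal Fortran sketch of these ideas (the array size is illustrative): Fortran stores arrays column-major, so making the first index the innermost loop walks memory with stride 1 and uses every element of each fetched cache line.

program locality
   implicit none
   integer :: i, j
   real :: a(1000,1000)
   ! stride-1 access: consecutive iterations touch consecutive
   ! memory locations, exploiting spatial locality and cache lines
   do j = 1, 1000
      do i = 1, 1000
         a(i,j) = real(i+j)
      end do
   end do
   print *, a(1,1)
end program locality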
Memory Hierarchy
Cache Hit
A cache hit occurs when the data element requested by the
processor is in the cache.
You want to maximize hits.
The Cache Hit Rate is defined as the fraction of cache hits.
It is the fraction of the requested data that is found in the cache.
Cache Miss
A cache miss occurs when the data element requested by the
processor is NOT in the cache.
You want to minimize cache misses. Cache Miss Rate is defined as
1.0 - Hit Rate
Cache Miss Penalty, or miss time, is the time needed to retrieve the
data from a lower level (downstream) of the memory hierarchy.
(Recall that the lower levels of the hierarchy have a slower access
time.)
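These quantities combine in the standard formula for average memory access time:

Average access time = Hit time + Miss rate x Miss penalty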
Memory Hierarchy
Levels of Cache
It used to be that there were two levels of cache: on-chip and
off-chip.
L1/L2 is still true for the Origin MIPS and the Intel IA-32 processors.
Caches closer to the CPU are called Upstream. Caches further from the CPU are
called Downstream.
The on-chip cache is called First level, L1, or primary cache.
An on-chip cache performs the fastest but the computer designer makes a trade-off
between die size and cache size. Hence, on-chip cache has a small size. When the on-
chip cache has a cache miss the time to access the slower main memory is very large.
The off-chip cache is called Second Level, L2, or secondary cache.
A cache miss is very costly. To solve this problem, computer designers have
implemented a larger, slower off-chip cache. This chip speeds up the on-chip cache
miss time. L1 cache misses are handled quickly. L2 cache misses have a larger
performance penalty.
The cache external to the chip is called Third Level, L3.
The newer Intel IA-64 processor has 3 levels of cache.
Memory Hierarchy
Split or Unified Cache
In unified cache, typically L2, the cache is a combined
instruction-data cache.
A disadvantage of a unified cache is that when data accesses and
instruction accesses conflict with each other, the cache may be
thrashed, i.e., a high cache miss rate results.
In split cache, typically L1, the cache is split into 2 parts:
one for the instructions, called the instruction cache
another for the data, called the data cache.
The 2 caches are independent of each other, and they can have independent
properties.
Memory Hierarchy Sizes
Memory hierarchy sizes are specified in the following units:
Cache Line: bytes
L1 Cache: Kbytes
L2 Cache: Mbytes
Main Memory: Gbytes
Cache Mapping
Cache mapping determines which cache location should be
used to store a copy of a data element from main memory.
There are 3 mapping strategies:
Direct mapped cache
Set associative cache
Fully associative cache
Direct Mapped Cache
In direct mapped cache, a line of main memory is mapped to
only a single line of cache.
Consequently, a particular cache line can be filled from (size of
main memory / size of cache) different lines of main
memory.
Direct mapped cache is inexpensive but also inefficient and
very susceptible to cache thrashing.
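As a worked example (the cache and line sizes here are illustrative): a direct mapped cache places a memory line at

index = (address / line size) mod (cache size / line size)

so in a 4 MB cache with 128-byte lines there are 32768 line slots, and any two addresses that are a multiple of 4 MB apart map to the same slot and evict each other.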
Cache Mapping
Direct Mapped Cache
http://larc.ee.nthu.edu.tw/~cthuang/courses/ee3450/lectures/07_memory.html
Cache Mapping
Fully Associative Cache
For fully associative cache, any line of cache can be loaded with any
line from main memory.
This technology is very fast but also very expensive.
Cache Mapping
Set Associative Cache
For N-way set associative cache, you can think of cache as being divided
into N sets (usually N is 2 or 4).
A line from main memory can then be written to its cache line in any of the
N sets.
This is a trade-off between direct mapped and fully associative cache.
Cache Mapping
Cache Block Replacement
With direct mapped cache, a cache line can only be mapped to one
unique place in the cache. The new cache line replaces the cache block at
that address. With set associative cache there is a choice of 3 strategies:
1. Random
There is a uniform random replacement within the set of cache blocks. The
advantage of random replacement is that it’s simple and inexpensive to
implement.
2. LRU (Least Recently Used)
The block that gets replaced is the one that hasn’t been used for the longest time.
The principle of temporal locality tells us that recently used data blocks are likely
to be used again soon. An advantage of LRU is that it preserves temporal locality.
A disadvantage of LRU is that it’s expensive to keep track of cache access
patterns. In empirical studies, there was little performance difference between
LRU and Random.
3. FIFO (First In First Out)
Replace the block that was brought in N accesses ago, regardless of the usage
pattern. In empirical studies, Random replacement generally outperformed FIFO.
Cache Thrashing
Cache thrashing is a problem that happens when a frequently
used cache line gets displaced by another frequently used
cache line.
Cache thrashing can happen for both instruction and data caches.
The CPU can’t find the data element it wants in the cache and
must make another main memory cache line access.
The same data elements are repeatedly fetched into and
displaced from the cache.
Cache thrashing happens when the code references more variables
and arrays than the cache can hold, so the needed data elements
do not fit in cache.
Cache lines are discarded and later retrieved.
Typical causes: the arrays are dimensioned too large to fit in cache,
or the arrays are accessed with indirect addressing, e.g. a(k(j)).
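A minimal sketch of a pattern that can thrash a direct mapped cache (the array sizes are assumptions for illustration): if a and b happen to be laid out a multiple of the cache size apart, a(i) and b(i) map to the same cache line and repeatedly evict each other.

program thrash
   implicit none
   integer, parameter :: n = 1048576   ! power-of-two size: risky layout
   real :: a(n), b(n)
   real :: s
   integer :: i
   a = 1.0
   b = 2.0
   s = 0.0
   do i = 1, n
      ! if a(i) and b(i) map to the same cache line slot, each access
      ! evicts the line the other just loaded
      s = s + a(i)*b(i)
   end do
   print *, s
end program thrash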
Cache Coherence
Cache coherence
is maintained by an agreement between data stored in
cache, other caches, and main memory.
When the same data is being manipulated by different
processors, they must inform each other of their
modification of data.
The term Protocol is used to describe how caches and
main memory communicate with each other.
It is the means by which all the memory subsystems
maintain data coherence.
Cache Coherence
Snoop Protocol
All processors monitor the bus traffic to determine cache
line status.
Directory Based Protocol
Cache lines contain extra bits that indicate which other
processor has a copy of that cache line, and the status of
the cache line – clean (cache line does not need to be sent
back to main memory) or dirty (cache line needs to
update main memory with content of cache line).
Hardware Cache Coherence
Cache coherence on the Origin computer is maintained
in the hardware, transparent to the programmer.
Cache Coherence
False sharing
happens in a multiprocessor system as a result of
maintaining cache coherence.
Both processor A and processor B have the same cache
line.
A modifies the first word of the cache line.
B wants to modify the eighth word of the cache line.
But A has sent a signal to B that B’s cache line is invalid.
B must fetch the cache line again before writing to it.
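A minimal OpenMP sketch of this pattern (the array size and thread count are assumptions for illustration): each thread updates only its own element, yet the elements are neighbors in one cache line, so the line bounces between the processors.

program false_sharing
   use omp_lib
   implicit none
   real :: partial(8)   ! 8 contiguous words: likely one cache line
   integer :: i, t
   partial = 0.0
   ! assumes OMP_NUM_THREADS <= 8
!$OMP PARALLEL PRIVATE(t, i)
   t = omp_get_thread_num() + 1
   do i = 1, 1000000
      ! each thread writes a different word of the same cache line,
      ! forcing coherence traffic even though no value is shared
      partial(t) = partial(t) + 1.0
   end do
!$OMP END PARALLEL
   print *, sum(partial)
end program false_sharing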
Cache Coherence
A cache miss creates a processor stall.
The processor is stalled until the data is retrieved from
the memory.
The stall is minimized by continuing to load and execute
instructions, until the data that is stalling is retrieved.
These techniques are called:
Prefetching
Out of order execution
Software pipelining
Typically, the compiler will do these at -O3
optimization.
Cache Coherence
The following is an example of software pipelining:
Suppose you compute
Do I=1,N
y(I)=y(I) + a*x(I)
End Do
In pseudo-assembly language, this is what the Origin compiler will
do:
cycle t+0 ld y(I+3)
cycle t+1 ld x(I+3)
cycle t+2 st y(I-4)
cycle t+3 st y(I-3)
cycle t+4 st y(I-2)
cycle t+5 st y(I-1)
cycle t+6 ld y(I+4)
cycle t+7 ld x(I+4)
cycle t+8 ld y(I+5) madd I
cycle t+9 ld x(I+5) madd I+1
cycle t+10 ld y(I+6) madd I+2
cycle t+11 ld x(I+6) madd I+3
Cache Coherence
Since the Origin processor can only execute 1 load or
1 store at a time, the compiler places loads in the
instruction pipeline well before the data is needed.
It is then able to continue loading while
simultaneously performing a fused multiply-add
(a+b*c).
The code above gets 8 flops in 12 clock cycles.
The peak is 24 flops in 12 clock cycles for the Origin.
The Intel Pentium III (IA-32) and the Itanium (IA-64)
will have differing versions of the code above but the
same concepts apply.
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.2.1 Cache on the SGI Origin2000
7.2.2 Cache on the Intel Pentium III
7.2.3 Cache on the Intel Itanium
7.2.4 Cache Summary
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache on the SGI Origin2000
L1 Cache (on-chip primary cache)
Cache size: 32KB floating point data
32KB integer data and instruction
Cache line size: 32 bytes
Associativity: 2-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 4MB per processor
Cache line size: 128 bytes
Associativity: 2-way set associative
Replacement: LRU
Coherence: Directory based 2-way interleaved (2 banks)
Cache on the SGI Origin2000
Bandwidth L1 cache-to-processor
1.6 GB/s/bank
3.2 GB/sec overall possible
Latency: 1 cycle
Bandwidth between L1 and L2 cache
1GB/s
Latency: 11 cycles
Bandwidth between L2 cache and local memory
0.5 GB/s
Latency: 61 cycles
Average 32 processor remote memory
Latency: 150 cycles
Cache on the Intel Pentium III
L1 Cache (on-chip primary cache)
Cache size: 16KB floating point data
16KB integer data and instruction
Cache line size: 16 bytes
Associativity: 4-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 256 KB per processor
Cache line size: 32 bytes
Associativity: 8-way set associative
Replacement: pseudo-LRU
Coherence: interleaved (8 banks)
Cache on the Intel Pentium III
Bandwidth L1 cache-to-processor
16 GB/s
Latency: 2 cycles
Bandwidth between L1 and L2 cache
11.7 GB/s
Latency: 4-10 cycles
Bandwidth between L2 cache and local memory
1.0 GB/s
Latency: 15-21 cycles
Cache on the Intel Itanium
L1 Cache (on-chip primary cache)
Cache size: 16KB floating point data
16KB integer data and instruction
Cache line size: 32 bytes
Associativity: 4-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 96KB unified data and instruction
Cache line size: 64 bytes
Associativity: 6-way set associative
Replacement: LRU
L3 Cache (off-chip tertiary cache)
Cache size: 4MB per processor
Cache line size: 64 bytes
Associativity: 4-way set associative
Replacement: LRU
Cache on the Intel Itanium
Bandwidth L1 cache-to-processor
25.6 GB/s
Latency: 1 - 2 cycles
Bandwidth between L1 and L2 cache
25.6 GB/sec
Latency: 6 - 9 cycles
Bandwidth between L2 and L3 cache
11.7 GB/sec
Latency: 21 - 24 cycles
Bandwidth between L3 cache and main memory
2.1 GB/sec
Latency: 50 cycles
Cache Summary
Chip            MIPS R10000    Pentium III     Itanium
#Caches         2              2               3
Associativity   2/2            4/8             4/6/4
Replacement     LRU            Pseudo-LRU      LRU
CPU MHz         195/250        1000            800
Peak Mflops     390/500        1000            3200
LD,ST/cycle     1 LD or 1 ST   1 LD and 1 ST   2 LD or 2 ST
Only one load or store may be performed each CPU cycle on the
R10000.
This indicates that loads and stores may be a bottleneck.
Efficient use of cache is extremely important.
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.4.1 Measuring Cache Performance on the SGI Origin2000
7.4.2 Measuring Cache Performance on the Linux Clusters
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Code Optimization
Gather statistics to find out where the bottlenecks are in your
code so you can identify what you need to optimize.
The following questions can be useful to ask:
How much time does the program take to execute?
Use /usr/bin/time a.out for CPU time
Which subroutines use the most time?
Use ssrun and prof on the Origin or gprof and vprof on the Linux clusters.
Which loop uses the most time?
Put etime/dtime or other recommended timer calls around loops for CPU
time.
For more information on timers see Timing and Profiling section.
What is contributing to the cpu time?
Use the Perfex utility on the Origin or perfex or hpmcount on the Linux
clusters.
Code Optimization
Some useful optimizing and profiling tools are
etime/dtime/time
perfex
ssusage
ssrun/prof
gprof
cvpav, cvd
See the NCSA web pages on Compiler, Performance, and
Productivity Tools
http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Too
ls/ for information on which tools are available on NCSA
platforms.
Measuring Cache Performance on the
SGI Origin2000
The R10000 processors of NCSA’s Origin2000
computers have hardware performance counters.
There are 32 events that are measured and each event is
numbered.
0 = cycles
1 = Instructions issued
...
26 = Secondary data cache misses
...
View man perfex for more information.
The Perfex Utility
The hardware performance counters can be measured
using the perfex utility.
perfex [options] command [arguments]
Measuring Cache Performance on the
SGI Origin2000
where the options are:
-e counter1 -e counter2
This specifies which events are to be counted. You enter the
number of the event you want counted. (Remember to have a
space in between the "e" and the event number.)
-a
sample ALL the events
-mp
Report all results on a per thread basis.
-y
Report the results in seconds, not cycles.
-x
Gives extra summary info including Mflops.
command
Specify the name of the executable file.
arguments
Specify the input and output arguments to the executable file.
Measuring Cache Performance on the
SGI Origin2000
Examples
perfex -e 25 -e 26 a.out
- outputs the L1 and L2 cache misses
- the output is reported in cycles
perfex -a -y a.out > results
- outputs ALL the hardware performance counters
- the output is reported in seconds
Measuring Cache Performance on the
Linux Clusters
The Intel Pentium III and Itanium processors
provide hardware event counters that can be
accessed from several tools.
Instruction Set
The instruction set (ISA) on the IBM p690 is the PowerPC
AS Instruction set.
Cache Architecture
Each Power4 chip has a primary (L1) cache associated with each of its
two processors and a secondary (L2) cache shared between the two processors.
In addition, each Multi-Chip Module has an L3 cache.
Level 1 Cache
The Level 1 cache is in the processor core. It has split instruction and data
caches.
L1 Instruction Cache
The properties of the Instruction Cache are:
64KB in size
direct mapped
cache line size is 128 bytes
L1 Data Cache
The properties of the L1 Data Cache are:
32KB in size
2-way set associative
FIFO replacement policy
2-way interleaved
cache line size is 128 bytes
Peak speed is achieved when the data accessed in a loop is entirely contained
in the L1 data cache.
Cache Architecture
Level 2 Cache on the Power4 Chip
When the processor can't find a data element in the
L1 cache, it looks in the L2 cache. The properties of
the L2 Cache are:
external from the processor
unified instruction and data cache
1.41MB per Power4 chip (2 processors)
8-way set associative
split between 3 controllers
cache line size is 128 bytes
pseudo LRU replacement policy for cache coherence
124.8 GB/s peak bandwidth from L2
Cache Architecture
Level 3 Cache on the Multi-Chip Module
When the processor can't find a data element in the
L2 cache, it looks in the L3 cache. The properties of
the L3 Cache are:
external from the Power4 Core
unified instruction and data cache
128MB per Multi-Chip Module (8 processors)
8-way set associative
cache line size is 512 bytes
55.5 GB/s peak bandwidth from L3
Memory Subsystem
The total memory is physically distributed among
the Multi-Chip Modules of the p690 system (see
the diagram in the next slide).
Memory Latencies
The latency penalties for each of the levels of
the memory hierarchy are:
L1 Cache - 4 cycles
L2 Cache - 14 cycles
L3 Cache - 102 cycles
Main Memory - 400 cycles
[Diagram: Memory distribution within an MCM]
Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
Features Performed by the Hardware
The following is done completely by the hardware,
transparent to the user:
Global memory addressing (makes the system memory
shared)
Address resolution
Maintaining cache coherency
Automatic page migration from remote to local memory
(to reduce interconnect memory transactions)
The Operating System
The operating system is AIX. NCSA's p690 system is
currently running version 5.1 of AIX. Version 5.1 is a
full 64-bit file system.
Compatibility
AIX 5.1 is highly compatible with both BSD and System V
Unix.
Further Information
Computer Architecture: A Quantitative Approach
John Hennessy et al., Morgan Kaufmann Publishers, 2nd
Edition, 1996
Computer Organization and Design: The
Hardware/Software Interface
David A. Patterson et al., Morgan Kaufmann Publishers,
2nd Edition, 1997
IBM P Series [595] at the URL:
http://www-03.ibm.com/systems/p/hardware/highend/590/index.html