
Introduction to Parallel Processing
Dr. Guy Tel-Zur
Lecture 10
Agenda
• Administration
• Final presentations
• Demos
• Theory
• Next week plan
• Home assignment #4 (last)
Final Projects
• Next Sunday: Groups 1-16 will present
• Next Monday: Groups 17+ will present
• 10 minutes presentation per group
• All group members should present
• Send your presentation to gtelzur@gmail.com by midnight of the previous day

Attendance is mandatory

Final Presentations
• The division into groups is fixed
• A group that does not present will lose 5 points from its grade
• Rehearse your talk and make sure you stay within the time limit
• The presentation should include: the project name, its goal, the parallel-computing challenge it poses, and the approaches to solving it
• Presentations will not be accepted during class! Make sure to send them to the lecturer ahead of time
The Course Roadmap
[Roadmap diagram: Introduction branches into HPC and HTC. HPC covers GPU Computing (new!), Message Passing (MPI), and Shared Memory (Cilk++, OpenMP); HTC covers Condor. Grid Computing and Cloud Computing are marked "Today".]
Advanced Parallel Computing and
Distributed Computing course
• A new course at the department, Distributed Computing: Advanced Parallel Processing + Grid Computing + Cloud Computing
• Course number: 361-1-4691
• If you are interested in this course, please send me an email
Today
• Algorithms – Numerical Algorithms
(“slides11.ppt”)
• Introduction to Grid Computing
• Some demos
• Home assignment #4
Futuristic Asymmetric Multi-Core Chip

SACC (Sequential Accelerator)


Theory
• Numerical Algorithms
– Slides from:
UNIVERSITY OF NORTH CAROLINA AT CHARLOTTE
Department of Computer Science
ITCS 4145/5145 Parallel Programming
Spring 2009
Dr. Barry Wilkinson
Matrix multiplication, solving a system of linear
equations, iterative methods
Demos

• Hybrid Parallel Programming: MPI + OpenMP
• Cloud Computing
  – Setting up an HPC cluster
  – Setting up a Condor machine
• StarHPC
• Cilk++
• GPU Computing (a separate presentation)
• Eclipse PTP
• Kepler workflow (a separate presentation)
Hybrid MPI + OpenMP Demo
Machine file (each hobbit node has 8 cores):
hobbit1
hobbit2
hobbit3
hobbit4

Compile the hybrid MPI + OpenMP code:

mpicc -o mpi_out mpi_test.c -fopenmp

An idea for a final project!!!

cd ~/mpi; program name: hybridpi.c

MPI is not installed yet on the hobbits; in the meanwhile use:
vdwarf5
vdwarf6
vdwarf7
vdwarf8

top -u tel-zur -H -d 0.05

(-H: show threads, -d: refresh delay, -u: user)

Hybrid MPI+OpenMP continued
Hybrid Pi (MPI+OpenMP)
#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#define NBIN 100000
#define MAX_THREADS 8

int main(int argc,char **argv) {


int nbin,myid,nproc,nthreads,tid;
double step,sum[MAX_THREADS]={0.0},pi=0.0,pig;

MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Comm_size(MPI_COMM_WORLD,&nproc);
nbin = NBIN/nproc;        /* bins handled by each MPI rank */
step = 1.0/(nbin*nproc);  /* width of one integration bin */
#pragma omp parallel private(tid)
{
int i;
double x;
nthreads = omp_get_num_threads();
tid = omp_get_thread_num();
/* each thread handles a strided subset of this rank's bins */
for (i=nbin*myid+tid; i<nbin*(myid+1); i+=nthreads) {
x = (i+0.5)*step;
sum[tid] += 4.0/(1.0+x*x);   /* midpoint-rule integrand 4/(1+x^2) */
}
printf("rank tid sum = %d %d %e\n",myid,tid,sum[tid]);
}
/* sum the per-thread partial sums on this rank */
for(tid=0; tid<nthreads; tid++)
pi += sum[tid]*step;

MPI_Allreduce(&pi,&pig,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
if (myid==0) printf("PI = %f\n",pig);

MPI_Finalize();
return 0;
}
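A hedged sketch of how one might launch the hybrid run on the four nodes above (exact flags vary between MPI implementations, and the machine-file name "machines" is hypothetical):

export OMP_NUM_THREADS=8      # one OpenMP thread per core on each node
mpirun -np 4 -machinefile machines ./mpi_out

Each of the 4 MPI ranks then opens an 8-thread OpenMP team, so top -H on each node should show that rank's 8 threads.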
Cilk++

Simple, powerful expression of task parallelism:

cilk_for – Parallelize for loops
cilk_spawn – Specify the start of parallel execution
cilk_sync – Specify the end of parallel execution

http://software.intel.com/en-us/articles/intel-cilk-plus/ (17/8/2011)
Fibonacci
Try:
http://www.wolframalpha.com/input/?i=fibonacci+number
Fibonacci Numbers
serial version

// 1, 1, 2, 3, 5, 8, 13, 21, 34, ...


// Serial version
// Credit: http://myxman.org/dp/node/182

long fib_serial(long n) {

if (n < 2) return n;

return fib_serial(n-1) + fib_serial(n-2);

}
Cilk++ Fibonacci
#include <cilk.h>
#include <stdio.h>

long fib_parallel(long n)
{
long x, y;
if (n < 2) return n;
x = cilk_spawn fib_parallel(n-1);
y = fib_parallel(n-2);
cilk_sync;
return (x+y);
}

int cilk_main()
{
int N=50;
long result;
result = fib_parallel(N);
printf("fib of %d is %d\n",N,result);
return 0;
}
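A usage sketch, assuming the Intel Cilk++ SDK toolchain (the source file name is hypothetical):

cilk++ fib.cilk -o fib
./fib

Note that this naive algorithm does exponential work, so N=50 will run for a very long time even in parallel; a smaller argument such as fib(30), as in the Cilkview slide below, is more practical.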
cilk_spawn

ADD PARALLELISM USING CILK_SPAWN

We are now ready to introduce parallelism into our qsort program.
The cilk_spawn keyword indicates that a function (the child) may be
executed in parallel with the code that follows the cilk_spawn
statement (the parent). Note that the keyword allows but does not
require parallel operation. The Cilk++ scheduler will dynamically
determine what actually gets executed in parallel when multiple
processors are available. The cilk_sync statement indicates that the
function may not continue until all cilk_spawn requests in the same
function have completed. cilk_sync does not affect parallel strands
spawned in other functions.
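As a minimal sketch of that qsort spawn, in the spirit of the Cilk++ SDK tutorial (the helper name sample_qsort and the last-element pivot choice are illustrative assumptions, not necessarily the exact demo program):

#include <algorithm>   // std::partition, std::swap
#include <functional>  // std::bind2nd, std::less
#include <cilk.h>

// Illustrative parallel quicksort skeleton (not the exact demo code)
void sample_qsort(int* begin, int* end) {
    if (begin != end) {
        --end;  // exclude the last element (the pivot) from partitioning
        int* middle = std::partition(begin, end,
                          std::bind2nd(std::less<int>(), *end));
        std::swap(*end, *middle);                // move pivot into place
        cilk_spawn sample_qsort(begin, middle);  // child: left part, possibly in parallel
        sample_qsort(++middle, ++end);           // parent continues with right part
        cilk_sync;                               // wait for the spawned child
    }
}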
Cilkview: fib(30)
Strands and Knots
A Cilk++ program fragment:

...
do_stuff_1();        // execute strand 1
cilk_spawn func_3(); // spawn strand 3 at knot A
do_stuff_2();        // execute strand 2
cilk_sync;           // sync at knot B
do_stuff_4();        // execute strand 4
...

[DAG figure: two spawns (labeled A and B) and one sync (labeled C)]
A more complex Cilk++ program (DAG):

Let's add labels to the strands to indicate the number of milliseconds it takes to execute each strand.

In ideal circumstances (e.g., no scheduling overhead) and with an unlimited number of processors available, this program should run for 68 milliseconds.
Work and Span
Work
The total amount of processor time required to complete the program is the sum of all
the numbers. We call this the work.
In this DAG, the work is 181 milliseconds for the 25 strands shown, and if the program is
run on a single processor, the program should run for 181 milliseconds.
Span
Another useful concept is the span, sometimes called the critical path length. The span is
the most expensive path that goes from the beginning to the end of the program. In this
DAG, the span is 68 milliseconds, as shown below:
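These two numbers bound the achievable speedup. Using the values from this DAG:

    parallelism = work / span = 181 ms / 68 ms ≈ 2.66

so even with unlimited processors, this program can run at most about 2.66 times faster than its one-processor execution.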
divide-and-conquer strategy
cilk_for
Shown here: 8 threads and 8 iterations

Here is the DAG for a serial loop that spawns each iteration. In this case, the
work is not well balanced, because each child does the work of only one
iteration before incurring the scheduling overhead inherent in entering a sync.
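In contrast, cilk_for splits the iteration range recursively, following the divide-and-conquer strategy above. A minimal sketch (do_work is a hypothetical stand-in for the loop body):

#include <cilk.h>

void do_work(int i);   // hypothetical per-iteration work

void run_loop() {
    // cilk_for recursively halves the range [0,8), producing a
    // balanced spawn tree instead of a serial chain of spawns
    cilk_for (int i = 0; i < 8; ++i) {
        do_work(i);
    }
}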
Race conditions
Check the “qsort-race” program with cilkscreen:
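For reference, a minimal sketch of the kind of race cilkscreen reports (a toy example, not the qsort-race program itself):

#include <cilk.h>
#include <stdio.h>

int cilk_main() {
    int sum = 0;
    // DATA RACE: parallel iterations all update the shared variable
    // sum without synchronization; cilkscreen should flag this access
    cilk_for (int i = 0; i < 1000; ++i) {
        sum += i;
    }
    printf("sum = %d\n", sum);
    return 0;
}

Running the compiled program under cilkscreen (e.g., cilkscreen ./a.out) should report the racing reads and writes on sum.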
StarHPC on the Cloud

Will be ready for PP201X?

Eclipse PTP
Parallel Tools Platform
http://www.eclipse.org/ptp/

Will be ready for PP201X?
Recursion in OpenMP

long fib_parallel(long n) {
    long x, y;
    if (n < 2) return n;
    #pragma omp task default(none) shared(x,n)
    {
        x = fib_parallel(n-1);
    }
    y = fib_parallel(n-2);
    #pragma omp taskwait
    return (x+y);
}

The task pragma can be useful for parallelizing irregular algorithms, such as recursive algorithms, for which other OpenMP workshare constructs are inadequate.

#pragma omp parallel
#pragma omp single
{
    r = fib_parallel(n);
}

Use the taskwait pragma to wait for the child tasks generated by the current task to complete.

Reference: http://myxman.org/dp/node/182
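Putting the two fragments together, a minimal self-contained sketch (the main function, the choice N=30, and the comments are my additions; compile with, e.g., gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

long fib_parallel(long n) {
    long x, y;
    if (n < 2) return n;
    #pragma omp task default(none) shared(x,n)
    x = fib_parallel(n-1);      /* child task computes fib(n-1) */
    y = fib_parallel(n-2);      /* parent continues with fib(n-2) */
    #pragma omp taskwait        /* wait for the child before using x */
    return (x + y);
}

int main(void) {
    long r;
    #pragma omp parallel
    #pragma omp single          /* one thread creates the root task tree */
    {
        r = fib_parallel(30);
    }
    printf("fib(30) = %ld\n", r);
    return 0;
}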
Intel® Parallel Studio
• Use Parallel Composer
to create and compile a parallel application
• Use Parallel Inspector
to improve reliability by finding memory and
threading errors
• Use Parallel Amplifier
to improve parallel performance by tuning
threaded code
Intel® Parallel Studio
Parallel Studio adds new features to Visual Studio
Intel’s Parallel Amplifier –
Execution Bottlenecks
Intel’s Parallel Inspector –
Threading Errors
Intel’s Parallel Inspector –
Threading Errors
Error – Data Race
Intel Parallel Studio - Composer
The installation of this part failed for me, probably because I didn't install Intel's C++ compiler first.
Sorry, I can't give a demo here…
