PARC++: A Parallel C++
KAI TÖDTER AND CARSTEN HAMMER
Siemens AG, ZFE T SE 3, 81739 Munich, Otto-Hahn-Ring 6, Germany (email:
{kai.toedter, hammer}@zfe.siemens.de)
AND
WERNER STRUCKMANN
Institute of Programming Languages and Information Systems, Technical University
Braunschweig, P.O. Box 3329, Germany (email: struck@ips.cs.tu-bs.de)
SUMMARY
PARC++ is a system that supports object-oriented parallel programming in C++. PARC++ provides
the user with a set of predefined C++ classes that can easily be used for the construction of parallel
C++ programs. With the help of PARC++ objects, the programmer is able to create and start new
processes (threads), to synchronize their activities (Blocklock, Monitor) and to manage communication
via message passing (Mailbox). PARC++ is written in C++ and currently runs on top of the EMEX
operating system on a FORCE machine with 11 processing elements and an EDS (European Declarative
System) machine with 28 processing elements. The paper also contains information about the run-time system
model, the implementation and some performance measurements.
KEY WORDS: C++; object-oriented programming; parallel computing; European Declarative System (EDS)
INTRODUCTION
The objective of this project is the design of a general-purpose C++ class library
to provide the user with a programming system for writing object-oriented parallel
programs. The primary design objectives were:
1. Portability. PARC++ should be portable among computer architectures that
provide shared or virtually shared memory.
2. User-friendliness. The constructs should be user-friendly and easy to understand.
3. Broad range of MIMD architectures. PARC++ should provide message passing
as well as shared memory constructs (the target architectures include shared
memory machines as well as distributed memory machines with virtually shared
memory).
4. Compiler independence. The constructs should be implementable in pure C++
without changes to the compiler.
We chose C++ because we wanted a widely used programming language that
provides data abstraction and object-oriented programming. In addition, C++ is
relatively easy to learn for C programmers. For further understanding of this article,
basic knowledge of C++ is useful. Some concepts have been taken from PRESTO.3
system. But the user can explicitly set the target node where a thread should run
(int set_target(int node)).
There are further member functions to handle thread objects, for example, to
get the processing element on which a thread was started or is running, the level in
the process tree, the parent process etc.
class ProdCon {
    buffer *buff;
public:
    ProdCon()
    {
        buff = new buffer(); // new buffer monitor
    }
    void producer() {...} // produces a character and puts it in the buffer
    void consumer() {...} // consumes a character from the buffer
};

main()
{
    ProdCon *e = new ProdCon();   // new producer/consumer
    Thread *p = new Thread();     // creates two threads
    Thread *c = new Thread();
    p->start(e, e->producer);     // starts producer and consumer
    c->start(e, e->consumer);
}
Figure 1. Visit
state of a thread (busy, terminated, waiting for mail, waiting for a lock, holding a
lock etc.).
Furthermore, there are displays for message queues (mailboxes), thread queues
(locks), bar charts for processor utilization and sequence charts for the time stamps
of the trace events.
The second trace format that PARC++ provides can be visualized with Para-
Graph,6 a visualization tool developed by M. T. Heath and J. A. Etheridge at Oak
Ridge National Laboratory. Whereas Visit is used for debugging (it helps the
programmer to understand the relationship between the parallel constructs), Para-
Graph is more suitable for performance analysis and tuning. ParaGraph provides
many displays for process utilization and communication. To analyse the perform-
ance, ParaGraph displays the three processor states ‘busy’, ‘idle’ and ‘overhead’.
There is one handicap* in visualizing PARC++ traces with ParaGraph: the only
objects ParaGraph handles are nodes (processing elements). In a multi-threaded
object-oriented environment such as PARC++ it is more interesting to represent
objects such as threads and mailboxes than the nodes on which the methods of the
objects run. Because of this, we have changed the semantics of the ParaGraph
trace-file and have mapped PARC++ objects to ParaGraph nodes. For example,
if we have two PARC++ threads and one PARC++ mailbox, we map them to
ParaGraph nodes 0, 1 and 2. The advantage of this mapping is that the user is able
to analyse the performance of single threads.
The third trace format of PARC++ can be visualized with POPAI7 (Parallel
Object-oriented Program Animation Instrument). POPAI has been developed by
K. Kiderle as a diploma thesis at the Technical University Munich. All PARC++ objects
except threads can be visualized (see Figure 2). For example, POPAI shows the
queues of mailboxes and monitors. Furthermore, it is possible to create user events
which can easily be visualized with POPAI.
Figure 2. POPAI
* ParaGraph usually visualizes PICL8 trace-files. In PICL there is only one process running on each node, and
so the handicap does not appear.
In recent years, there have been many attempts to define parallel extensions of
C++. In this section we want to compare PARC++ with some of these projects.
In PARC++ we have taken the PRESTO3 thread concept because it is easy to
handle and user-friendly. In addition, there are some differences from and extensions
to PRESTO which further increase the user's comfort: in PRESTO (1.0) the user
has to invoke the member function will_join() before a thread is started if he wants
to get the result after the execution of the thread. After starting, he has to wait
with Objany tval = t->join(). There is no possibility of testing whether a thread
execution is completed or of waiting for all descendants of a started thread. In
PARC++ the user may use test(), result(), wait() and wait_all(). Furthermore, there
is no way of saying explicitly on which PE a thread shall execute, whereas PARC++
provides methods such as set_target(). The locking mechanisms of PRESTO are similar
to those of PARC++. In addition, PARC++ provides mechanisms to test whether a lock
is free or not. In PARC++ monitors, the condition variables are included; in
PRESTO they are external objects which have to be bound to monitors. PRESTO
does not provide any communication via message passing.
COOL9 (Concurrent Object-Oriented Language) was developed by R. Chandra,
A. Gupta and J. L. Hennessy at Stanford University. PARC++, PRESTO and
COOL are similar in that all three attempt to exploit the object model for con-
currency. But there are several differences from PARC++ and PRESTO: COOL
is a syntactical extension of C++. It provides comfortable mechanisms for con-
currency and synchronization. But to implement COOL, the C++ compiler has to
be changed. In COOL, the programmer can declare a function as parallel, for
example, parallel int foo(). If the user invokes foo(), the function will be executed
in parallel (it is also possible to invoke a sequential function in parallel and a parallel
function sequentially). As a result, it is easier to encapsulate parallelism within the
implementation of a class. Another difference from PARC++ is that COOL does
not provide monitors, but mutex functions. If several member functions of the same
object are declared as mutex, they all execute with mutual exclusion. COOL is no
more powerful (in semantics) than PRESTO or PARC++, but it is more flexible and
concurrency may be finer-grained. For this, the implementor of COOL has to pay
the high price of changing a C++ compiler for every special parallel machine.
C++ Parmacs10 (B. Beck, Sequent Computer Systems) is based on the M4
macros, also called Parmacs, which have been developed at the Argonne National
Laboratory. The major difference from PARC++, PRESTO and COOL is the
process model: in C++ Parmacs, the same C++ program (pmain()) runs on every
PE of the system. It is not possible to create new processes or threads dynamically.
Synchronization is done via several types of monitors.
Threads,11 developed by T. W. Doeppner and A. J. Gebele, is based on a system
by B. Stroustrup,12 which was designed for a single processor. The only way to
execute code in parallel is to define a class as a subclass of Task and to implement
the parallel code within the constructor. Synchronization of tasks can be done by
using monitors which are very similar to PARC++ monitors (we have adapted the
signal-wait from Threads monitors).
The major advantage of PARC++ (in comparison with the four parallel C++
dialects above) is the provision of both message passing via mailboxes and synchronization
via locks and monitors. Furthermore, PARC++ supports several trace formats
which can be visualized by different tools.
a work queue and sends this work (if there is any) to the requesting managing
thread.
3. The managing thread. Has various tasks. First, the managing thread creates a
user thread (see below) and sends a work request to the distribution thread.
If work is available, it will be passed to the user thread. The user thread can
now start execution. If the user thread is interrupted (for example, if a lock
is not available), the managing thread will create another user thread and will
send a new work request to the distribution thread. Furthermore, the managing
thread manages all locks, monitors and mailboxes created on its own processing
element.
4. The user thread. The task of a user thread is to execute the function or method
invoked by Thread::start. When the execution is finished, a message will be
sent to the managing thread.
5. The nowork thread. This thread has the lowest priority in the whole team.
Thus, if the nowork thread gets control, all other system threads (especially
the user threads) wait for some events. The only task of the nowork thread is
to send a nowork message to the managing thread. After receiving such a
message, the managing thread creates a new user thread and sends a work
request to the distribution thread.
PERFORMANCE MEASUREMENTS
PARC++ currently runs on a FORCE machine (one MC68030 processor on every
board) and an EDS machine (see below). The FORCE is used as a testbed, where
virtually shared memory is emulated relatively slowly because message passing
between all processing elements is realized via one single bus. The major target
machine of PARC++ is an EDS machine with 40 MHz SPARC processors, a
Delta net for message passing and a separate processor on each processing element
supporting the virtually shared memory and the message passing. Performance
measurements on this machine are more relevant.
The initialization time of PARC++ depends on the number of available processing
elements. On both machines, the initialization time is about 0.14 s for the first processing
element. For every further processing element, the initialization time is about 0.4 s.
The reason for this ‘long’ time is the copying of the whole PARC++ run-time
system to every processing element.
The asynchronous start of a member function costs about 4 ms on FORCE and
about 1.5 ms on EDS. The real time between asynchronous invocation, parallel
execution and getting the return value on the invoker’s site is about 20 ms on
FORCE and about 8 ms on EDS (average values). The time for wait(), wait_all(),
test() and result() is about 0.01–0.015 ms on both machines.
The time for requesting a lock depends on the processing element on which the
lock was created. For example, if a lock was created on the same processing element,
the request time is about 1.8 ms on EDS (3.6 ms on FORCE). If the lock was
created on another processing element, the request time increases up to 18 ms
(32 ms on FORCE) for the first request. This time includes the initialization of lock
objects. Once initialized, the request time decreases to 1.8 ms on EDS (3.5 ms on
FORCE). So the request time is lower than the page fault time! The time to unlock
a lock is about 0.2 ms on EDS and about 0.8 ms on FORCE. The timing of mailboxes
(receive and send one integer value) and monitors (enter and exit) is similar to the
timing of locks. A monitor's wait() lasts about the time of 2 * lock() + unlock(); a
signal() lasts about the time of lock() + unlock().
Because of the overhead of the asynchronous invocation, only functions with an
execution time of more than 20 ms (on the FORCE machine) or more than 8 ms
(on the EDS machine) should be executed in parallel. The performance of PARC++
depends basically on the performance of the message passing constructs and the
mapper of the virtually shared memory, both provided by the EMEX operating
system.
CONCLUSIONS
PARC++ is a programming tool to write object-oriented parallel programs, providing
several synchronization and communication constructs. In contrast to other
parallel C++ versions, PARC++ provides both shared memory and message passing
constructs. As a result, some kinds of algorithms can be implemented in a more
natural way. Furthermore, data exchange in a virtually shared memory environment
via message passing may be more efficient than the reading and writing of virtually
shared data. PARC++ provides interfaces to several visualization tools such as
Visit, POPAI and ParaGraph, which may be used for debugging and/or performance
analysis. PARC++ is implemented as a C++ class library without changing the
compiler. In this way, it is easy to port it to other systems. The PARC++ implementation
was the first application of the newly developed EMEX operating system.
We assume that we can increase the efficiency of PARC++ and hope that further
PARC++ applications will show that we have reached our objective.
ACKNOWLEDGEMENTS
We would like to thank Jürgen Knopp and Thomas Henties for their feedback on
the design and implementation of PARC++ and Friederike Richter for helpful
ideas and comments on this paper.
REFERENCES
1. B. Stroustrup, The C++ Programming Language, Addison-Wesley, March 1986.
2. B. Stroustrup, ‘An overview of C++’, AT&T Bell Laboratories, Murray Hill, 1986.
3. B. N. Bershad, E. D. Lazowska and H. M. Levy, ‘PRESTO: a system for object-oriented parallel programming’, Software: Practice and Experience, 18, (8), 713-732 (1988).
4. P. Brinch Hansen, ‘The programming language Concurrent Pascal’, IEEE Trans. Software Engineering, 2, 199-206 (1975).
5. H. Ilmberger and C. P. Wiedemann, ‘Visualization and control environment for parallel program debugging’, HICSS-26, Hawaii, January 1993.
6. M. T. Heath and J. A. Etheridge, ‘Visualizing the performance of parallel programs’, IEEE Software, 8, (5), 29-39 (1991).
7. K. Kiderle, ‘Visualisierung und Analyse des dynamischen Ablaufverhaltens paralleler objektorientierter Programme’, Diploma thesis, Technical University Munich, February 1993.
8. G. A. Geist, M. T. Heath, B. W. Peyton and P. H. Worley, ‘PICL: a portable instrumented communication library’, Technical Report, 1990.
9. R. Chandra, A. Gupta and J. L. Hennessy, ‘COOL: a language for parallel programming’, Technical Report No. CSL-TR-89-396, Stanford University, 1989.
10. B. Beck, ‘Shared-memory parallel programming in C++’, IEEE Software, July 1990, pp. 38-48.
11. T. W. Doeppner and A. J. Gebele, ‘C++ on a parallel machine’, USENIX C++ Papers, Department of Computer Science, Brown University, 1987, pp. 95-107.
12. B. Stroustrup, ‘A set of C++ classes for co-routine style programming’, AT&T Bell Laboratories Computer Science Technical Report, available with release notes for C++ 1.2.1.
13. H. G. Baumgarten, L. Borrmann, H. Hartlage, N. Holt, P. Istavrinos and S. Prior, ‘Specification of the process control language (PCL)’, ESPRIT EP 2025, EDS.DD.lS.0007, Munich, 1989.
14. G. Haworth, S. Leuning, C. Hammer and M. Reeve, ‘The European declarative system, database and languages’, IEEE Micro, December 1990.
15. C. J. Skelton, C. Hammer, M. Lopez, M. J. Reeve, P. Townsend and K. F. Wong, ‘EDS: a parallel computer system for advanced information processing’, Conference on Parallel Architectures and Languages Europe, PARLE 92, Paris, June 1992.
16. M. Ward, P. Townsend and G. Watzlawik, ‘EDS hardware architecture’, Conference on Vector and Parallel Processing, Zurich, September 1990.
17. K. Tödter, ‘Entwicklung eines parallelen C++-Dialektes’, Diploma thesis, Technical University Braunschweig, January 1992.
18. F. Armand, F. Herrmann, J. Lipkis and M. Rozier, ‘Multithreaded processes in CHORUS/MiX’, Proc. EUUG Spring ’90 Conference, Munich, April 1990.
19. C. A. R. Hoare, ‘Monitors: an operating system structuring concept’, Communications of the ACM, 17, (10), 549-557 (1974).