The MILLIPEDE Project Technion, Israel

Windows-NT based Distributed Virtual Parallel Machine
http://www.cs.technion.ac.il/Labs/Millipede

What is Millipede?
• A strong Virtual Parallel Machine: employs non-dedicated distributed environments
• Implementation of parallel programming languages

[Architecture diagram: the Millipede layers on a distributed environment]
• Programming paradigms: Cilk/Calipso, Java, ParC, ParFortran90, SPLASH, CC++, CParPar, "Bare Millipede", others
• Millipede core: Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM)
• Operating system services: communication, threads, page protection, I/O
• Software packages: user-mode threads; communication packages (U-Net, Transis, Horus, ...)

What's in a VPM? Checklist:
• Using a non-dedicated cluster of PCs (+ SMPs)
• Multi-threaded
• Shared memory
• User-mode
• Strong support for weak memory
• Dynamic page- and job-migration
• Load sharing for maximal locality of reference
• Convergence to optimal level of parallelism
So...

Using a non-dedicated cluster
• Dynamically identify idle machines
• Move work to idle machines
• Evacuate busy machines
• Do everything transparently to the native user
• Co-existence of several parallel applications

Multi-Threaded Environments
• Well known benefits:
  – Better utilization of resources
  – An intuitive and high level of abstraction
  – Latency hiding by overlapping computation and communication
• Natural for parallel programming paradigms & environments:
  – Programmer-defined maximal level of parallelism
  – Actual level of parallelism set dynamically; applications scale up and down
  – Nested parallelism
  – SMPs

Convergence to Optimal Speedup
• The tradeoff: higher level of parallelism vs. better locality of memory reference
• Optimal speedup is not necessarily achieved with the maximal number of computers
• The achieved level of parallelism depends on the program's needs and on the capabilities of the system

Shared Memory: No / Explicit / Implicit Access

PVM (no shared memory, explicit message passing):

    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);

    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }

    /* Do calculations with data */
    result = work(me, nproc, data, n);

    /* Send result to master */
    master = pvm_parent();
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    pvm_send(master, msgtype);

    /* Exit PVM before stopping */
    pvm_exit();

C-Linda (explicit tuple-space access):

    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);
    /* Do calculation, put result in DSM */
    out("result", work(id, nproc, data, n));
    /* Worker id is given at creation,
       no need to compute it now */

"Bare" Millipede (implicit shared memory):

    result = work(milGetMyId(), milGetTotalIds());

Relaxed Consistency (avoiding false sharing and ping-pong)
• Sequential, CRUW, and Arbitrary-CW consistency for page copies; explicit Sync(var)
• Multiple relaxations for different shared variables within the same program
• No broadcast, no central address servers (so the DSM can work efficiently across interconnected LANs)
• New protocols welcome (user-defined?!)
• Step-by-step optimization towards maximal parallelism

Reducing Consistency
LU decomposition of a 1024x1024 matrix, written in SPLASH. Advantages gained when reducing the consistency of a single variable (the Global structure):
[Figure: speedups on 1-5 hosts (original vs. reduced), and number of migrations of page #4 per host (original vs. reduced).]

MJEC: Millipede Job Event Control
An open mechanism with which various synchronization methods can be implemented
• A job has a unique system-wide id
• Jobs communicate and synchronize by sending events
• Although a job is mobile, its events follow it and reach its event queue wherever it goes
• Event handlers are context-sensitive

MJEC (cont'd)
• Modes:
  – In Execution Mode: arriving events are enqueued
  – In Dispatching Mode: events are dequeued and handled by a user-supplied dispatching routine

MJEC Interface
• Post event (Execution Mode): milPostEvent(id target, int event, int data)
• Registration and entering Dispatching Mode: milEnterDispatchingMode((FUNC)foo, context)
• Dispatcher routine syntax: int foo(id origin, int event, int data, void *context)
• Dispatch loop: ret := func(INIT, context); then, while ret != EXIT, wait for a pending event and call ret := func(event, context); finally call ret := func(EXIT, context)

Experience with MJEC
• ParC: ~250 lines; SPLASH: ~120 lines
• Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
• Implementation of location-dependent services (e.g., graphical display)

Example: Barriers with MJEC
Job side: each job posts an ARR event to the barrier server, then waits in dispatching mode until a DEP event arrives.

    BARRIER(...) {
        milPostEvent(BARSERV, ARR, 0);
        milEnterDispatchingMode(wait_in_barrier, 0);
    }

    wait_in_barrier(src, event, context) {
        if (event == DEP)
            return EXIT_DISPATCHER;
        else
            return STAY_IN_DISPATCHER;
    }

Example: Barriers with MJEC (cont'd)
Barrier server side: arriving jobs are enqueued; when all have arrived, the server releases them by posting DEP events.

    BarrierServer() {
        milEnterDispatchingMode(barrier_server, info);
    }

    barrier_server(src, event, context) {
        if (event == ARR)
            enqueue(context.queue, src);
        if (should_release(context))
            while (context.cnt > 0)
                milPostEvent(context.dequeue(), DEP, 0);
        return STAY_IN_DISPATCHER;
    }

Dynamic Page- and Job-Migration
• Migration may occur in case of:
  – Remote memory access
  – Load imbalance
  – User comes back from lunch
  – Improving locality by location rearrangement
• Sometimes migration should be disabled:
  – By the system: ping-pong, critical section
  – By the programmer: control system

Migration Can Help Locality
Locality of memory reference is THE dominant efficiency factor.
[Diagram: only job migration, only page migration, and combined page & job migration.]

Load Sharing + Maximal Locality = Minimum-Weight Multiway Cut
[Diagram: a graph of threads and pages (p, q, r) partitioned between hosts; the cut weight counts remote accesses.]

Problems with the multiway cut model
• NP-hard for #cuts > 2 (and we have n > X.000.000); polynomial 2-approximations are known
• Not optimized for load balancing
• Page replicas
• The graph changes dynamically
• Only external accesses are recorded ===> only partial information is available

Our Approach
• Record the history of remote accesses
• Use this information when taking decisions concerning load balancing/load sharing
• Save old information to avoid repeating bad decisions (learn from mistakes)
• Detect and solve ping-pong situations
• Do everything by piggybacking on communication that is taking place anyway
[Diagram: a job's recorded access history across pages 0, 1, and 2.]

Ping-Pong Detection and Treatment
Detection (local):
1. The page leaves the host shortly after its arrival
2. Local threads attempt to use the page a short time after it leaves the local host
Treatment (by the ping-pong server):
• Collect information regarding all participating hosts and threads
• Try to locate an underloaded target host
• Stabilize the system by locking-in pages/threads

Effect of Locality Optimization
TSP, 15 cities, Bare Millipede.
[Figure: execution time (sec) on 1-6 hosts for NO-FS, OPTIMIZED-FS, and FS.]
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.

TSP on 6 hosts (k = number of threads falsely sharing a page):

k | optimized? | # DSM-related msgs | # ping-pong treatment msgs | # thread migrations | execution time (sec)
2 | Yes        |   5100             | 290                        |  68                 |  645
2 | No         | 176120             |   0                        |  23                 | 1020
3 | Yes        |   4080             | 279                        |  87                 |  620
3 | No         | 160460             |   0                        |  32                 | 1514
4 | Yes        |   5060             | 343                        |  99                 |  690
4 | No         | 155540             |   0                        |  44                 | 1515
5 | Yes        |   6160             | 443                        | 139                 |  700
5 | No         | 162505             |   0                        |  55                 | 1442

Ping-Pong Detection Sensitivity
[Figures: execution time vs. detection sensitivity (2-20) for TSP-1 and TSP-2.]
TSP-1: best results are achieved at maximal sensitivity, since all pages are accessed frequently.
TSP-2: since part of the pages are accessed frequently and part only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.

Applications
• Numerical computations: Multigrid
• Model checking: BDDs
• Compute-intensive graphics: Ray-Tracing, Radiosity
• Games: search trees, pruning
• Tracking
• CFD
• ...

Performance Evaluation: tuning parameters and open questions
• L: underloaded threshold; H: overloaded threshold
• Delta (ms): polling interval (MGS, DSM)
• msg delta; system-pages delta
• T_epoch: maximal history time; L_epoch: history length
• Lock-in time-out delta
• Histories: page histories vs. job histories; refresh/remove old histories
• Migration heuristic: which function?
• Ping-pong: is it ping-pong? what is the initial noise? what treatment frequency?

LU Decomposition
1024x1024 matrix, written in SPLASH. Performance improvements when there are few threads on each host.
[Figure: speedup on 1-6 hosts, with 1 thread/host and 3 threads/host.]

LU Decomposition
2048x2048 matrix, written in SPLASH. Super-linear speedups due to the caching effect.
[Figure: speedup vs. number of hosts (1-7); speedups reach 7.4 and 8.1.]

Jacobi Relaxation
512x512 matrix (using 2 matrices, no false sharing), written in ParC.
[Figure: execution time and speedup on 1-4 hosts.]

Overhead of ParC/Millipede on a Single Host
Testing with the Tracking algorithm: ParC on Millipede relative to pure "Bare" Millipede.
[Figure: relative performance for 10 and 20 targets, within roughly 0.98-1.05.]

Info: http://www.cs.technion.ac.il/Labs/Millipede
millipede@cs.technion.ac.il
Release available at the Millipede site!