
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
A Scheduling Strategy Supporting OpenMP Task on Heterogeneous Multicore

Qian Cao, Min Zuo

School of Computer and Information Engineering
Beijing Technology and Business University
Beijing, China

Abstract- One of the most important topics in the software industry is how to utilize the OpenMP 3.0 programming model to improve the execution of irregular and unstructured applications. In this paper, we present an original task scheduling strategy, the Hybrid strategy, which is suited to the execution of OpenMP programs on the Cell heterogeneous multicore. The Hybrid scheduling strategy creates tasks in breadth-first order while executing tasks in work-first fashion during application execution. The former is capable of creating enough tasks, which prevents the worker threads from idling, while the latter guarantees that task dependences are freed quickly and, consequently, that the overhead of task searching is significantly decreased. The evaluation, with a variety of Barcelona OpenMP Task Suite benchmarks, is conducted on a PS3 heterogeneous multicore. The experimental results indicate that the Hybrid policy outperforms the existing work-first and breadth-first scheduling strategies for most irregular and unstructured benchmarks, with speedups from 1.5 to 4.6 when 6 SPEs are used.

Keywords- OpenMP 3.0; task; Hybrid scheduling strategy; heterogeneous multicore

I. INTRODUCTION

With the rapid development of the semiconductor industry, multicore systems have become one of the research hotspots in integrated circuit design. A homogeneous multicore can greatly improve application parallelism, but it fails to exploit the concurrency available in some complex applications. The heterogeneous multicore, which integrates a number of diverse processors on a single chip, is an inevitable trend in computer architecture [1].

The Cell Broadband Engine (Cell BE) [2] is a representative heterogeneous multicore processor, which has drawn substantial attention from both industry and academia. It comprises a conventional Power Processor Element (PPE) that controls eight Synergistic Processing Elements (SPEs). The PPE has two levels of hardware cache, while the SPEs have no caches but each has a 256KB local store (LS). The PPE can access the main memory, while an SPE can only operate directly on its LS. Such architectural characteristics make the Cell architecture compelling for some specific scientific computing.

As hardware complexity increases, modern applications are getting more sophisticated. Irregular and dynamic structures, such as recursive kernels and unbounded loops, are prevalent in high performance computing. Therefore, many mainstream programming environments [3-9] adopt tasks as a high-level abstraction to solve such problems, with OpenMP 3.0 [10] as a prominent example. The OpenMP 3.0 specification, which adds a new directive, task, has shifted from a thread-centric to a task-centric execution model. The task mechanism allows programmers to explicitly identify units of independent work, leaving the scheduling decision of how and when to execute the created tasks to the runtime system. As a consequence, an appropriate and flexible scheduling decision for the OpenMP 3.0 task mechanism on the Cell architecture has become a rapidly growing area of research. Broadly speaking, these policies are categorized into the work-first scheduling strategy (WF) and the breadth-first scheduling strategy (BF).

However, there are still some deficiencies in the WF and BF strategies, which will be specified in the related works below. Considering this, we design and implement an original task scheduling strategy, the Hybrid scheduling strategy, which supports the OpenMP 3.0 task model targeting the Cell heterogeneous multicore. The Hybrid scheduling strategy combines the advantages of the existing BF and WF scheduling strategies. It creates tasks in BF order while executing tasks in WF fashion. The former can spawn enough tasks within a short time, which prevents the computation threads from idling and thus decreases the number of suspended tasks. Additionally, it avoids the situation in which all tied tasks in the OpenMP 3.0 standard become tied to a single thread under WF task creation. The latter guarantees that the tasks at the bottom of the task tree are executed first. Therefore, task dependences are freed earlier, the overhead of task lookup is drastically decreased, and workload balance is improved.

To evaluate the performance of the Hybrid strategy, we conducted experiments on a set of Barcelona OpenMP Task Suite (BOTS) benchmarks. The experimental results suggest that the Hybrid scheduling strategy achieves better performance than the WF and BF policies for most benchmarks, with speedups up to 4.6 when 6 SPEs are used.

The rest of the paper is organized as follows. Related works are presented in Section II. The design of the Hybrid scheduling strategy on the Cell BE is presented in Section III. Section IV describes the execution process of the Hybrid policy. Section V shows the evaluation results, and the last section concludes the paper.

978-0-7695-4676-6/12 $26.00 © 2012 IEEE

DOI 10.1109/IPDPSW.2012.244
II. RELATED WORKS

There has been considerable research on task scheduling strategies. Due to space considerations, we concentrate our discussion of related works on task scheduling decisions targeting the OpenMP 3.0 specification. Specifically for the OpenMP 3.0 task mechanism, the approaches presented in the NANOS runtime are the mainstream proposals in a broad sense, and some other policies [13-20] are derived from them. The NANOS runtime [11-12] covers two scheduling policies: the work-first scheduling strategy and the breadth-first scheduling strategy, which implement the scheduling restrictions of the specification.

The work-first scheduling strategy is also named the depth-first scheduling strategy; it tries to follow the serial execution path, hoping that if the sequential algorithm was well designed it will benefit from better data locality. WF works as follows: whenever a task is created, the parent task is suspended and the executing thread switches to the newly created task. When a task is suspended, it is placed in a thread's local pool. The Cilk programming model [7] developed at MIT, as well as Intel TBB [8], adopts a similar scheduling strategy.

WF performs well for some recursive applications in that task dependences are freed as soon as possible, yet it has some deficiencies:
• First, the initial task creation proceeds in depth-first manner, which is incompatible with tied OpenMP 3.0 tasks. If this scheme were used for tied tasks in OpenMP 3.0, all the tasks would remain tied to a single thread, which hampers application parallelism.
• Second, the WF policy is designed for scenarios in which work-stealing is rare, but the overhead related to work-stealing grows markedly as the number of threads increases.

The breadth-first scheduling strategy is a naive scheduler in which newly created tasks are placed into the team pool and execution of the parent task continues. This strategy guarantees that all tasks in the current recursion level are generated before a thread executes tasks from the next level. Microsoft TPL (Task Parallel Library) [9] applies the BF scheduling strategy. BF achieves ideal performance for some applications with an iterative structure since it creates enough tasks in the early stage. However, it has the following shortcomings:
• Application locality is affected in that newly created tasks are put into the global task queue instead of being executed immediately.
• Extra synchronization overhead is incurred when a large-scale program is involved, which hinders application scalability.
• The BF policy spreads out the tasks in the task tree too early, which wastes valuable memory resources.

Apart from the strategies mentioned above, there are some scheduling schemes designed specifically for the Cell processor [21-22]. Blagojevic et al. [21] investigate the mapping of parallel computation to the Cell BE, showing the importance of careful management of the interaction between the PPE and SPEs. However, the Linux scheduler lacks knowledge of the asynchronous nature of the SPE computation, which is likely to delay the response to events on the SPE side. The same team [22] proposes an event-driven cooperative scheduling mechanism. The kernel-level implementation is an extension to the Linux kernel with data structures and system calls which allow the SPEs to request activation of a specific process on the PPE side. For the user-level approach, shared memory abstractions and a work-stealing strategy are exploited.

III. DESIGN OF HYBRID SCHEDULING STRATEGY

Considering the Cell architectural features, the PPE and SPEs reside in the dominant and subordinate positions, respectively, in our runtime system, as illustrated in Figure 1. The PPE is responsible for runtime initialization, task creation, inter-task synchronization, etc., while the SPEs mainly deal with task execution after the tasks are fetched from the system memory via a software cache or direct buffer, and then write the results back to the main memory.

Figure 1. Runtime system interactive framework

A. Task structure

The task structure is designed as follows, taking into account the task characteristics as well as scheduling flexibility.

    typedef struct TASK_STRUCT {
        ctx_t ctx;
        struct TASK_STRUCT *parent;
        int depth;
        int n_children;
        void (*fn)(void*);
        void *data;
        bool is_tied;
    } TASK;

Each member is described as follows:
• ctx: the task context;
• parent: the father of the current task. In the Hybrid policy, a task is required to keep track of its parent to facilitate task synchronization. If the current task is the root, parent is set to NULL;
• depth: the depth of the current task, which is one more than that of its father;

• n_children: the number of the current task's direct children; its initial value is 0. It makes it easier to implement the taskwait primitive mentioned below;
• fn: the pointer to the encapsulated code of the task;
• data: the pointer to the data of the current task;
• is_tied: the property of the current task. If it is a tied task, is_tied is set to 1; otherwise, is_tied is set to 0.

B. Main interfaces

The task runtime is implemented as a convenient library, which provides a variety of flexible APIs for task initialization, task creation, task removal, etc. The interfaces are described as follows:

(1) PPE side
• unsigned long task_init(spe_context_ptr_t t_ctx)
This is the task initialization function on the PPE side; it plays an important role in initializing the PPE environment variables and notifying the SPE to receive data. t_ctx is a pointer to the SPE context. When the SPE receives the signal sent from the PPE, it initializes itself and sends a verification signal back to the PPE, indicating its ready status.
• t_task* task_creation(void *function, void *data)
This is the function for the PPE to create a task. Once it is invoked, a new task is spawned. function is the entry point, and data is the data pointer. When the new task is created, the PPE locates an SPE thread in round-robin fashion, which keeps the load balanced among the active threads, and puts the new task into the corresponding SPE's ULTQ.
• unsigned long get_cnt_task()
This function allows the PPE to capture the total number of created tasks, which makes it easy to track the overall runtime state.

(2) SPE side
• void task_init()
The SPE invokes this function to initialize its environment variables. The SPE thread waits in a loop after it is created. When receiving the initialization message sent from the PPE thread, it sends back a confirmation receipt and then enters the following LTQ dequeue, data removal and task execution stages:
• int task_data_read(void *fn_data, unsigned long ea, long size)
This interface deals with reading data. Once this function is called, the SPE is required to fetch data from the system memory. fn_data is the address of the stored data on the SPE side; ea is the address in main memory; size represents the amount of data to be fetched.
• void task_execution(void *fn, void *fn_data)
It is this function's responsibility to execute a task on the SPE side. fn is the pointer to the task, and fn_data is the pointer to the parameters passed to fn.
• int task_data_write(void *fn_data, unsigned long ea, long size)
When this function is called, the return value of the task is written back to the memory address pointed to by ea. fn_data is the address of the stored data on the SPE side; ea is the address of the data in the main memory; size is the amount of data required to be written back to the memory.

IV. EXECUTION PROCESS OF HYBRID SCHEDULING STRATEGY

Initially, all the LTQs, including both the TLTQs and ULTQs, are empty, and all the SPEs are idle. The PPE first creates the root task and chooses an SPE, SPEi for instance, in round-robin fashion. The PPE then puts the root task into the 0-level queue of SPEi's ULTQ. When SPEi finds a ready task in its ULTQ, it picks up the task and starts execution. During execution, task creation, work-stealing, task synchronization and task completion are respectively implemented as follows:

Task creation: If SPEi encounters the task pragma, it continues to execute the original task, leaving task creation to the PPE. The new task, T2-0 (2 represents the task depth) for instance, is created in BF fashion, and the process of task generation is summarized as follows:
• The PPE creates the new task, allocating memory space and initializing the runtime structures;
• The PPE locates an SPE, SPEk for instance, in round-robin fashion, and puts the newly created task at the tail of the corresponding queue of SPEk's ULTQ, as illustrated in Figure 2, where the depth of the newly created task is 2;
• The number of tasks is increased by 1. Additionally, the parameter n_children in the father task is increased by 1;
• Return.

Figure 2. The PPE creates a task and puts it into the task queue

When SPEk is ready, it polls a task from its LTQ, with the TLTQ checked first. If the TLTQ is empty, SPEk checks its ULTQ, in order from deep to shallow, i.e., from level3 to level0 in Figure 2. For a specific queue in the ULTQ, the sequence is from head to tail, so task T2-1 is fetched first.

Work-stealing: The Hybrid scheduling policy adopts work-stealing to achieve a balanced load, which is similar to the WF strategy. The Hybrid policy obeys a certain rule to reduce severe contention on simultaneous access to the task queue. To illustrate the rule, we assume the following:

• The 8 SPEs on the Cell processor are represented by SPE0, SPE1, SPE2, ..., SPE7;
• SPEj (j = (i+1) mod 8) is taken as the right neighbor of SPEi (i ∈ {0, 1, 2, ..., 7});
• SPEi is now idle.

The rule that the Hybrid strategy obeys is as follows: when the PPE thread detects the idle status of SPEi, the PPE chooses SPEi's right neighbor as the victim, namely the thread to be stolen from.

According to the nature of tied tasks, the tasks in a TLTQ cannot be stolen. The task which resides in the deepest queue of the victim thread's ULTQ can be stolen, which guarantees that the tasks near the head are executed first and the dependences between the tasks are freed as soon as possible.

To clearly describe the work-stealing, Figure 3 gives an example. SPEi finishes its own tasks and becomes idle. Its right neighbor, namely SPEj, is chosen as the victim, and the task located at the head of the 3rd-level queue of the ULTQ, namely T3-0, is stolen.

Figure 3. SPEi steals a task from SPEj's Local Task Queue

Task synchronization: The taskwait and barrier constructs in OpenMP 3.0 are usually used to achieve task synchronization.

(1) taskwait
When the taskwait construct is encountered, a task has to wait until all its direct descendants are completed. To implement taskwait, a counter is added to the task structure to record the number of its direct children. The counter is initialized to 0, and it increases by one when a new direct child is created. Analogously, the counter is decreased by one on a direct child's completion. When the counter decreases to 0, the task can be awakened. Figure 4 shows the taskwait flow. A task executes to completion without interruption unless taskwait is encountered. Once the taskwait directive is encountered during task execution, the counter is checked first. If the counter does not equal 0, the current task is suspended into the corresponding TLTQ or ULTQ according to its tied or untied property. When all its direct descendants are completed, the suspended task is awakened. When the task finishes, it is not destroyed until all its children have been completed.

Figure 4. Implementation flowchart of taskwait
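The counter protocol in the taskwait flow above can be sketched in plain C. The names here (task_t, spawn_child, complete_child) are illustrative, not the runtime's actual API, and the thread synchronization needed on a real Cell system is omitted for brevity:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal single-threaded sketch of the taskwait child counter. */
typedef struct task_t {
    struct task_t *parent;
    int n_children;   /* direct children not yet completed */
    int suspended;    /* set while the task is blocked in taskwait */
} task_t;

/* Task creation: the father's counter grows by one. */
static void spawn_child(task_t *parent, task_t *child) {
    child->parent = parent;
    child->n_children = 0;
    child->suspended = 0;
    parent->n_children++;
}

/* Task completion: decrement the father's counter; when it reaches
 * zero, a father suspended in taskwait may be awakened. */
static void complete_child(task_t *child) {
    task_t *p = child->parent;
    if (p && --p->n_children == 0 && p->suspended)
        p->suspended = 0;   /* awaken the father */
}

/* taskwait: the task must suspend while direct children remain. */
static int taskwait_would_block(const task_t *t) {
    return t->n_children != 0;
}
```

The suspended flag stands in for re-enqueuing the task into its TLTQ or ULTQ as described above.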

(2) barrier
The barrier directive requires that no thread continue until all tasks created before this construct are finished. To achieve this, every thread team keeps a counter to record the number of unfinished tasks in the team. When the barrier construct is encountered, each thread decrements this counter by one. When the counter is decreased to 0, all the SPE threads can resume task execution.

Task completion: When SPEi finishes its own task, it will first look at its TLTQ. If there are ready tasks, SPEi polls a task and starts to execute it. If there are only suspended tasks, SPEi cannot grab a task until their direct descendants are completed. If the TLTQ of SPEi is empty, SPEi checks its ULTQ. If it is not empty, SPEi fetches a task and executes it. Otherwise (i.e., the ULTQ is empty), SPEi finishes its own tasks and sends a finish signal to the PPE. When the PPE receives the signals from all the SPEs, the parallel region ends.

V. EVALUATION

In this section, the experimental platform is first introduced. Then several benchmarks which are based on the task model are presented. Finally, the Hybrid strategy is compared with WF and BF in terms of speedups.

A. Experimental platform
The experiments are conducted on a PS3 console [23] equipped with a 3.2GHz Cell processor and 256MB of system memory. Applications are allowed to access only six of the eight SPEs, since two of them are unavailable. The PPE has a 32KB L1 instruction cache, a 32KB L1 data cache, and a 512KB L2 cache. Each SPE depends on asynchronous DMA to transfer data between the local store and main memory. The system runs Fedora 9 (Linux kernel 2.6.25-14), and the programs are compiled with the Cell SDK 3.1. We implemented both BF and WF on the PS3 multicore so as to compare the Hybrid policy with BF and WF.

B. Experimental benchmarks
For a preliminary evaluation of our Hybrid policy, the performance is measured with six kernel applications, NQueens, SparseLU, Multisort, Strassen, FFT and Fib, which are derived from the Barcelona OpenMP Task Suite (BOTS) [24]. BOTS, which is developed by the Barcelona Supercomputing Center, covers medicine, biology and other fields, and is still expanding in order to cover more domains and scenarios. It has been used previously [5-6, 11-12] to test the OpenMP implementation of different kinds of tasks, such as tied or untied tasks and single or multiple producers. The application characteristics and input parameters for these benchmarks are summarized in Table 1.
TABLE 1. BOTS applications summary.

Benchmarks | Input parameters                                   | Computation structure | Nested tasks
Strassen   | matrix size of 8192*8192                           | At each node          | yes
NQueens    | a chessboard size of 14*14                         | At each node          | yes
Fib        | the 40th Fib number is computed                    | At each node          | yes
FFT        | 128M floats                                        | At leaf               | yes
Multisort  | array size of 128MB                                | At leaf               | yes
SparseLU   | matrix size of 5000*5000, submatrix size of 100*100 | Iterative           | no
Alignment  | 16777216 (2^24) single-precision complex numbers   | Iterative             | no
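Most of these kernels rely on the recursive OpenMP 3.0 task pattern; Fib is the smallest example. The sketch below shows the standard idiom (it is not the exact BOTS source); without a cut-off, every addition becomes its own task:

```c
#include <assert.h>

/* Recursive Fib with OpenMP 3.0 tasks: each call spawns two child
 * tasks and joins them with taskwait. Compiled without OpenMP, the
 * pragmas are ignored and the code runs serially. */
long fib(int n) {
    long a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait   /* wait for the two direct children */
    return a + b;
}
```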

Strassen computes the multiplication of large dense matrices by hierarchical decomposition of a matrix. A task is created at each decomposition.

NQueens, which utilizes a backtracking search algorithm, attempts to find a placement for N queens on an N*N chessboard so that none of the queens attacks another.

Fib computes the nth Fibonacci number using a recursive parallelization. In Fib, the first two Fibonacci numbers are initialized to 0 and 1, and each subsequent number is the sum of the previous two.

FFT exploits a divide-and-conquer algorithm that recursively breaks down a Discrete Fourier Transform (DFT) into many smaller ones to compute the one-dimensional DFT.

Multisort sorts a random permutation with a fast parallel sorting algorithm by dividing all the elements into halves, sorting each half recursively, and then merging the sorted halves with a parallel divide-and-conquer method.

SparseLU computes an LU matrix factorization over sparse matrices. A lot of imbalance exists due to the sparseness of the matrices.

Alignment aligns all protein sequences against every other sequence from an input file. The alignments are scored, and the best one for each pair is provided as the output.

C. Performance comparison and analysis
We implemented the WF, BF and Hybrid strategies on the PS3 in order to compare their performance; the information on these strategies is listed in Table 2. WF and BF both adopt FIFO fashion to access the LTQ, which is consistent with the Nanos runtime. In the Hybrid policy, the link array is accessed in work-first manner and, for a specific link, it is visited in FIFO order. BF does not support work-stealing, while WF and Hybrid both adopt work-stealing with the right neighbor as the victim.
TABLE 2. The summary of WF, BF and Hybrid strategies.

Scheduling strategy | Order to access the task queue      | Order of work-stealing
BF                  | FIFO                                | No work-stealing
Hybrid              | link array: work-first; link: FIFO  | work-first
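The access orders in Table 2 can be made concrete with a small sketch of one SPE's ULTQ: a link array indexed by task depth, with each level a FIFO queue. Creation appends at the tail of the task's level (breadth-first), while execution pops from the head of the deepest non-empty level (work-first). The fixed bounds and names below are illustrative, not the runtime's actual structures:

```c
#include <assert.h>

#define MAX_DEPTH 8    /* illustrative bounds */
#define LEVEL_CAP 64

typedef struct {
    int q[MAX_DEPTH][LEVEL_CAP];   /* one FIFO of task ids per depth */
    int head[MAX_DEPTH];
    int tail[MAX_DEPTH];
} ultq_t;

static void ultq_init(ultq_t *u) {
    for (int d = 0; d < MAX_DEPTH; d++)
        u->head[d] = u->tail[d] = 0;
}

/* BF creation: append at the tail of the task's depth level. */
static void ultq_push(ultq_t *u, int depth, int task_id) {
    u->q[depth][u->tail[depth]++] = task_id;
}

/* WF execution: pop from the head of the deepest non-empty level,
 * so tasks near the bottom of the task tree run first and free
 * their dependences early. */
static int ultq_pop(ultq_t *u) {
    for (int d = MAX_DEPTH - 1; d >= 0; d--)
        if (u->head[d] < u->tail[d])
            return u->q[d][u->head[d]++];
    return -1;   /* queue empty */
}
```

A steal differs only in taking from the head of the victim's deepest level rather than from the thief's own queue.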

In our experiments, the number of work-stealing operations is an important parameter. On one hand, a large number of steals is desired to keep the workload balanced. On the other hand, the number cannot be too large, since too many steals may lead to frequent task transfers among the worker threads, which brings additional communication cost. So in the Hybrid policy, the number of work-stealing operations is set to one half of the number of worker threads, namely 3 (6/2) in our experiments.

Figure 5 illustrates the speedups for all the benchmarks under the BF, WF and Hybrid strategies, with the sequential execution time of each kernel under the BF strategy as the baseline. In the experiments, the tasks have the untied property, and the x-coordinate represents the number of SPEs used. Furthermore, the reported results are the average of 10 trials.

On the whole, the Hybrid solution performs equally well or better than BF and WF for most of the benchmarks. Specifically, Strassen and NQueens obtain the most significant performance improvements. Multisort and FFT achieve a slight but not obvious performance improvement. The performance of Alignment and SparseLU remains essentially unchanged under the three scheduling policies. For Fib, although the Hybrid policy gets performance similar to WF and outperforms BF, its performance is not ideal under any of the three scheduling strategies. The experimental results are analyzed in detail as follows.

For most benchmarks (Strassen, NQueens, Fib, Multisort and FFT), WF performs better than BF due to their nested parallel structure. Tasks located in the shallow levels are spread out first when BF is adopted, which results in traversing too many tasks when looking for a ready one. Whereas if the WF scheduling strategy is used, tasks located at the bottom of the task tree are executed with high priority, and subsequently the dependences among tasks are freed, so the tasks in the shallow levels can be executed successfully. That is the key factor that determines that WF outperforms BF.


Figure 5. Speedups for all benchmarks under the three scheduling strategies (one panel per benchmark: Strassen, NQueens, Fib, Multisort, FFT, SparseLU, Alignment; x-axis: 1, 2, 4, 6 SPEs)

In contrast to Fib, Multisort and FFT, we observe that Strassen and NQueens obtain significant performance gains after the introduction of the Hybrid strategy, with almost linear speedups. Especially for Strassen, the Hybrid strategy clearly outperforms BF and WF when 6 SPEs are used, with speedups increasing by 3.3% and 4.7% compared with WF and BF, respectively. NQueens gains efficiency from the Hybrid policy when compared with WF and BF, with performance improvements of 1.7% and 3.8%. The reason that the Strassen and NQueens kernels benefit more is the following: Strassen and NQueens perform computation in each of their nodes, and in the early stage, the Hybrid strategy can generate a great quantity of tasks within a short time to keep all the worker threads busy. However, WF cannot spawn enough tasks to maintain this saturated state, which wastes valuable computation resources and hence potential parallelism. This is the primary factor contributing to the observed gains.

For Fib, the Hybrid scheme gets performance comparable to WF and outperforms BF, in that Fib itself is a typical recursive application. But the speedups of Fib are poor under all three strategies, with nearly no performance improvement as the number of SPEs increases. The task granularity in Fib is the main cause that restrains its scalability. Each task corresponds to a single addition, a granularity so fine that the system quickly becomes overloaded with too many tasks. That is why the parallel efficiency, as well as the scalability, is limited. Moreover, the PPE, which is engaged in task creation, is prone to becoming the performance bottleneck. To remedy the problems mentioned above, two solutions are proposed:
• The first method is to offload partial task creation to several SPEs, which alleviates the overhead of the PPE to some extent and thus improves application scalability. This scheme is intended for cases in which a substantial number of worker threads is involved, yet the problem cannot be fundamentally solved.
• The second solution is to increase the task granularity by a cut-off mechanism based on the number of created tasks or the depth of the current task, which prunes task generation and consequently prevents the runtime system from creating too many fine-grained tasks.
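The second solution above can be sketched as a simple predicate consulted before each task creation; the threshold values and names below are hypothetical:

```c
#include <stdbool.h>

#define CUTOFF_DEPTH 4   /* illustrative threshold */

/* Prune task creation once the task tree is deep enough or enough
 * tasks already exist; the work is then executed inline instead,
 * keeping the task granularity coarse. */
static bool should_spawn_task(int depth, int created, int max_tasks) {
    return depth < CUTOFF_DEPTH && created < max_tasks;
}
```

In a runtime using this predicate, a call site that would spawn a task falls back to a plain function call whenever the predicate returns false.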

For the benchmarks Multisort and FFT, the Hybrid strategy delivers slightly better speedups than the WF policy, since the Hybrid strategy adopts a link array to search for tasks, which guarantees fast task insertion and removal. Different from the above situation, in which the computation structure resides in each node, in Multisort and FFT the computation is located in the leaf nodes, avoiding the latency incurred in WF. Therefore, the Hybrid strategy competes closely with WF. However, there are still some factors that impact the scalability of Multisort:
• All the different solutions have problems scaling Multisort because of the frequent memory movement that affects its execution efficiency.
• In addition, DMA operations require that the source address and destination address both be 128-byte aligned when implementing Multisort on the Cell heterogeneous multicore. If the result of each task were written back directly to memory, it would be likely to overwrite the original data, which incurs errors. To solve this problem, the following approach is adopted: the intermediate result is written back to a 128-byte aligned address, and then the effective data at the corresponding offset is written back to the destination, which drags the performance down.
• Moreover, the task creation pattern in Multisort seems to be the critical cause of the severe performance problem. Two synchronization primitives exist in each task, as revealed in Figure 6. As a result, the scalability of the Multisort kernel is further limited.

    void sort(ele *low, ele *tmp, long size) {
        /* split array into A, B, C and D */
        #pragma omp task untied
        sort(A, tA, quart);
        #pragma omp task untied
        sort(B, tB, quart);
        #pragma omp task untied
        sort(C, tC, quart);
        #pragma omp task untied
        sort(D, tD, size - 3*quart);
        #pragma omp taskwait
        #pragma omp task untied            /* merge (AB) */
        merge(A, A+quart-1, B, B+quart-1, tA);
        #pragma omp task untied            /* merge (CD) */
        merge(C, C+quart-1, D, low+size-1, tC);
        #pragma omp taskwait
        merge(tA, tC-1, tC, tA+size-1, A); /* merge (AB)(CD) */
    }

Figure 6. Multisort task creation pattern

SparseLU and Alignment are the only two benchmarks which do not obtain drastic performance improvements from the Hybrid strategy. Especially for Alignment, it even suffers a slight performance degradation when compared with the BF scheme.

    #pragma omp parallel
    #pragma omp single nowait
    for (kk = 0; kk < bots_size; kk++)
    {
        /* function lu0 */
        for (jj = kk+1; jj < bots_size; jj++)   /* Task1 */
            #pragma omp task ...
            { /* function fwd */ }

        for (ii = kk+1; ii < bots_size; ii++)   /* Task2 */
            #pragma omp task ...
            { /* function bdiv */ }

        #pragma omp taskwait
        for (ii = kk+1; ii < bots_size; ii++)   /* Task3 */
            for (jj = kk+1; jj < bots_size; jj++)
                #pragma omp task ...
                { /* function bmod */ }

        #pragma omp taskwait
    }

Figure 7. Task parallelism abstraction for Alignment and SparseLU

There is a large supply of coarse-grained tasks in SparseLU and Alignment, and the challenge for the task implementation is exploiting the available concurrency while avoiding load imbalance. Work-stealing in the WF and Hybrid strategies can improve load balance, but it brings extra cost to the runtime system. Of course, these benchmarks allow us to verify that, with the optimizations we performed, the link array for instance, the overhead introduced in the Hybrid strategy can be kept to a minimum. Besides task granularity, the task parallel structure is a critical issue that determines the comparable performance. These kernels are representative iterative structures, as given in Figure 7. The WF policy loses the advantage it exhibits in nested structures. The Hybrid policy executes tasks in FIFO fashion for a specific level of the ULTQ, which is consistent with the BF strategy. Therefore, the BF, WF and Hybrid scheduling strategies make no significant difference for SparseLU and Alignment.

In the Hybrid strategy, the execution information can be collected by the control core, the PPE, which enables a suitable scheduling strategy and load balance. As the number of computation cores increases, different implementations of the task mechanism on the Cell architecture may obtain different performance improvements. Nevertheless, the PS3 has only 6 SPEs; therefore, the benchmarks are evaluated from 1 to 6 cores in our experiments.

VI. CONCLUSION

This paper, which combines the characteristics of the Cell heterogeneous multicore and the specification of the task model, presents the design and implementation of the Hybrid scheduling policy on the Cell multicore. The strategy generates

CONCLUSION AND FUTURE WORK
This paper, which combines the characteristics of the Cell heterogeneous multicore with the specification of the task model, presents the design and implementation of the Hybrid scheduling policy on the Cell multicore. The strategy creates tasks in a breadth-first approach and executes them in a work-first manner. The Hybrid scheduling policy, which combines the advantages of BF and WF, not only respects the OpenMP 3.0 tied property but also improves the performance of irregular and unstructured applications. The evaluations demonstrate that the Hybrid scheduling strategy outperforms the existing WF and BF policies for most irregular and unstructured applications, achieving speedups from 1.5 to 4.6.
Of course, the Hybrid policy has not exhaustively explored all possibilities, and some deficiencies may remain; for instance, radical work-stealing may occur when some SPE is idle, which degrades application performance. We will concentrate on building a proper model that facilitates beneficial work-stealing instead of arbitrary stealing. Another observation from our experiments is that a technique for selecting the thread with the heaviest workload among all busy threads is essential for directing work-stealing, so it is one of our topics for future work.
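As a rough sketch of that future direction (not something the current runtime implements), workload-directed stealing could rank the busy workers by queue length and steal from the heaviest one; `worker_t` and the queue-length proxy for workload are our own assumptions.

```c
/* Sketch of workload-directed victim selection for work-stealing:
 * prefer the busiest worker instead of an arbitrary one. This is a
 * future-work idea, not part of the implemented runtime. */
#include <assert.h>
#include <stddef.h>

typedef struct {
    int id;
    size_t queued;   /* tasks waiting in this worker's queue */
} worker_t;

/* Return the busiest worker other than the thief, or NULL if all idle. */
static const worker_t *pick_victim(const worker_t *w, size_t n, int self)
{
    const worker_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (w[i].id == self || w[i].queued == 0)
            continue;                       /* skip the thief and idlers */
        if (best == NULL || w[i].queued > best->queued)
            best = &w[i];
    }
    return best;
}
```

Returning NULL when every other worker is idle lets the thief back off instead of performing an arbitrary, and likely fruitless, steal attempt.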
Moreover, the prefetching technique, which overlaps computations with communications, is particularly intended for the cases in which multiple tasks, instead of a single one, are fetched at one time. Additionally, we will consider offloading task creation to the SPEs rather than the PPE, which efficiently avoids the sequential bottleneck and, consequently, exploits the potential of future CMPs.
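The idea of fetching multiple tasks at one time can be sketched as a batched dequeue. On the Cell, one such request would map to a single DMA transfer, so the batch amortizes its latency; the constant and names below are our assumptions, not part of the described runtime.

```c
/* Sketch of batched task fetching: grab up to FETCH_BATCH descriptors
 * per request instead of one, amortizing the per-request (DMA) cost.
 * All names and constants are illustrative assumptions. */
#include <assert.h>
#include <stddef.h>

#define FETCH_BATCH 4
#define POOL_CAP    64

typedef struct {
    int buf[POOL_CAP];
    size_t head, tail;
} task_pool_t;

/* Fetch up to FETCH_BATCH task ids in one call; returns the count. */
static size_t fetch_batch(task_pool_t *p, int out[FETCH_BATCH])
{
    size_t n = 0;
    while (n < FETCH_BATCH && p->head != p->tail)
        out[n++] = p->buf[p->head++ % POOL_CAP];
    return n;
}
```

While one batch is being executed, the next fetch can be issued, overlapping computation with communication as the paragraph above suggests.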
IEEE, 2009.
ACKNOWLEDGMENT [17] Yi Guo, Jisheng Zhao, Vincent Cave, et al. SLAW: a Scalable
The research is partially supported by the Press Found of Locality-aware Adaptive Work-stealing Scheduler. IPDPS 2010.
Beijing Technology and Business University under Grant [18] HOFFMANN R, PRELL A, RAUBER T. Dynamic Task Scheduling
and Load Balancing on Cell Processors [C]// 18th Euromicro
No. QNJJ2011-37, the National Natural Science Foundation Conference on Parallel, Distributed and Network-based Processing
of China under Grant No. 61170113, and the National Basic (PDP'10). CA, USA: IEEE Computer Society, 2010: 205-212.
Research Program (973) of China under Grant No. [19] YI Jun, SHI Weiren, TANG Yunjian and XU Lei. A Dynamic Task
2012CB821200 (2012CB821206). Scheduling for Wireless Sensor and Actuator Networks. 2010 38(6):
REFERENCES
[1] P. Gepner and M. F. Kowalik. Multi-Core Processors: New Way to Achieve High System Performance. In Proceedings of the International Symposium on Parallel Computing in Electrical Engineering (PARELEC '06), Bialystok, Poland, 2006, pp. 9-13.
[2] J. A. Kahle et al. Introduction to the Cell Multiprocessor. IBM Journal of Research and Development, 49(4):589-604, 2005.
[3] Tagged Procedure Calls (TPC): Efficient Runtime Support for Task-Based Parallelism on the Cell Processor. In HiPEAC 2010, LNCS 5952, pp. 307-321, 2010.
[4] J. M. Perez, R. M. Badia, and J. Labarta. A Dependency-Aware Task-Based Programming Environment for Multi-Core Architectures. In IEEE International Conference on Cluster Computing, 2008, pp. 142-151.
[5] C. Addison, J. LaGrone, L. Huang, and B. Chapman. OpenMP 3.0 Tasking Implementation in OpenUH. In 2nd Open64 Workshop at CGO, March 2009.
[6] X. Teruel, P. Unnikrishnan, X. Martorell, et al. OpenMP Tasks in IBM XL Compilers. In Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research (CASCON '08), New York, USA, ACM, 2008, pp. 207-221.
[7] M. Frigo, C. E. Leiserson, and K. H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), New York: ACM Press, 1998, pp. 212-223.
[8] J. Reinders. Intel Threading Building Blocks. O'Reilly Media Inc., 2007.
[9] D. Leijen, W. Schulte, and S. Burckhardt. The Design of a Task Parallel Library. In Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA '09), ACM, New York, NY, USA, 2009.
[10] OpenMP Application Program Interface, Version 3.0. OpenMP Architecture Review Board, 2008.
[11] A. Duran, J. Corbalán, and E. Ayguadé. Evaluation of OpenMP Task Scheduling Strategies. In Proceedings of the 4th International Workshop on OpenMP (IWOMP '08), pp. 101-110, May 2008.
[12] X. Teruel, X. Martorell, A. Duran, R. Ferrer, and E. Ayguadé. Support for OpenMP Tasks in Nanos v4. In CAS Conference 2007, October 2007.
[13] S. Kato and Y. Ishikawa. Gang EDF Scheduling of Parallel Task Systems. In 30th IEEE Real-Time Systems Symposium, 2009.
[14] F. Broquedis, F. Diakhaté, S. Thibault, et al. Scheduling Dynamic OpenMP Applications over Multicore Architectures.
[15] A. Tzannes, G. C. Caragea, and R. Barua. Lazy Binary-Splitting: A Run-Time Adaptive Work-Stealing Scheduler. In PPoPP '10, January 9-14, 2010, Bangalore, India.
[16] Y. Guo, R. Barik, R. Raman, et al. Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism. IEEE, 2009.
[17] Y. Guo, J. Zhao, V. Cave, et al. SLAW: A Scalable Locality-Aware Adaptive Work-Stealing Scheduler. In IPDPS 2010.
[18] R. Hoffmann, A. Prell, and T. Rauber. Dynamic Task Scheduling and Load Balancing on Cell Processors. In 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP '10), IEEE Computer Society, 2010, pp. 205-212.
[19] J. Yi, W. Shi, Y. Tang, and L. Xu. A Dynamic Task Scheduling for Wireless Sensor and Actuator Networks. 2010, 38(6).
[20] K. Yi and R. Wang. Nash Equilibrium Based Task Scheduling Algorithm of Multi-Schedulers in Grid Computing. 2009, 37(2):329-333.
[21] F. Blagojevic, D. Nikolopoulos, A. Stamatakis, and C. Antonopoulos. Dynamic Multi-Grain Parallelization on the Cell Broadband Engine. In 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '07), ACM, New York, 2007, pp. 90-100.
[22] F. Blagojevic, C. Iancu, K. Yelick, M. Curtis-Maury, et al. Scheduling Dynamic Parallelism on Accelerators. In 6th ACM Conference on Computing Frontiers (CF '09), ACM, New York, 2009, pp. 161-175.
[23] A. Buttari, P. Luszczek, J. Kurzak, et al. A Rough Guide to Scientific Computing on the PlayStation 3. Technical Report UT-CS-07-595, Version 1.0, Innovative Computing Laboratory, University of Tennessee Knoxville, 2007.
[24] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguadé. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In 38th International Conference on Parallel Processing (ICPP '09), Vienna, Austria, September 2009, pp. 124-131.