
Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System using Modern C++

Dr. Tsung-Wei (TW) Huang
Department of Electrical and Computer Engineering
University of Utah, Salt Lake City, UT
https://taskflow.github.io/

Why Parallel Computing?
• It’s critical to advancing your application performance

[Bar chart: time to solve a machine learning workload, from 0 to 600 seconds; 40 CPUs run about 10x faster than 1 CPU, and 1 GPU about 100x faster]
Parallel Programming is Not Easy, Yet
• You need to deal with many difficult technical details:
  • Standard concurrency control
  • Task dependencies
  • Scheduling
  • Data races
  • … (more)

[Word cloud: concurrency control, task and data race, dependency, scheduling, debugging, constraints, efficiencies]

Many developers have a hard time getting them right!
Taskflow Offers a Solution

How can we make it easier for C++ developers to quickly write parallel and heterogeneous programs with high performance scalability and, simultaneously, high productivity?
“Hello World” in Taskflow
#include <taskflow/taskflow.hpp>  // Taskflow is header-only
int main(){
  tf::Taskflow taskflow;
  tf::Executor executor;
  auto [A, B, C, D] = taskflow.emplace(
    [] () { std::cout << "TaskA\n"; },
    [] () { std::cout << "TaskB\n"; },
    [] () { std::cout << "TaskC\n"; },
    [] () { std::cout << "TaskD\n"; }
  );
  A.precede(B, C);  // A runs before B and C
  D.succeed(B, C);  // D runs after B and C
  executor.run(taskflow).wait();  // submit the taskflow to the executor
  return 0;
}

Only 15 lines of code to get a parallel task execution!
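As a usage note (a sketch; the include path and file name are placeholders), the program is header-only and builds with any C++17 compiler plus a thread library:

g++ -std=c++17 -I /path/to/taskflow hello_world.cpp -pthread -o hello_world
./hello_world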
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Motivation: Parallelizing VLSI CAD Tools
• Billions of tasks with diverse computational patterns

[Figure: the VLSI CAD flow (system specification, partition, floorplan, placement, clock tree synthesis, routing, signoff with DRC/LVS, manufacturing, final chip), annotated with its computational patterns: machine learning in the loop, combinatorial optimization and analytical methods, NP-hard problems, irregular graphs, dynamic controls, modeling and simulation, and data science and regression for designs of 10B+ transistors; beside it, a partial task graph of two optimizer iterations with per-iteration kernels and control-flow tasks]

How can we write efficient C++ parallel programs for this monster computational task graph with millions of CPU-GPU dependent tasks along with algorithmic control flow?
We Invested a Lot in Existing Tools …
Two Big Problems of Existing Tools
• Our problems define complex task dependencies
  • Example: analysis algorithms compute circuit networks of millions of nodes and dependencies
  • Problem: existing tools are often good at loop parallelism but weak in expressing heterogeneous task graphs at this large scale
• Our problems define complex control flow
  • Example: optimization algorithms make essential use of dynamic control flow to implement various patterns (combinatorial optimization, analytical methods)
  • Problem: existing tools are directed acyclic graph (DAG)-based and do not anticipate cycles or conditional dependencies, lacking end-to-end parallelism
Example: An Iterative Optimizer
• 4 computational tasks with dynamic control flow
  #1: starts with the init task
  #2: enters the optimizer task (e.g., a GPU math solver)
  #3: checks whether the optimization converged
    • No: loops back to the optimizer
    • Yes: proceeds to stop
  #4: outputs the result

How can we easily describe this workload with dynamic control flow using existing tools (e.g., OpenMP, TBB, StarPU, SYCL, Kokkos)?

[Diagram: init → optimizer → converged? (N loops back to optimizer, Y proceeds to output)]

Millions of such tasks? End-to-end parallelism?
Need a New C++ Parallel Programming System

While designing parallel algorithms is non-trivial, what makes parallel programming an enormous challenge is the infrastructure work of “how to efficiently express dependent tasks along with an algorithmic control flow and schedule them across heterogeneous computing resources.”
Taskflow Project Mantra
• Performance: we maximize performance compared to handcrafted solutions
• Productivity: we maximize productivity compared to handcrafted development time
• Portability: we maximize portability using the power of modern C++

• We are not out to replace existing tools but to
  1. Address their limitations on task graph parallelism
  2. Develop a compatible interface to reuse their facilities

Together, we can deliver complementary advantages to advance C++ parallelism.
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Many arguments are based on my personal opinions – no offense, no criticism, just plain C++ from an end user’s perspective.
“Hello World” in Taskflow (Revisited)
#include <taskflow/taskflow.hpp>  // Taskflow is header-only
int main(){
  tf::Taskflow taskflow;
  tf::Executor executor;
  auto [A, B, C, D] = taskflow.emplace(
    [] () { std::cout << "TaskA\n"; },
    [] () { std::cout << "TaskB\n"; },
    [] () { std::cout << "TaskC\n"; },
    [] () { std::cout << "TaskD\n"; }
  );
  A.precede(B, C);  // A runs before B and C
  D.succeed(B, C);  // D runs after B and C
  executor.run(taskflow).wait();  // submit the taskflow to the executor
  return 0;
}

Taskflow defines five task types:
1. static task
2. dynamic task
3. cudaFlow task
4. condition task
5. module task
“Hello World” in OpenMP
#include <omp.h>  // OpenMP is a language extension that describes parallelism using compiler directives
#include <iostream>
#include <thread>
int main(){
  #pragma omp parallel num_threads(std::thread::hardware_concurrency())
  {
    #pragma omp single  // one thread creates the four dependent tasks
    {
      int A_B, A_C, B_D, C_D;
      #pragma omp task depend(out: A_B, A_C)             // task dependency clause
      { std::cout << "TaskA\n"; }
      #pragma omp task depend(in: A_B) depend(out: B_D)  // task dependency clauses
      { std::cout << "TaskB\n"; }
      #pragma omp task depend(in: A_C) depend(out: C_D)  // task dependency clauses
      { std::cout << "TaskC\n"; }
      #pragma omp task depend(in: B_D, C_D)              // task dependency clause
      { std::cout << "TaskD\n"; }
    }
  }
  return 0;
}

OpenMP task clauses are static and explicit; programmers are responsible for writing tasks in an order consistent with sequential execution.
“Hello World” in TBB
#include <tbb/tbb.h>  // Intel’s TBB is a general-purpose parallel programming library in C++
#include <iostream>
int main(){
  using namespace tbb;
  using namespace tbb::flow;

  int n = task_scheduler_init::default_num_threads();
  task_scheduler_init init(n);
  graph g;  // use TBB’s FlowGraph for task parallelism

  // declare each task as a continue_node
  continue_node<continue_msg> A(g, [] (const continue_msg&) { std::cout << "TaskA"; });
  continue_node<continue_msg> B(g, [] (const continue_msg&) { std::cout << "TaskB"; });
  continue_node<continue_msg> C(g, [] (const continue_msg&) { std::cout << "TaskC"; });
  continue_node<continue_msg> D(g, [] (const continue_msg&) { std::cout << "TaskD"; });

  make_edge(A, B);
  make_edge(A, C);
  make_edge(B, D);
  make_edge(C, D);
  A.try_put(continue_msg());
  g.wait_for_all();
}

TBB has excellent performance in generic parallel computing. Its drawback is mostly in the ease-of-use standpoint (simplicity, expressivity, and programmability).

TBB FlowGraph: https://software.intel.com/content/www/us/en/develop/home.html
“Hello World” in Kokkos
// Fixed-layout task functors (no lambda interface …?); each takes a team handle
struct A {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskA\n"; }
};
struct B {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskB\n"; }
};
struct C {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskC\n"; }
};
struct D {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskD\n"; }
};

// task dependencies are represented by instances of Kokkos::BasicFuture;
// when_all aggregates dependencies
auto scheduler = scheduler_type(/* ... */);
auto futA = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler), A() );
auto futB = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler, futA), B() );
auto futC = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler, futA), C() );
auto futD = Kokkos::host_spawn(
  Kokkos::TaskSingle(scheduler, when_all(futB, futC)), D()
);
// more scheduling code to follow …

Kokkos is powerful in describing asynchronous tasks but not efficient in large task graph parallelism.

Kokkos task parallelism: https://github.com/kokkos/kokkos/wiki/Task-Parallelism
“Hello World” Summary (Less Biased)
Vote for Simplicity
(100 C++ programmers of 2-5 years of C++11 experience)

[Pie chart: Taskflow 74, OpenMP 15, TBB 6, Kokkos 4, std::thread 1]

#1 concern: “My application is already very complex; it’s important the parallel programming library doesn’t become another burden.”
Dynamic Tasking (Subflow)
// create three regular tasks
tf::Task A = tf.emplace([](){}).name("A");
tf::Task C = tf.emplace([](){}).name("C");
tf::Task D = tf.emplace([](){}).name("D");

// create a subflow graph (dynamic tasking)
tf::Task B = tf.emplace([] (tf::Subflow& subflow) {
  tf::Task B1 = subflow.emplace([](){}).name("B1");
  tf::Task B2 = subflow.emplace([](){}).name("B2");
  tf::Task B3 = subflow.emplace([](){}).name("B3");
  B1.precede(B3);
  B2.precede(B3);
}).name("B");

A.precede(B);  // B runs after A
A.precede(C);  // C runs after A
B.precede(D);  // D runs after B
C.precede(D);  // D runs after C
Subflow can be Nested
• Find the 7th Fibonacci number using subflow
• Fib(n) = Fib(n-1) + Fib(n-2)
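A minimal sketch of this pattern, following the recursive Fibonacci example in Taskflow’s documentation; each call spawns two nested subflow tasks and joins them before summing the results:

#include <taskflow/taskflow.hpp>
#include <iostream>

int spawn(int n, tf::Subflow& sbf) {
  if (n < 2) return n;
  int res1, res2;
  sbf.emplace([&res1, n] (tf::Subflow& sbf1) { res1 = spawn(n - 1, sbf1); });
  sbf.emplace([&res2, n] (tf::Subflow& sbf2) { res2 = spawn(n - 2, sbf2); });
  sbf.join();  // wait for the two nested subflows to finish
  return res1 + res2;
}

int main() {
  tf::Taskflow taskflow;
  tf::Executor executor;
  int result;
  taskflow.emplace([&result] (tf::Subflow& sbf) { result = spawn(7, sbf); });
  executor.run(taskflow).wait();
  std::cout << "Fib(7) = " << result << '\n';  // prints 13
  return 0;
}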
Heterogeneous Tasking (cudaFlow)
const unsigned N = 1<<20;
std::vector<float> hx(N, 1.0f), hy(N, 2.0f);
float *dx{nullptr}, *dy{nullptr};

auto allocate_x = taskflow.emplace([&](){ cudaMalloc(&dx, N*sizeof(float)); });
auto allocate_y = taskflow.emplace([&](){ cudaMalloc(&dy, N*sizeof(float)); });

// the cudaFlow maps to an Nvidia cudaGraph
auto cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
  auto h2d_x = cf.copy(dx, hx.data(), N);  // CPU-GPU data transfer
  auto h2d_y = cf.copy(dy, hy.data(), N);
  auto d2h_x = cf.copy(hx.data(), dx, N);  // GPU-CPU data transfer
  auto d2h_y = cf.copy(hy.data(), dy, N);
  auto kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
  kernel.succeed(h2d_x, h2d_y).precede(d2h_x, d2h_y);
});

cudaflow.succeed(allocate_x, allocate_y);
executor.run(taskflow).wait();

Users define GPU work in a graph rather than as aggregated operations → a single graph launch reduces overheads.
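The slide omits the saxpy kernel itself; a minimal CUDA definition consistent with the cf.kernel(...) call above might look like this (an assumption, not the talk’s exact code):

// assumption: one possible saxpy matching cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy)
__global__ void saxpy(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n) {
    y[i] = a * x[i] + y[i];  // y = a*x + y
  }
}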
Three Key Motivations
• Our closure enables a stateful interface
  • Users capture data by reference to marshal data exchange between CPU and GPU tasks
• Our closure hides implementation details judiciously
  • We use cudaGraph (since CUDA 10) due to its excellent performance, much faster than streams for large graphs
• Our closure extends to new accelerator types
  • syclFlow, openclFlow, coralFlow, tpuFlow, fpgaFlow, etc.

We do not simplify kernel programming but focus on CPU-GPU tasking, which affects performance to a large extent! (The same holds for data abstraction.)
Conditional Tasking
auto init = taskflow.emplace([&](){ initialize_data_structure(); })
                    .name("init");
auto optimizer = taskflow.emplace([&](){ matrix_solver(); })
                         .name("optimizer");
auto converged = taskflow.emplace([&](){ return is_converged() ? 1 : 0; })  // is_converged(): user-provided convergence check
                         .name("converged");
auto output = taskflow.emplace([&](){ std::cout << "done!\n"; })
                      .name("output");

init.precede(optimizer);
optimizer.precede(converged);
converged.precede(optimizer, output);  // returning 0 loops back to the optimizer; returning 1 proceeds to output

[Diagram: init → optimizer → converged? (0 loops back to optimizer, 1 proceeds to output)]

A condition task integrates control flow into a task graph to form end-to-end parallelism; in this example, only four tasks are ever created.
Conditional Tasking (cont’d)
auto A = taskflow.emplace([&](){ });
auto B = taskflow.emplace([&](){ return rand()%2; });
auto C = taskflow.emplace([&](){ return rand()%2; });
auto D = taskflow.emplace([&](){ return rand()%2; });
auto E = taskflow.emplace([&](){ return rand()%2; });
auto F = taskflow.emplace([&](){ return rand()%2; });
auto G = taskflow.emplace([&](){});

A.precede(B).name("init");
B.precede(C, B).name("flip-coin-1");  // each task flips a binary coin
C.precede(D, B).name("flip-coin-2");  // to decide the next path
D.precede(E, B).name("flip-coin-3");
E.precede(F, B).name("flip-coin-4");
F.precede(G, B).name("flip-coin-5");
G.name("end");

You can describe non-deterministic, nested control flow!
Existing Frameworks on Control Flow?
• Expand a task graph across fixed-length iterations
  • Graph size grows linearly with the number of decision points
• Unknown iteration counts? Non-deterministic conditions?
  • Requires complex dynamic tasks that execute “if” on the fly
  • What about dynamic control flow plus dynamic tasks?
• … (or resort to client-side decisions)

Existing frameworks for expressing conditional tasking or dynamic control flow suffer from exponential growth of code complexity.
Composable Tasking
tf::Taskflow f1, f2;

auto [f1A, f1B] = f1.emplace(
  []() { std::cout << "Task f1A\n"; },
  []() { std::cout << "Task f1B\n"; }
);
auto [f2A, f2B, f2C] = f2.emplace(
  []() { std::cout << "Task f2A\n"; },
  []() { std::cout << "Task f2B\n"; },
  []() { std::cout << "Task f2C\n"; }
);

auto f1_module_task = f2.composed_of(f1);
f1_module_task.succeed(f2A, f2B)
              .precede(f2C);
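As a usage sketch (assuming an executor as on the earlier slides), running f2 executes f1 as a module task between {f2A, f2B} and f2C:

tf::Executor executor;
executor.run(f2).wait();  // f1A and f1B run after f2A and f2B, and before f2C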
Everything is Unified in Taskflow
• Use “emplace” to create a task
• Use “precede” to add a task dependency
• No need to learn different sets of APIs
• You can create a really complex graph by mixing dynamic tasks, cudaFlows, control flow, and composition:
  • Subflow(ConditionTask(cudaFlow))
  • ConditionTask(StaticTask(cudaFlow))
  • Composition(Subflow(ConditionTask))
  • …
• The scheduler performs end-to-end optimization
  • Runtime, energy efficiency, and throughput
Example: k-means Clustering
• One cudaFlow for host-to-device data transfers
• One cudaFlow for finding the k centroids
• One condition task to model iterations
• One cudaFlow for device-to-host data transfers
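A structural sketch of this composition (the kernels, transfers, and max_iterations bound are placeholders, not the talk’s actual k-means implementation):

#include <taskflow/taskflow.hpp>

int main() {
  tf::Taskflow taskflow;
  tf::Executor executor;
  const int max_iterations = 100;  // placeholder stopping criterion

  auto h2d    = taskflow.emplace([&](tf::cudaFlow& cf){ /* copy points and seeds to the GPU */ });
  auto update = taskflow.emplace([&](tf::cudaFlow& cf){ /* assign points and update the k centroids */ });
  auto cond   = taskflow.emplace([&, i=0]() mutable {
    return (++i < max_iterations) ? 0 : 1;  // 0: iterate again, 1: done
  });
  auto d2h    = taskflow.emplace([&](tf::cudaFlow& cf){ /* copy the centroids back to the CPU */ });

  h2d.precede(update);
  update.precede(cond);
  cond.precede(update, d2h);  // loop back on 0, proceed on 1
  executor.run(taskflow).wait();
  return 0;
}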
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Submit Taskflow to Executor
• The executor manages a set of threads to run taskflows
• All execution methods are non-blocking
• All execution methods are thread-safe

{
  tf::Taskflow taskflow1, taskflow2, taskflow3;
  tf::Executor executor;
  // create tasks and dependencies
  // …
  auto future1 = executor.run(taskflow1);
  auto future2 = executor.run_n(taskflow2, 1000);
  auto future3 = executor.run_until(taskflow3, [i=0]() mutable { return i++ > 5; });
  executor.async([](){ std::cout << "async task\n"; });
  executor.wait_for_all();  // wait for all the above tasks to finish
}
Executor Scheduling Algorithm
• Task-level scheduling
  • Decides how tasks are enqueued under control flow
    • Goal #1: ensure a feasible path to carry out control flow
    • Goal #2: avoid task races under cyclic and conditional execution
    • Goal #3: maximize the capability of conditional tasking
• Worker-level scheduling
  • Decides how tasks are executed and by which workers
    • Goal #1: adopt work stealing to dynamically balance load
    • Goal #2: adapt the number of active workers to the available task parallelism
    • Goal #3: maximize performance, energy efficiency, and throughput
Task-level Scheduling
• “Strong dependency” versus “weak dependency”
  • Weak dependency: a dependency that goes out of a condition task
  • Strong dependency: any other dependency

The task-level scheduling loop works as follows: if the queue is empty, wait for tasks; otherwise dequeue a task t. If t is not a condition task, invoke(t), decrement the strong-dependency count of each of t’s successors by one, and enqueue every successor whose count reaches zero. If t is a condition task, compute r = invoke(t) and enqueue only the r-th successor.

[Diagram: the init → optimizer → converged? → output example, in which the 0-edge back to the optimizer and the 1-edge to the output are weak dependencies]
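A schematic, single-threaded sketch of that loop (illustrative types and names, not Taskflow’s actual implementation):

#include <deque>
#include <functional>
#include <vector>

struct Task {
  std::function<int()> work;      // condition tasks return a branch index
  bool is_condition = false;
  int strong_dependents = 0;      // number of unfinished strong predecessors
  std::vector<Task*> successors;
};

void schedule_loop(std::deque<Task*>& queue) {
  while (!queue.empty()) {        // the real executor waits for tasks instead of exiting
    Task* t = queue.front();
    queue.pop_front();
    int r = t->work();            // invoke(t)
    if (t->is_condition) {
      // enqueue only the r-th successor; the other branches stay dormant
      if (r >= 0 && r < static_cast<int>(t->successors.size())) {
        queue.push_back(t->successors[r]);
      }
    } else {
      // decrement each successor's strong-dependency count; enqueue at zero
      for (Task* s : t->successors) {
        if (--s->strong_dependents == 0) {
          queue.push_back(s);
        }
      }
    }
  }
}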
Task-level Scheduling (cont’d)
• The condition task is powerful but prone to mistakes …

It is the user’s responsibility to ensure a taskflow is properly conditioned, i.e., free of task races under our task-level scheduling policy.
Worker-level Scheduling
• Taskflow adopts work stealing to run tasks
• What is work stealing?
  • “I finish my jobs first, and then steal jobs from you”
  • It improves performance through dynamic load balancing

Work stealing is commonly adopted by parallel task programming libraries (e.g., TBB, StarPU, TPL).
CppCon 2015: Pablo Halpern, “Work Stealing,” https://www.youtube.com/watch?v=iLHNF7SgVN4
Worker-level Scheduling (cont’d)
• Challenge #1: distinct CPU-GPU performance traits
• Challenge #2: available task parallelism varies
• Challenge #3: wasteful steals eat up performance

• We solve the three challenges as follows:
  1. Keep a separate set of workers per heterogeneous domain (e.g., CPU workers, GPU workers)
  2. Keep an invariant that balances the number of active workers with the available task parallelism
  3. Put workers to sleep when tasks are scarce, and wake workers up to run tasks, following the invariant
Worker-level Scheduling (cont’d)
[Diagram: CPU worker threads and GPU worker threads, each with a private task queue; an owner pushes and pops at one end while other workers steal from the opposite end; CPU workers steal CPU tasks, GPU workers steal GPU tasks (e.g., H2D/D2H copies and kernels); external threads push to a shared CPU task queue and a shared GPU task queue]

The design generalizes to arbitrary heterogeneous domains.


Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Micro-benchmarks
• Randomly generate graphs with CPU-GPU tasks
  • CPU task: aX + Y (saxpy) with 1K elements
  • GPU task: aX + Y (saxpy) with 1M elements
• Comparison with TBB, StarPU, HPX, and OpenMP
  • What is the turnaround time to program?
  • What is the overhead of task graph parallelism?

[Table I: programming cost; Table II: task graph overhead (amortized)]

SLOCCount: https://dwheeler.com/sloccount/
Micro-benchmarks (cont’d)
• Performance on 40 Intel CPUs and 4 Nvidia GPUs
Application 1: Machine Learning
• Compute a 1920-layer DNN with 65536 neurons per layer
• IEEE HPEC 2020 Neural Network Challenge

[Figure: a partial taskflow graph of 4 cudaFlows, 6 static tasks, and 8 conditioned cycles for this workload; each cudaFlow contains thousands of GPU tasks]
Application 1: Machine Learning (cont’d)
• Comparison with TBB and StarPU
  • Unroll task graphs across iterations found in hindsight
  • Implement cudaGraph for all
• Taskflow’s runtime is up to 2x faster, and its memory footprint up to 1.6x smaller, due to conditional tasking

Champions of the HPEC 2020 Graph Challenge: https://graphchallenge.mit.edu/champions
Application 2: VLSI Placement
• Optimize cell locations on a chip
• VLSI optimization makes essential use of dynamic control flow

[Figure: a partial task dependency graph (TDG) of 4 cudaFlows, 1 conditioned cycle, and 12 static tasks]
Application 2: VLSI Placement (cont’d)
• Runtime, memory, power, and throughput

The performance improvement comes from the end-to-end expression of CPU-GPU dependent tasks using condition tasks.
Parallel Programming Infrastructure Matters

Different models give different implementations. The parallel code/algorithm may run fast, yet the parallel computing infrastructure that supports it may dominate the overall performance.

Taskflow enables end-to-end expression of CPU-GPU dependent tasks along with algorithmic control flow.
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Parallel Computing is Never Standalone

No One Can Express All Parallelisms …
• Languages ∪ Compilers ∪ Libraries ∪ Programmers
IMHO, C++ Parallelism Needs Enhancement

• C++ parallelism is primitive (but in good shape)
  • std::thread is powerful but very low-level
  • std::async leaves off handling task dependencies
  • There is no easy way to describe control flow in parallelism
    • The C++17 parallel STL counts on bulk synchronous parallelism
  • There is no standard way to offload tasks to accelerators (GPUs)
• Existing third-party tools have enabled vast success, but they
  • Lack an easy and expressive interface for task graph parallelism
  • Lack a mechanism for modeling control flow
  • Lack an efficient executor for heterogeneous tasking
    • Good at either CPU- or GPU-focused workloads, but rarely both simultaneously
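To make the std::async point concrete, here is a minimal sketch of the earlier A → (B, C) → D diamond written with plain std::async; with no way to declare task dependencies, we must serialize on futures by hand:

#include <future>
#include <iostream>

int main() {
  auto A = std::async(std::launch::async, [](){ std::cout << "TaskA\n"; });
  A.wait();  // B and C cannot declare "runs after A"; we block manually
  auto B = std::async(std::launch::async, [](){ std::cout << "TaskB\n"; });
  auto C = std::async(std::launch::async, [](){ std::cout << "TaskC\n"; });
  B.wait();  // D must manually wait on both B and C
  C.wait();
  std::cout << "TaskD\n";
  return 0;
}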
Conclusion
• Taskflow is a general-purpose parallel tasking tool
  • Simple, efficient, and transparent tasking models
  • An efficient heterogeneous work-stealing executor
  • Promising performance in large-scale ML and VLSI CAD
• Taskflow is not out to replace anyone, but to
  • Complement the current state of the art
  • Leverage modern C++ to express task graph parallelism
• Taskflow is very open to collaboration
  • We want to integrate OpenCL, SYCL, Intel DPC++, etc.
  • We want to provide higher-level algorithms
  • We want to broaden real use cases
Thank You All for Using Taskflow!
Dr. Tsung-Wei Huang
tsung-wei.huang@utah.edu
Taskflow: https://taskflow.github.io
