
Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System using Modern C++

Dr. Tsung-Wei (TW) Huang
Department of Electrical and Computer Engineering
University of Utah, Salt Lake City, UT
https://taskflow.github.io/

Why Parallel Computing?
• It’s critical to advancing your application performance

[Bar chart: time to solve a machine learning workload, from 0 to 600 seconds; 40 CPUs run about 10x faster than 1 CPU, and 1 GPU about 100x faster]
Parallel Programming is Not Easy, Yet
• You need to deal with many difficult technical details:
  • Standard concurrency control
  • Task dependencies
  • Scheduling
  • Data races
  • … (more)

[Word cloud: concurrency control, task and data race, dependency, scheduling, debugging, constraints, efficiencies]

Many developers have a hard time getting them right!
Taskflow Offers a Solution

How can we make it easier for C++ developers to quickly write parallel and heterogeneous programs with high performance scalability and, simultaneously, high productivity?
“Hello World” in Taskflow
#include <taskflow/taskflow.hpp>  // Taskflow is header-only
int main(){
  tf::Taskflow taskflow;
  tf::Executor executor;
  auto [A, B, C, D] = taskflow.emplace(
    [] () { std::cout << "TaskA\n"; },
    [] () { std::cout << "TaskB\n"; },
    [] () { std::cout << "TaskC\n"; },
    [] () { std::cout << "TaskD\n"; }
  );
  A.precede(B, C);  // A runs before B and C
  D.succeed(B, C);  // D runs after B and C
  executor.run(taskflow).wait();  // submit the taskflow to the executor
  return 0;
}

Only 15 lines of code to get a parallel task execution!
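As a usage note (a sketch; the include path and file name are placeholders), the program is header-only and builds with any C++17 compiler plus a thread library:

g++ -std=c++17 -I /path/to/taskflow hello_world.cpp -pthread -o hello_world
./hello_world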
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Motivation: Parallelizing VLSI CAD Tools
• Billions of tasks with diverse computational patterns

[Figure: the VLSI CAD flow (system specification, partition, floorplan, placement, clock tree synthesis, routing, signoff with DRC/LVS, manufacturing, final chip), annotated with its computational patterns: machine learning in the loop, combinatorial optimization and analytical methods, NP-hard problems, irregular graphs, dynamic controls, modeling and simulation, and data science and regression for designs of 10B+ transistors; beside it, a partial task graph of two optimizer iterations with per-iteration kernels and control-flow tasks]

How can we write efficient C++ parallel programs for this monster computational task graph with millions of CPU-GPU dependent tasks along with algorithmic control flow?
We Invested a Lot in Existing Tools …
Two Big Problems of Existing Tools
• Our problems define complex task dependencies
  • Example: analysis algorithms compute circuit networks of millions of nodes and dependencies
  • Problem: existing tools are often good at loop parallelism but weak in expressing heterogeneous task graphs at this large scale
• Our problems define complex control flow
  • Example: optimization algorithms make essential use of dynamic control flow to implement various patterns (combinatorial optimization, analytical methods)
  • Problem: existing tools are directed acyclic graph (DAG)-based and do not anticipate cycles or conditional dependencies, lacking end-to-end parallelism
Example: An Iterative Optimizer
• 4 computational tasks with dynamic control flow
  #1: starts with the init task
  #2: enters the optimizer task (e.g., a GPU math solver)
  #3: checks whether the optimization converged
    • No: loops back to the optimizer
    • Yes: proceeds to stop
  #4: outputs the result

How can we easily describe this workload with dynamic control flow using existing tools (e.g., OpenMP, TBB, StarPU, SYCL, Kokkos)?

[Diagram: init → optimizer → converged? (N loops back to optimizer, Y proceeds to output)]

Millions of such tasks? End-to-end parallelism?
Need a New C++ Parallel Programming System

While designing parallel algorithms is non-trivial, what makes parallel programming an enormous challenge is the infrastructure work of “how to efficiently express dependent tasks along with an algorithmic control flow and schedule them across heterogeneous computing resources.”
Taskflow Project Mantra
• Performance: we maximize performance compared to handcrafted solutions
• Productivity: we maximize productivity compared to handcrafted development time
• Portability: we maximize portability using the power of modern C++

• We are not out to replace existing tools but to
  1. Address their limitations on task graph parallelism
  2. Develop a compatible interface to reuse their facilities

Together, we can deliver complementary advantages to advance C++ parallelism.
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Many arguments are based on my personal opinions – no offense, no criticism, just plain C++ from an end user’s perspective.
“Hello World” in Taskflow (Revisited)
#include <taskflow/taskflow.hpp>  // Taskflow is header-only
int main(){
  tf::Taskflow taskflow;
  tf::Executor executor;
  auto [A, B, C, D] = taskflow.emplace(
    [] () { std::cout << "TaskA\n"; },
    [] () { std::cout << "TaskB\n"; },
    [] () { std::cout << "TaskC\n"; },
    [] () { std::cout << "TaskD\n"; }
  );
  A.precede(B, C);  // A runs before B and C
  D.succeed(B, C);  // D runs after B and C
  executor.run(taskflow).wait();  // submit the taskflow to the executor
  return 0;
}

Taskflow defines five task types:
1. static task
2. dynamic task
3. cudaFlow task
4. condition task
5. module task
“Hello World” in OpenMP
#include <omp.h>  // OpenMP is a language extension that describes parallelism using compiler directives
#include <iostream>
#include <thread>
int main(){
  #pragma omp parallel num_threads(std::thread::hardware_concurrency())
  {
    #pragma omp single  // one thread creates the four dependent tasks
    {
      int A_B, A_C, B_D, C_D;
      #pragma omp task depend(out: A_B, A_C)             // task dependency clause
      { std::cout << "TaskA\n"; }
      #pragma omp task depend(in: A_B) depend(out: B_D)  // task dependency clauses
      { std::cout << "TaskB\n"; }
      #pragma omp task depend(in: A_C) depend(out: C_D)  // task dependency clauses
      { std::cout << "TaskC\n"; }
      #pragma omp task depend(in: B_D, C_D)              // task dependency clause
      { std::cout << "TaskD\n"; }
    }
  }
  return 0;
}

OpenMP task clauses are static and explicit; programmers are responsible for writing tasks in an order consistent with sequential execution.
“Hello World” in TBB
#include <tbb/tbb.h>  // Intel’s TBB is a general-purpose parallel programming library in C++
#include <iostream>
int main(){
  using namespace tbb;
  using namespace tbb::flow;

  int n = task_scheduler_init::default_num_threads();
  task_scheduler_init init(n);
  graph g;  // use TBB’s FlowGraph for task parallelism

  // declare each task as a continue_node
  continue_node<continue_msg> A(g, [] (const continue_msg&) { std::cout << "TaskA"; });
  continue_node<continue_msg> B(g, [] (const continue_msg&) { std::cout << "TaskB"; });
  continue_node<continue_msg> C(g, [] (const continue_msg&) { std::cout << "TaskC"; });
  continue_node<continue_msg> D(g, [] (const continue_msg&) { std::cout << "TaskD"; });

  make_edge(A, B);
  make_edge(A, C);
  make_edge(B, D);
  make_edge(C, D);
  A.try_put(continue_msg());
  g.wait_for_all();
}

TBB has excellent performance in generic parallel computing. Its drawback is mostly in the ease-of-use standpoint (simplicity, expressivity, and programmability).

TBB FlowGraph: https://software.intel.com/content/www/us/en/develop/home.html
“Hello World” in Kokkos
// Fixed-layout task functors (no lambda interface …?); each takes a team handle
struct A {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskA\n"; }
};
struct B {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskB\n"; }
};
struct C {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskC\n"; }
};
struct D {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskD\n"; }
};

// task dependencies are represented by instances of Kokkos::BasicFuture;
// when_all aggregates dependencies
auto scheduler = scheduler_type(/* ... */);
auto futA = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler), A() );
auto futB = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler, futA), B() );
auto futC = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler, futA), C() );
auto futD = Kokkos::host_spawn(
  Kokkos::TaskSingle(scheduler, when_all(futB, futC)), D()
);
// more scheduling code to follow …

Kokkos is powerful in describing asynchronous tasks but not efficient in large task graph parallelism.

Kokkos task parallelism: https://github.com/kokkos/kokkos/wiki/Task-Parallelism
“Hello World” Summary (Less Biased)
Vote for Simplicity
(100 C++ programmers of 2-5 years of C++11 experience)

[Pie chart: Taskflow 74, OpenMP 15, TBB 6, Kokkos 4, std::thread 1]

#1 concern: “My application is already very complex; it’s important the parallel programming library doesn’t become another burden.”
Dynamic Tasking (Subflow)
// create three regular tasks
tf::Task A = tf.emplace([](){}).name("A");
tf::Task C = tf.emplace([](){}).name("C");
tf::Task D = tf.emplace([](){}).name("D");

// create a subflow graph (dynamic tasking)
tf::Task B = tf.emplace([] (tf::Subflow& subflow) {
  tf::Task B1 = subflow.emplace([](){}).name("B1");
  tf::Task B2 = subflow.emplace([](){}).name("B2");
  tf::Task B3 = subflow.emplace([](){}).name("B3");
  B1.precede(B3);
  B2.precede(B3);
}).name("B");

A.precede(B);  // B runs after A
A.precede(C);  // C runs after A
B.precede(D);  // D runs after B
C.precede(D);  // D runs after C
Subflow can be Nested
• Find the 7th Fibonacci number using subflow
• Fib(n) = Fib(n-1) + Fib(n-2)
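A minimal sketch of this pattern, following the recursive Fibonacci example in Taskflow’s documentation; each call spawns two nested subflow tasks and joins them before summing the results:

#include <taskflow/taskflow.hpp>
#include <iostream>

int spawn(int n, tf::Subflow& sbf) {
  if (n < 2) return n;
  int res1, res2;
  sbf.emplace([&res1, n] (tf::Subflow& sbf1) { res1 = spawn(n - 1, sbf1); });
  sbf.emplace([&res2, n] (tf::Subflow& sbf2) { res2 = spawn(n - 2, sbf2); });
  sbf.join();  // wait for the two nested subflows to finish
  return res1 + res2;
}

int main() {
  tf::Taskflow taskflow;
  tf::Executor executor;
  int result;
  taskflow.emplace([&result] (tf::Subflow& sbf) { result = spawn(7, sbf); });
  executor.run(taskflow).wait();
  std::cout << "Fib(7) = " << result << '\n';  // prints 13
  return 0;
}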
Heterogeneous Tasking (cudaFlow)
const unsigned N = 1<<20;
std::vector<float> hx(N, 1.0f), hy(N, 2.0f);
float *dx{nullptr}, *dy{nullptr};

auto allocate_x = taskflow.emplace([&](){ cudaMalloc(&dx, N*sizeof(float)); });
auto allocate_y = taskflow.emplace([&](){ cudaMalloc(&dy, N*sizeof(float)); });

// the cudaFlow maps to an Nvidia cudaGraph
auto cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
  auto h2d_x = cf.copy(dx, hx.data(), N);  // CPU-GPU data transfer
  auto h2d_y = cf.copy(dy, hy.data(), N);
  auto d2h_x = cf.copy(hx.data(), dx, N);  // GPU-CPU data transfer
  auto d2h_y = cf.copy(hy.data(), dy, N);
  auto kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
  kernel.succeed(h2d_x, h2d_y).precede(d2h_x, d2h_y);
});

cudaflow.succeed(allocate_x, allocate_y);
executor.run(taskflow).wait();

Users define GPU work in a graph rather than as aggregated operations → a single graph launch reduces overheads.
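The slide omits the saxpy kernel itself; a minimal CUDA definition consistent with the cf.kernel(...) call above might look like this (an assumption, not the talk’s exact code):

// assumption: one possible saxpy matching cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy)
__global__ void saxpy(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n) {
    y[i] = a * x[i] + y[i];  // y = a*x + y
  }
}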
Three Key Motivations
• Our closure enables a stateful interface
  • Users capture data by reference to marshal data exchange between CPU and GPU tasks
• Our closure hides implementation details judiciously
  • We use cudaGraph (since CUDA 10) due to its excellent performance, much faster than streams for large graphs
• Our closure extends to new accelerator types
  • syclFlow, openclFlow, coralFlow, tpuFlow, fpgaFlow, etc.

We do not simplify kernel programming but focus on CPU-GPU tasking, which affects performance to a large extent! (The same holds for data abstraction.)
Conditional Tasking
auto init = taskflow.emplace([&](){ initialize_data_structure(); })
                    .name("init");
auto optimizer = taskflow.emplace([&](){ matrix_solver(); })
                         .name("optimizer");
auto converged = taskflow.emplace([&](){ return is_converged() ? 1 : 0; })  // is_converged(): user-provided convergence check
                         .name("converged");
auto output = taskflow.emplace([&](){ std::cout << "done!\n"; })
                      .name("output");

init.precede(optimizer);
optimizer.precede(converged);
converged.precede(optimizer, output);  // returning 0 loops back to the optimizer; returning 1 proceeds to output

[Diagram: init → optimizer → converged? (0 loops back to optimizer, 1 proceeds to output)]

A condition task integrates control flow into a task graph to form end-to-end parallelism; in this example, only four tasks are ever created.
Conditional Tasking (cont’d)
auto A = taskflow.emplace([&](){ });
auto B = taskflow.emplace([&](){ return rand()%2; });
auto C = taskflow.emplace([&](){ return rand()%2; });
auto D = taskflow.emplace([&](){ return rand()%2; });
auto E = taskflow.emplace([&](){ return rand()%2; });
auto F = taskflow.emplace([&](){ return rand()%2; });
auto G = taskflow.emplace([&](){});

A.precede(B).name("init");
B.precede(C, B).name("flip-coin-1");  // each task flips a binary coin
C.precede(D, B).name("flip-coin-2");  // to decide the next path
D.precede(E, B).name("flip-coin-3");
E.precede(F, B).name("flip-coin-4");
F.precede(G, B).name("flip-coin-5");
G.name("end");

You can describe non-deterministic, nested control flow!
Existing Frameworks on Control Flow?
• Expand a task graph across fixed-length iterations
  • Graph size grows linearly with the number of decision points
• Unknown iteration counts? Non-deterministic conditions?
  • Requires complex dynamic tasks that execute “if” on the fly
  • What about dynamic control flow plus dynamic tasks?
• … (or resort to client-side decisions)

Existing frameworks for expressing conditional tasking or dynamic control flow suffer from exponential growth of code complexity.
Composable Tasking
tf::Taskflow f1, f2;

auto [f1A, f1B] = f1.emplace(
  []() { std::cout << "Task f1A\n"; },
  []() { std::cout << "Task f1B\n"; }
);
auto [f2A, f2B, f2C] = f2.emplace(
  []() { std::cout << "Task f2A\n"; },
  []() { std::cout << "Task f2B\n"; },
  []() { std::cout << "Task f2C\n"; }
);

auto f1_module_task = f2.composed_of(f1);
f1_module_task.succeed(f2A, f2B)
              .precede(f2C);
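As a usage sketch (assuming an executor as on the earlier slides), running f2 executes f1 as a module task between {f2A, f2B} and f2C:

tf::Executor executor;
executor.run(f2).wait();  // f1A and f1B run after f2A and f2B, and before f2C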
Everything is Unified in Taskflow
• Use “emplace” to create a task
• Use “precede” to add a task dependency
• No need to learn different sets of APIs
• You can create a really complex graph by mixing dynamic tasks, cudaFlows, control flow, and composition:
  • Subflow(ConditionTask(cudaFlow))
  • ConditionTask(StaticTask(cudaFlow))
  • Composition(Subflow(ConditionTask))
  • …
• The scheduler performs end-to-end optimization
  • Runtime, energy efficiency, and throughput
Example: k-means Clustering
• One cudaFlow for host-to-device data transfers
• One cudaFlow for finding the k centroids
• One condition task to model iterations
• One cudaFlow for device-to-host data transfers
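A structural sketch of this composition (the kernels, transfers, and max_iterations bound are placeholders, not the talk’s actual k-means implementation):

#include <taskflow/taskflow.hpp>

int main() {
  tf::Taskflow taskflow;
  tf::Executor executor;
  const int max_iterations = 100;  // placeholder stopping criterion

  auto h2d    = taskflow.emplace([&](tf::cudaFlow& cf){ /* copy points and seeds to the GPU */ });
  auto update = taskflow.emplace([&](tf::cudaFlow& cf){ /* assign points and update the k centroids */ });
  auto cond   = taskflow.emplace([&, i=0]() mutable {
    return (++i < max_iterations) ? 0 : 1;  // 0: iterate again, 1: done
  });
  auto d2h    = taskflow.emplace([&](tf::cudaFlow& cf){ /* copy the centroids back to the CPU */ });

  h2d.precede(update);
  update.precede(cond);
  cond.precede(update, d2h);  // loop back on 0, proceed on 1
  executor.run(taskflow).wait();
  return 0;
}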
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Submit Taskflow to Executor
• The executor manages a set of threads to run taskflows
• All execution methods are non-blocking
• All execution methods are thread-safe

{
  tf::Taskflow taskflow1, taskflow2, taskflow3;
  tf::Executor executor;
  // create tasks and dependencies
  // …
  auto future1 = executor.run(taskflow1);
  auto future2 = executor.run_n(taskflow2, 1000);
  auto future3 = executor.run_until(taskflow3, [i=0]() mutable { return i++ > 5; });
  executor.async([](){ std::cout << "async task\n"; });
  executor.wait_for_all();  // wait for all the above tasks to finish
}
Executor Scheduling Algorithm
• Task-level scheduling
  • Decides how tasks are enqueued under control flow
    • Goal #1: ensure a feasible path to carry out control flow
    • Goal #2: avoid task races under cyclic and conditional execution
    • Goal #3: maximize the capability of conditional tasking
• Worker-level scheduling
  • Decides how tasks are executed and by which workers
    • Goal #1: adopt work stealing to dynamically balance load
    • Goal #2: adapt the number of active workers to the available task parallelism
    • Goal #3: maximize performance, energy efficiency, and throughput
Task-level Scheduling
• “Strong dependency” versus “weak dependency”
  • Weak dependency: a dependency that goes out of a condition task
  • Strong dependency: any other dependency

The task-level scheduling loop works as follows: if the queue is empty, wait for tasks; otherwise dequeue a task t. If t is not a condition task, invoke(t), decrement the strong-dependency count of each of t’s successors by one, and enqueue every successor whose count reaches zero. If t is a condition task, compute r = invoke(t) and enqueue only the r-th successor.

[Diagram: the init → optimizer → converged? → output example, in which the 0-edge back to the optimizer and the 1-edge to the output are weak dependencies]
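A schematic, single-threaded sketch of that loop (illustrative types and names, not Taskflow’s actual implementation):

#include <deque>
#include <functional>
#include <vector>

struct Task {
  std::function<int()> work;      // condition tasks return a branch index
  bool is_condition = false;
  int strong_dependents = 0;      // number of unfinished strong predecessors
  std::vector<Task*> successors;
};

void schedule_loop(std::deque<Task*>& queue) {
  while (!queue.empty()) {        // the real executor waits for tasks instead of exiting
    Task* t = queue.front();
    queue.pop_front();
    int r = t->work();            // invoke(t)
    if (t->is_condition) {
      // enqueue only the r-th successor; the other branches stay dormant
      if (r >= 0 && r < static_cast<int>(t->successors.size())) {
        queue.push_back(t->successors[r]);
      }
    } else {
      // decrement each successor's strong-dependency count; enqueue at zero
      for (Task* s : t->successors) {
        if (--s->strong_dependents == 0) {
          queue.push_back(s);
        }
      }
    }
  }
}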
Task-level Scheduling (cont’d)
• The condition task is powerful but prone to mistakes …

It is the user’s responsibility to ensure a taskflow is properly conditioned, i.e., free of task races under our task-level scheduling policy.
Worker-level Scheduling
• Taskflow adopts work stealing to run tasks
• What is work stealing?
  • “I finish my jobs first, and then steal jobs from you”
  • It improves performance through dynamic load balancing

Work stealing is commonly adopted by parallel task programming libraries (e.g., TBB, StarPU, TPL).
CppCon 2015: Pablo Halpern, “Work Stealing,” https://www.youtube.com/watch?v=iLHNF7SgVN4
Worker-level Scheduling (cont’d)
• Challenge #1: distinct CPU-GPU performance traits
• Challenge #2: available task parallelism varies
• Challenge #3: wasteful steals eat up performance

• We solve the three challenges as follows:
  1. Keep a separate set of workers per heterogeneous domain (e.g., CPU workers, GPU workers)
  2. Keep an invariant that balances the number of active workers with the available task parallelism
  3. Put workers to sleep when tasks are scarce, and wake workers up to run tasks, following the invariant
Worker-level Scheduling (cont’d)
[Diagram: CPU worker threads and GPU worker threads, each with a private task queue; an owner pushes and pops at one end while other workers steal from the opposite end; CPU workers steal CPU tasks, GPU workers steal GPU tasks (e.g., H2D/D2H copies and kernels); external threads push to a shared CPU task queue and a shared GPU task queue]

The design generalizes to arbitrary heterogeneous domains.


Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Micro-benchmarks
• Randomly generate graphs with CPU-GPU tasks
  • CPU task: aX + Y (saxpy) with 1K elements
  • GPU task: aX + Y (saxpy) with 1M elements
• Comparison with TBB, StarPU, HPX, and OpenMP
  • What is the turnaround time to program?
  • What is the overhead of task graph parallelism?

[Table I: programming cost; Table II: task graph overhead (amortized)]

SLOCCount: https://dwheeler.com/sloccount/
Micro-benchmarks (cont’d)
• Performance on 40 Intel CPUs and 4 Nvidia GPUs
Application 1: Machine Learning
• Compute a 1920-layer DNN with 65536 neurons per layer
• IEEE HPEC 2020 Neural Network Challenge

[Figure: a partial taskflow graph of 4 cudaFlows, 6 static tasks, and 8 conditioned cycles for this workload; each cudaFlow contains thousands of GPU tasks]
Application 1: Machine Learning (cont’d)
• Comparison with TBB and StarPU
  • Unroll task graphs across iterations found in hindsight
  • Implement cudaGraph for all
• Taskflow’s runtime is up to 2x faster, and its memory footprint up to 1.6x smaller, due to conditional tasking

Champions of the HPEC 2020 Graph Challenge: https://graphchallenge.mit.edu/champions
Application 2: VLSI Placement
• Optimize cell locations on a chip
• VLSI optimization makes essential use of dynamic control flow

[Figure: a partial task dependency graph (TDG) of 4 cudaFlows, 1 conditioned cycle, and 12 static tasks]
Application 2: VLSI Placement (cont’d)
• Runtime, memory, power, and throughput

The performance improvement comes from the end-to-end expression of CPU-GPU dependent tasks using condition tasks.
Parallel Programming Infrastructure Matters

Different models give different implementations. The parallel code/algorithm may run fast, yet the parallel computing infrastructure that supports it may dominate the overall performance.

Taskflow enables end-to-end expression of CPU-GPU dependent tasks along with algorithmic control flow.
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Parallel Computing is Never Standalone

No One Can Express All Parallelisms …
• Languages ∪ Compilers ∪ Libraries ∪ Programmers
IMHO, C++ Parallelism Needs Enhancement

• C++ parallelism is primitive (but in good shape)
  • std::thread is powerful but very low-level
  • std::async leaves off handling task dependencies
  • There is no easy way to describe control flow in parallelism
    • The C++17 parallel STL counts on bulk synchronous parallelism
  • There is no standard way to offload tasks to accelerators (GPUs)
• Existing third-party tools have enabled vast success, but they
  • Lack an easy and expressive interface for task graph parallelism
  • Lack a mechanism for modeling control flow
  • Lack an efficient executor for heterogeneous tasking
    • Good at either CPU- or GPU-focused workloads, but rarely both simultaneously
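To make the std::async point concrete, here is a minimal sketch of the earlier A → (B, C) → D diamond written with plain std::async; with no way to declare task dependencies, we must serialize on futures by hand:

#include <future>
#include <iostream>

int main() {
  auto A = std::async(std::launch::async, [](){ std::cout << "TaskA\n"; });
  A.wait();  // B and C cannot declare "runs after A"; we block manually
  auto B = std::async(std::launch::async, [](){ std::cout << "TaskB\n"; });
  auto C = std::async(std::launch::async, [](){ std::cout << "TaskC\n"; });
  B.wait();  // D must manually wait on both B and C
  C.wait();
  std::cout << "TaskD\n";
  return 0;
}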
Conclusion
• Taskflow is a general-purpose parallel tasking tool
  • Simple, efficient, and transparent tasking models
  • An efficient heterogeneous work-stealing executor
  • Promising performance in large-scale ML and VLSI CAD
• Taskflow is not out to replace anyone, but to
  • Complement the current state of the art
  • Leverage modern C++ to express task graph parallelism
• Taskflow is very open to collaboration
  • We want to integrate OpenCL, SYCL, Intel DPC++, etc.
  • We want to provide higher-level algorithms
  • We want to broaden real use cases
Thank You All for Using Taskflow!
Dr. Tsung-Wei Huang
tsung-wei.huang@utah.edu
Taskflow: https://taskflow.github.io
