Parallel Algorithms for C++17

Parallelizing the Standard Algorithms Library
Jared Hoberock
NVIDIA
Programming Systems and Applications Research Group

CppCon
September, 2014
Bringing Parallelism to C++
Technical Specification for Parallel Algorithms
Multi-vendor collaboration
Based on proven technologies
● Thrust (NVIDIA)
● PPL (Microsoft)
● TBB (Intel)
Multiple implementations in progress
Targeting C++17
Roadmap
Parallelism?
Motivating example
What’s included in the box
The details
Future work
What do I mean by parallelism?
That’s like threads, right?
When I say “parallel”, think “independent”
● Concurrency is an optimization
● Concurrency can be hard
● locking, exclusion, communication, shared state, data races...
What do I mean by parallelism?
Parallel programming => identifying tasks which may be performed independently
[Image credit: npr.org]

How to communicate this information to the system?

It’s easy!
Simple parallelism for everyone
Easy to access
Interoperability with existing codes
Supported as broadly as possible
Concurrency is an invisible optimization
Vendor extensible
Motivating Example
Motivating Example: Weld Vertices
Marching Cubes Algorithm

[Figure: Marching Cubes cases 0 through 3]
Motivating Example: Weld Vertices
Problem: Marching Cubes produces “triangle soup”
Solution: “Weld” redundant vertices together
[Figure: triangle mesh before and after welding, with vertices labeled by index]
Motivating Example: Weld Vertices
Easy with the right high-level algorithms
Procedure:
1. Sort triangle vertices
2. Collapse spans of like vertices
3. Search for each vertex’s unique index
Motivating Example: Weld Vertices
using namespace std;

using vertex = tuple<float,float>;

vector<vertex> vertices = input;
vector<size_t> indices(input.size());

// sort vertices to bring duplicates together

sort(vertices.begin(), vertices.end());

// find unique vertices and erase redundancies

auto redundant_begin = unique(vertices.begin(), vertices.end());
vertices.erase(redundant_begin, vertices.end());

// find index of each vertex in the list of unique vertices

my_find_all_lower_bounds(vertices.begin(), vertices.end(),
input.begin(), input.end(),
indices.begin());
Now do it in parallel?
Easy!
using namespace std;
using namespace std::experimental::parallel;

using vertex = tuple<float,float>;

vector<vertex> vertices = input;
vector<size_t> indices(input.size());

// sort vertices to bring duplicates together

sort(par, vertices.begin(), vertices.end());

// find unique vertices and erase redundancies

auto redundant_begin = unique(par, vertices.begin(), vertices.end());
vertices.erase(redundant_begin, vertices.end());

// find index of each vertex in the list of unique vertices

my_find_all_lower_bounds(par, vertices.begin(), vertices.end(),
input.begin(), input.end(),
indices.begin());
Wait, I changed my mind...
using namespace std;
using namespace std::experimental::parallel;

using vertex = tuple<float,float>;

vector<vertex> vertices = input;
vector<size_t> indices(input.size());

// sort vertices to bring duplicates together

sort(seq, vertices.begin(), vertices.end());

// find unique vertices and erase redundancies

auto redundant_begin = unique(seq, vertices.begin(), vertices.end());
vertices.erase(redundant_begin, vertices.end());

// find index of each vertex in the list of unique vertices

my_find_all_lower_bounds(seq, vertices.begin(), vertices.end(),
input.begin(), input.end(),
indices.begin());
Don’t make me choose!
using namespace std;
using namespace std::experimental::parallel;

...

execution_policy exec = seq;

if(input.size() > some_threshold) exec = par;

// sort vertices to bring duplicates together

sort(exec, vertices.begin(), vertices.end());

// find unique vertices and erase redundancies

auto redundant_begin = unique(exec, vertices.begin(), vertices.end());
vertices.erase(redundant_begin, vertices.end());

// find index of each vertex in the list of unique vertices

my_find_all_lower_bounds(exec, vertices.begin(), vertices.end(),
input.begin(), input.end(),
indices.begin());
my_find_all_lower_bounds
template<class ExecutionPolicy,
class ForwardIterator,
class InputIterator,
class OutputIterator>
OutputIterator my_find_all_lower_bounds(ExecutionPolicy& exec,
ForwardIterator haystack_begin,
ForwardIterator haystack_end,
InputIterator needles_begin,
InputIterator needles_end,
OutputIterator result)
{
return transform(exec, needles_begin, needles_end, result,
[=](auto& needle)
{
auto iter = std::lower_bound(haystack_begin, haystack_end, needle);
return std::distance(haystack_begin, iter);
});
}
my_find_all_lower_bounds
Truly general
● generic in data types (via iterators)
● generic in execution (via execution policy)

Composing our higher-level algorithms from lower-level primitives gives us
parallelism for free!
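As a self-contained illustration of the three-step weld procedure, here is a minimal sketch that uses only sequential standard algorithms. It deliberately drops the execution policy argument so that it builds without the Technical Specification. The names `weld_result` and `weld_vertices` are assumptions made for this example and do not come from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <tuple>
#include <vector>

using vertex = std::tuple<float, float>;

// Hypothetical result type: the unique vertices plus, for every input
// vertex, its index into that unique list.
struct weld_result {
    std::vector<vertex> vertices;
    std::vector<std::size_t> indices;
};

// Sequential sketch of the weld procedure (no execution policy).
weld_result weld_vertices(const std::vector<vertex>& input)
{
    weld_result out;
    out.vertices = input;
    out.indices.resize(input.size());

    // 1. sort vertices to bring duplicates together
    std::sort(out.vertices.begin(), out.vertices.end());

    // 2. collapse spans of equal vertices
    auto redundant_begin = std::unique(out.vertices.begin(), out.vertices.end());
    out.vertices.erase(redundant_begin, out.vertices.end());

    // lower_bound below requires a sorted range
    assert(std::is_sorted(out.vertices.begin(), out.vertices.end()));

    // 3. look up each input vertex's position among the unique vertices
    const std::vector<vertex>& unique_vertices = out.vertices;
    std::transform(input.begin(), input.end(), out.indices.begin(),
        [&unique_vertices](const vertex& needle) {
            auto iter = std::lower_bound(unique_vertices.begin(),
                                         unique_vertices.end(), needle);
            return static_cast<std::size_t>(
                std::distance(unique_vertices.begin(), iter));
        });
    return out;
}
```

Every step is a call to a standard algorithm, so adding a policy argument to each call is the only change the parallel version on the slides needs.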
How to write parallel programs
High-level algorithms

Control sequential/parallel execution with policies

Communicate dependencies
What’s included in the box
Execution Policies

Parallel Algorithms

Parallel Exceptions
Execution Policies
using namespace std::experimental::parallel;
std::vector<int> data = ...

// vanilla sort
sort(data.begin(), data.end());

// explicitly sequential sort

sort(seq, data.begin(), data.end());

// permitting parallel execution

sort(par, data.begin(), data.end());

// permitting vectorization as well

sort(par_vec, data.begin(), data.end());
Execution Policies
// sort with dynamically-selected execution
size_t threshold = ...
execution_policy exec = seq;
if(vec.size() > threshold)
{
exec = par;
}

sort(exec, vec.begin(), vec.end());

Execution Policies
sort(vectorize_in_this_thread, vec.begin(), vec.end());

sort(submit_to_my_thread_pool, vec.begin(), vec.end());

sort(execute_on_that_gpu, vec.begin(), vec.end());

sort(offload_to_my_fpga, vec.begin(), vec.end());

sort(send_to_the_cloud, vec.begin(), vec.end());

sort(launder_through_botnet, vec.begin(), vec.end());

Implementation-Defined
(Non-Standard)
What is an execution policy?
Promise that a particular kind of reordering will
preserve the meaning of a program

Request that the implementation use some sort of execution strategy

What sorts of reordering are allowable?
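To make the "promise" half concrete, the sketch below uses two toy policy tags, `in_order` and `reversed`. These are hypothetical names for this example, not the TS policies. A caller may hand work to the reordering policy only if the result does not depend on visiting order: the integer sum qualifies, the string of digits does not.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical policy tags for a toy algorithm (not the TS policies).
struct in_order {};   // visit elements first to last
struct reversed {};   // one possible reordering: last to first

template<class Iterator, class Function>
void toy_for_each(in_order, Iterator first, Iterator last, Function f)
{
    for (; first != last; ++first) f(*first);
}

template<class Iterator, class Function>
void toy_for_each(reversed, Iterator first, Iterator last, Function f)
{
    while (last != first) { --last; f(*last); }
}

// Order-independent: integer addition gives the same total in any order.
template<class Policy>
int sum_with(Policy policy, const std::vector<int>& v)
{
    int total = 0;
    toy_for_each(policy, v.begin(), v.end(), [&](int x) { total += x; });
    return total;
}

// Order-dependent: the digits are appended in visiting order.
template<class Policy>
std::string digits_with(Policy policy, const std::vector<int>& v)
{
    std::string out;
    toy_for_each(policy, v.begin(), v.end(), [&](int x) {
        assert(0 <= x && x <= 9);   // single digits only, to keep the sketch simple
        out += static_cast<char>('0' + x);
    });
    return out;
}
```

With the input `{1, 2, 3}`, `sum_with` returns 6 under both tags, while `digits_with` returns "123" under `in_order` and "321" under `reversed`.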

The Details
Sequential Execution
algo(seq, begin, end, func);
algo invokes function and iterator operations in sequential order in the
calling thread

[Diagram: over time, the calling thread invokes func() repeatedly, one call after another, then continues]
Parallelizable Execution
algo(par, begin, end, func);

algo is permitted to invoke function unsequenced if invoked in different
threads, and in indeterminate order if invoked in the same thread
Parallelizable Execution
algo(par, begin, end, func);
[Diagram: the calling thread forks into thread 1, thread 2, ..., thread N; each thread invokes func() repeatedly along its own local time axis; control then returns to the calling thread]
Parallelizable Execution
It is the caller’s responsibility to ensure
correctness, for example that the invocation
does not introduce data races or deadlocks.
Parallelizable Execution
// data race
int a[] = {0,1};
std::vector<int> v;
for_each(par, a, a + 2, [&](int& i)
{
v.push_back(i);
});
Parallelizable Execution
// OK (don't do this):
int a[] = {0,1};
std::vector<int> v;

std::mutex mut;
for_each(par, a, a + 2, [&](int& i)
{
mut.lock();
v.push_back(i);
mut.unlock();
});
Parallelizable Execution
// OK (do this):
int a[] = {0,1};
std::vector<int> v(2);

for_each(par, a, a + 2, [&](int& i)
{
v[i] = i;
});
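The "do this" pattern can be tried without a TS implementation. The sketch below replaces `for_each(par, ...)` with a hypothetical `toy_par_for_each` that launches one `std::thread` per element. That helper is an assumption made for this example and not how a real implementation would schedule work. It is safe because the output vector is sized up front and each invocation writes only its own slot.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Toy stand-in for for_each(par, ...): one std::thread per element.
// A real implementation would batch work; this is only an illustration.
template<class Function>
void toy_par_for_each(std::vector<int>& items, Function f)
{
    std::vector<std::thread> workers;
    workers.reserve(items.size());
    for (std::size_t k = 0; k < items.size(); ++k)
        workers.emplace_back([&items, &f, k] { f(items[k]); });
    for (std::thread& t : workers) t.join();
}

std::vector<int> copy_by_index()
{
    std::vector<int> a = {0, 1};
    std::vector<int> v(a.size());     // sized up front; never resized below
    toy_par_for_each(a, [&v](int& i) {
        assert(0 <= i && static_cast<std::size_t>(i) < v.size());
        v[static_cast<std::size_t>(i)] = i;   // each invocation writes its own slot
    });
    return v;
}
```

No mutex is needed: distinct elements of a `std::vector<int>` are distinct memory locations, and the joins order the writes before the final read.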
Parallelizable Execution
// may deadlock (don't do this):
std::atomic<int> counter{0};
int a[] = {0,1};

for_each(par, a, a + 2, [&](int& i)
{
counter++;

// spin wait for both lambdas to arrive

while(counter.load() != 2)
{
;
}

// try to do something crazy

...
});
Parallelizable + Vectorizable Execution
algo(par_vec, begin, end, func);

algo is permitted to invoke function unsequenced if invoked in different
threads, and unsequenced if invoked in the same thread
Parallelizable + Vectorizable Execution
algo(par_vec, begin, end, func);
[Diagram: the calling thread forks into thread 1, thread 2, ..., thread N; within each thread the func() invocations are interleaved with no fixed order; control then returns to the calling thread]
Difference between par & par_vec
Function invocation is unsequenced when in
different threads

When executed in the same thread

● par: unspecified, sequenced invocations
● par_vec: no sequence exists at all
Parallelizable + Vectorizable Execution

It is the caller’s responsibility to ensure correctness, for example that
the function invocations do not attempt to synchronize with each other.
Parallelizable + Vectorizable Execution
// may deadlock (don't do this):
int counter = 0;
int a[] = {0,1};

std::mutex mut;
for_each(par_vec, a, a + 2, [&](int)
{
mut.lock();
++counter;
mut.unlock();
});
Parallelizable + Vectorizable Execution
// OK:
std::atomic<int> counter{0};
int a[] = {0,1};
for_each(par_vec, a, a + 2, [&](int)
{
++counter;
});
Parallelizable + Vectorizable Execution
// Best:
int count = count_if(par_vec, ...);

Use the highest-level algorithm that does the job!
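A small sequential sketch of the same advice: the hand-written loop exposes a counter that the caller would have to protect under parallel execution, while `std::count_if` hides it. The function names and the even-number predicate are hypothetical, chosen only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written loop: the counter is shared state the caller must manage.
std::ptrdiff_t count_even_by_hand(const std::vector<int>& v)
{
    std::ptrdiff_t counter = 0;
    for (int x : v)
        if (x % 2 == 0) ++counter;
    return counter;
}

// Same result with the higher-level algorithm; no visible counter at all.
std::ptrdiff_t count_even(const std::vector<int>& v)
{
    std::ptrdiff_t n =
        std::count_if(v.begin(), v.end(), [](int x) { return x % 2 == 0; });
    assert(n == count_even_by_hand(v));   // cross-check (debug builds only)
    return n;
}
```

Because the second version has no user-visible shared state, it is the one that can take an execution policy argument without any further changes.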

Parallel Algorithms
Provide overloads of the standard algorithms which
receive an execution policy as a parameter

To the degree possible, parallelize everything

New Algorithms
reduce : reduction over a collection
result = init + a[0] + a[1] + … + a[N-1]

exclusive_scan : exclusive prefix sum

result[i] = init + a[0] + a[1] + … + a[i-1]

inclusive_scan : inclusive prefix sum

result[i] = init + a[0] + a[1] + … + a[i]
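The three definitions can be checked against plain sequential reference loops. The `reference_*` functions below are hypothetical helpers written for this example; they spell out the formulas for `int` addition and say nothing about how a parallel implementation would order the additions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// result = init + a[0] + ... + a[N-1]
int reference_reduce(const std::vector<int>& a, int init)
{
    int result = init;
    for (int x : a) result += x;
    return result;
}

// result[i] = init + a[0] + ... + a[i-1]   (a[i] itself is excluded)
std::vector<int> reference_exclusive_scan(const std::vector<int>& a, int init)
{
    std::vector<int> result(a.size());
    int running = init;
    for (std::size_t i = 0; i < a.size(); ++i) {
        result[i] = running;
        running += a[i];
    }
    return result;
}

// result[i] = init + a[0] + ... + a[i]     (a[i] itself is included)
std::vector<int> reference_inclusive_scan(const std::vector<int>& a, int init)
{
    std::vector<int> result(a.size());
    int running = init;
    for (std::size_t i = 0; i < a.size(); ++i) {
        running += a[i];
        result[i] = running;
    }
    // the last inclusive prefix equals the full reduction
    assert(a.empty() || result.back() == reference_reduce(a, init));
    return result;
}
```

For `a = {3, 1, 4, 1}` and `init = 10`, the reduction is 19, the exclusive scan is `{10, 13, 14, 18}` and the inclusive scan is `{13, 14, 18, 19}`.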
No Parallelism
Binary search algorithms
Heap algorithms
Permutation algorithms
Shuffling algorithms
Sequential numeric algorithms
Parallel Exceptions
Parallel algorithms throw one of two:
● If no temporary storage is available, bad_alloc
● exception_list of exceptions thrown by user-provided code
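The idea of reporting several user exceptions at once can be sketched with `std::exception_ptr`. The type `collected_errors` and the function `toy_collecting_for_each` are hypothetical stand-ins written for this example. They are not the TS's `exception_list`, and the choice to keep going after the first throw is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <vector>

// Hypothetical stand-in for exception_list: a bag of captured exceptions.
struct collected_errors {
    std::vector<std::exception_ptr> errors;
    std::size_t size() const { return errors.size(); }
};

struct superstition_error {};

// Toy for_each: keeps going after a throw and reports everything at the end.
template<class Iterator, class Function>
void toy_collecting_for_each(Iterator first, Iterator last, Function f)
{
    collected_errors caught;
    for (; first != last; ++first) {
        try { f(*first); }
        catch (...) { caught.errors.push_back(std::current_exception()); }
    }
    if (caught.size() != 0) throw caught;
}

// Rethrow one captured exception to inspect its type.
bool is_superstition(const std::exception_ptr& e)
{
    assert(e != nullptr);
    try { std::rethrow_exception(e); }
    catch (const superstition_error&) { return true; }
    catch (...) { return false; }
}

std::size_t count_unlucky(const std::vector<int>& data)
{
    try {
        toy_collecting_for_each(data.begin(), data.end(), [](int x) {
            if (x == 13) throw superstition_error();
        });
    } catch (const collected_errors& errors) {
        std::size_t unlucky = 0;
        for (const std::exception_ptr& e : errors.errors)
            if (is_superstition(e)) ++unlucky;
        return unlucky;
    }
    return 0;
}
```

The caller catches the aggregate type rather than `superstition_error` itself, which is the same shift the parallel examples on the following slides make.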
Exceptions Example
struct superstition_error { const char* what() { return "eek"; } };

std::vector<int> data = ...

try
{
for_each(data.begin(), data.end(), [](auto x)
{
if(x == 13)
{
throw superstition_error();
}
});
}
catch(superstition_error& error)
{
std::cerr << error.what() << std::endl;
}
Exceptions Example
struct superstition_error { const char* what() { return "eek"; } };

std::vector<int> data = ...

using namespace std::experimental::parallel;

try
{
for_each(par, data.begin(), data.end(), [](auto x)
{
if(x == 13)
{
throw superstition_error();
}
});
}
catch(exception_list& errors)
{
std::cerr << "Encountered " << errors.size() << " unlucky numbers" << std::endl;

reduce(par, errors.begin(), errors.end(), my_handler());

}
Exceptions Example
struct superstition_error { const char* what() { return "eek"; } };

std::vector<int> data = ...

using namespace std::experimental::parallel;

try
{
for_each(seq, data.begin(), data.end(), [](auto x)
{
if(x == 13)
{
throw superstition_error();
}
});
}
catch(exception_list& errors)
{
std::cerr << "Encountered " << errors.size() << " unlucky numbers" << std::endl;

reduce(seq, errors.begin(), errors.end(), my_handler());

}
Example: Estimating Pi
Example: Estimating Pi
Area of circle = pi * radius^2

Throw darts at unit square

pi ~ 4 * green / total

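Algorithm: Count the number of darts which fall within the quarter circle

[Figure: unit square (side 1 unit) containing a quarter circle; darts inside the quarter circle are green]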
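The factor of 4 in the estimate follows from comparing areas. A short derivation, assuming the darts land uniformly at random in the unit square and the quarter circle has radius 1:

```latex
\[
P(\text{dart lands in quarter circle})
  = \frac{\text{area of quarter circle}}{\text{area of unit square}}
  = \frac{\tfrac{1}{4}\,\pi \cdot 1^{2}}{1^{2}}
  = \frac{\pi}{4}
\]
\[
\frac{\text{green}}{\text{total}} \approx \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{\text{green}}{\text{total}}
\]
```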
Example: Estimating Pi
// burtleburtle.net/bob/hash/integer.html
// (unsigned arithmetic, so that wrap-around is well defined)
unsigned int my_hash(unsigned int a)
{
  a = (a+0x7ed55d16) + (a<<12);
  a = (a^0xc761c23c) ^ (a>>19);
  a = (a+0x165667b1) + (a<<5);
  a = (a+0xd3a2646c) ^ (a<<9);
  a = (a+0xfd7046c5) + (a<<3);
  a = (a^0xb55a4f09) ^ (a>>16);
  return a;
}

// generate a random point and test whether it
// lies in the quarter circle
bool test_quarter_circle(int seed)
{
  // seed an RNG
  std::default_random_engine rng(my_hash(seed));

  // generate numbers uniformly
  // in the unit interval
  std::uniform_real_distribution<float> u01(0,1);

  // generate a point within the unit square
  float x = u01(rng);
  float y = u01(rng);

  // measure distance from the origin
  float dist = std::sqrt(x*x + y*y);

  return dist <= 1.0f;
}
Example: Estimating Pi
// throw 300M darts
auto n = 300 << 20;

// create the integers [0..n)

auto iter = boost::make_counting_iterator(0);

// count the number of points in the quarter circle

using namespace std::experimental::parallel;

auto num_within_quarter_circle =
count_if(par, iter, iter + n, test_quarter_circle);

double pi_estimate = (4.0 * num_within_quarter_circle) / n;
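For readers without Boost or a TS implementation, here is a sequential sketch of the same estimator. It swaps the counting iterator and `count_if(par, ...)` for a plain loop, and uses a much smaller dart count than the slide. The function `estimate_pi` is a hypothetical name for this example. The hash is the one cited on the earlier slide, written on `std::uint32_t` so that overflow is well defined.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Integer hash from burtleburtle.net/bob/hash/integer.html, written on
// unsigned 32-bit values so that wrap-around is well defined.
std::uint32_t my_hash(std::uint32_t a)
{
    a = (a + 0x7ed55d16u) + (a << 12);
    a = (a ^ 0xc761c23cu) ^ (a >> 19);
    a = (a + 0x165667b1u) + (a << 5);
    a = (a + 0xd3a2646cu) ^ (a << 9);
    a = (a + 0xfd7046c5u) + (a << 3);
    a = (a ^ 0xb55a4f09u) ^ (a >> 16);
    return a;
}

// Each dart gets its own RNG seeded from its index, so darts are
// independent of one another and could be evaluated in any order.
bool test_quarter_circle(std::uint32_t seed)
{
    std::default_random_engine rng(my_hash(seed));
    std::uniform_real_distribution<float> u01(0.0f, 1.0f);
    float x = u01(rng);
    float y = u01(rng);
    return x * x + y * y <= 1.0f;   // comparing squared distance avoids sqrt
}

// Sequential stand-in for count_if over the integers [0..n).
double estimate_pi(std::uint32_t n)
{
    assert(n > 0);
    std::uint32_t hits = 0;
    for (std::uint32_t i = 0; i < n; ++i)
        if (test_quarter_circle(i)) ++hits;
    return 4.0 * hits / n;
}
```

Seeding per index rather than sharing one generator is what keeps the predicate free of shared state, which is the property the parallel `count_if` call relies on.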

Performance Portability
                     single CPU thread   8 CPU threads          many GPU threads
platform             Intel i7 860        OpenMP, Intel i7 860   CUDA, NVIDIA Tesla K20
command              time ./pi_seq       time ./pi_par          time ./pi_gpu
pi is approximately  3.14                3.14                   3.14
real                 0m5.097s            0m0.971s               0m0.260s
user                 0m5.098s            0m7.565s               0m0.047s
sys                  0m0.004s            0m0.015s               0m0.188s
speedup (real time)  1x                  5.25x                  19.6x
Future Work
Scheduling?
algo(exec, begin, end, func);
algo needs to compose with scheduling decisions in the
surrounding application
exec specifies how algo is allowed to execute
● specifies what work an implementation is allowed to
create
● does not specify where the work should be executed

Placement is orthogonal
Scheduling?
algo(exec(sched), begin, end, func);
We anticipate extending our execution policies to accept
scheduling requirements as parameters

sched specifies where the work should be executed

Scheduling?
sched could be:
● hard requirement to execute vectorizable work in the
current thread
● number of threads to use
● a thread pool to use
● an executor to use
● which GPU(s) to use
Implementations in progress
github.com/n3554
● based on Thrust

parallelstl.codeplex.com
● based on PPL?

github.com/t-lutz/ParallelSTL
● based on std::thread
Summary
High-level algorithms make parallelism easy
● Portable & Composable
● Concurrency is invisible

Standardization
● On track for C++17
● Experimental Tech Spec in the meantime
● github.com/cplusplus/parallelism-ts
Questions?
jhoberock@nvidia.com
github.com/jaredhoberock
