

Parallelism or Paralysis:
The Essential High Performance Debate

There is an insatiable appetite for performance in capital markets. Not just investment performance, but
computational performance as well, and more so with each passing day. Buy side, sell side,
intermediary or vendor, and no matter the nature of the underlying strategies or mix of services, all of
these firms will compete increasingly on output from high-performance computing (HPC) infrastructure.
The ability to perform increasingly complex calculations on increasingly complex data flows, at higher
update frequencies, is growing in today's pantheon of competitive necessities. Navigating global markets
can no longer be conducted with the low precision of overnight batch runs; it now demands the
enhanced precision of intraday, near-real-time, and real-time calculations.

Parallelism, a topic that will be new to many but one that TABB Group believes will quickly become
part of the common technology vernacular in our business, offers a significant key to HPC challenges. On
the backs of increasingly parallel storage, compute, and network architectures, specifically developed
(and re-developed) software can now deliver performance that is orders of magnitude greater than the
combination of serial software operating on serial hardware.

Fair warning: Parallel programming is hard. It will not be a challenge tackled by all trading firms and
their solution providers, and certainly not for all use cases. There is no known way to decompose and
parallelize many of today's computational challenges. However, among the use cases that do apply,
exceedingly few have been structured to exploit parallelism. This leaves an incredible wealth of
performance capability lying untapped on nearly every computer system. This is a call to first movers
to get up to speed on the competitive advantages of parallelism.

E. Paul Rowady, Jr.
V12:030
June 2014
www.tabbgroup.com

Data and Analytics

Introduction
Riding an unprecedented, global wave of regulations, increased competition, rapid transformation,
and increasingly complex data flows is a growing need for high-performance capabilities. A few
steps back from the bleeding edge of speed being explored by some trading firms exists a spectrum of
computationally intensive use cases that has spawned an ongoing search for new methods and tools to
employ more number-crunching horsepower. In many ways, these challenges, sometimes known as
throughput computing applications, are far more complex than the pure speed challenges. As such, this
is a broad area of development, which we have been calling "Latency 2.0: Bigger Workloads, Faster,"
that is more generally being addressed by high-performance computing (HPC) platforms.

Parallelism is a topic within the overall HPC juggernaut that is gaining both awareness and
deployments because of its potential to respond to some of these performance demands. The
benefits of parallelism can be achieved on multiple and complementary levels, ranging from
storage architectures to compute architectures to network architectures, and they have already
demonstrated potential for dramatic performance gains. Modern server architectures are now
increasingly parallel. Therefore, compute performance is now a function of the level of
parallelism enabled by your software (see Exhibit 1, below).

Exhibit 1
Only Parallel Software + Parallel Hardware = Parallel Performance

Source: TABB Group, Intel

Where graphics processing units (GPUs) were until recently seen as the tool of choice to harvest
the benefits of highly parallel processing for general-purpose applications, new central
processing unit (CPU) architectures are now proving equally powerful, yet at a lower total cost of
ownership (TCO); a more detailed comparison follows.

Computational Targets
Parallelism can significantly boost the performance and throughput of problems that are well
suited for it; that is, large problems that can be decomposed easily into smaller problems that
are then solved in parallel. It is technology's answer to divide et impera: divide and conquer.
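
To make the divide-and-conquer pattern concrete, the sketch below (our illustration, not code from the report or any vendor) recursively splits a large summation in half and solves each half on a separate thread using standard C++; the serial cutoff of 100,000 elements is an arbitrary assumption.

```cpp
// Minimal sketch: divide-and-conquer summation, with each half of the problem
// dispatched to another thread until the pieces are small enough to sum serially.
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

double parallel_sum(const double* data, std::size_t n) {
    // Below this size the threading overhead outweighs the benefit; sum serially.
    if (n < 100000)
        return std::accumulate(data, data + n, 0.0);

    std::size_t half = n / 2;
    // Solve the left half on another thread while this thread handles the right half.
    auto left = std::async(std::launch::async, parallel_sum, data, half);
    double right = parallel_sum(data + half, n - half);
    return left.get() + right;
}

int main() {
    std::vector<double> values(10'000'000, 1.0);
    std::printf("sum = %.0f\n", parallel_sum(values.data(), values.size()));
}
```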

Truth be told, however, parallelism may not be right for all firms, and it definitely is not right for
all use cases. On top of the hardware and code development challenges, there are applications
with no known way to parallelize them today. For instance, cases related to alpha discovery
and capture are particularly difficult to optimize because the out-of-sample behavior is so dynamic.

From a programmer's perspective, parallel programming brings a number of challenges to the
table that don't exist for sequential programming. These include a lack of basic developer tools
(IDEs, compilers, debuggers, etc.), the added complexity of writing thread-safe parallel
algorithms, and exposure to low-level programming languages such as C and C++.

Specific capital markets examples of parallelizable workloads include:

• Derivatives pricing (including swaps) and volatility estimation;
• Portfolio optimizations;
• Credit value adjustment (CVA) and other xVA calculations; and
• Value-at-Risk (VaR), stress tests and other risk analytics.

These examples represent a subset of a much broader spectrum of computational challenges
that generally exhibit the greatest potential for parallelism, including those that use or require:

• Linear Algebra: Used for matrix or vector multiplication; often a fundamental part of
applications in finance, bioinformatics and fluid dynamics.

• Monte Carlo Simulations: Taking a single function, executing it in parallel on
independent data sets, and using the results to glean actionable knowledge; for
example, identifying probabilistic outcomes, including fat-tail events, in a financial
portfolio (see the sketch following this list).

• Fast Fourier Transforms (FFTs): Converting data from the time domain into the frequency
domain, and vice versa; essential for signal processing applications and useful for
options pricing and other financial time series analysis, including applications in high-
frequency trading.

• Image Processing: With each pixel calculated separately, there is ample opportunity
for parallelism; for example, facial recognition software.

• Map/reduce: Taking large datasets and distributing the data filtering, sorting and
aggregation workloads across multiple nodes in a compute cluster; for example,
Google's MapReduce or Hadoop.
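
As a sketch of the Monte Carlo case above, the example below (our illustration; all market inputs are invented values) prices a European call by distributing independent simulation paths across cores with OpenMP. Compile with -fopenmp.

```cpp
// Minimal sketch: parallel Monte Carlo pricing of a European call option.
// Paths are independent, so they can be split across threads with no communication.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <omp.h>

int main() {
    const double S0 = 100.0, K = 105.0, r = 0.01, sigma = 0.2, T = 1.0; // illustrative inputs
    const long paths = 10'000'000;
    double payoff_sum = 0.0;

    // Each thread uses its own generator and accumulates a private sum;
    // the reduction clause combines the partial sums at the end.
    #pragma omp parallel reduction(+:payoff_sum)
    {
        std::mt19937_64 gen(12345 + omp_get_thread_num());
        std::normal_distribution<double> norm(0.0, 1.0);
        #pragma omp for
        for (long i = 0; i < paths; ++i) {
            double z  = norm(gen);
            double ST = S0 * std::exp((r - 0.5 * sigma * sigma) * T + sigma * std::sqrt(T) * z);
            payoff_sum += std::max(ST - K, 0.0);
        }
    }
    std::printf("call price: %.4f\n", std::exp(-r * T) * payoff_sum / paths);
}
```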


To further establish a comparative sense of the challenges that parallelism is most likely to
address, a list of throughput computing kernels and their characteristics can be found in
Exhibit 2, below:


Exhibit 2
Sample List: Throughput Computing Kernels and Applications



Source: TABB Group, Intel


Shifting Hardware Considerations
General-purpose graphics processing units (GPGPUs) have earned a place in the high-
performance spotlight over the past few years because they are specifically designed for
parallel processing and, thus, throughput computing applications. As a result, trading strategy
and risk analytics developers in search of higher performance gravitated to GPUs, as they were
seen at the time as the only option for harvesting the benefits of parallelism. And, for a time, this
may have been a solid assumption. The logic was pretty straightforward, although easier said
than done: Plug a GPU loaded with specifically (and often painstakingly) designed code into an x86-
based server and you're off to the races, so to speak. Of course, studies touting the 100x
performance advantage of GPUs over multi-core CPUs didn't hurt.

Today, this logic no longer holds up nearly as well, and the claims of the early comparative
studies have largely been debunked. Recent studies of the latest CPU architectures (using
processors with and without coprocessors), some of which are highlighted below, show the
same or similar performance benefits as GPUs across a diverse array of throughput kernels,
while also delivering the additional benefits of lower costs, less operational risk, and greater
returns on investment from existing infrastructure and tools.

These advantages are critically important in the current environment. Mounting global markets
headwinds are forcing firms to take a more holistic view of technology costs and deployment
strategies. This is true even in areas where the demands for higher performance are very
strong, such as pricing, valuation and risk analytics. Unfortunately, the luxury of
experimentation, even in areas with strong demand, still comes with boundaries. This makes
the TCO of single-purpose-built technologies like GPUs less compelling than originally envisioned.

Sure, if your firm is firing on all cylinders with GPUs, and has all the talent in place to
extract satisfactory ROI, then it is likely you will want to stick with what you have. But this is
the exception rather than the rule. Most trading firms and most computational challenges (that
would apply in the first place) are still sitting on square one with un-optimized, serial and/or
single-threaded code.

Meanwhile, new developments over the past five years allow similar levels of performance to be
achieved on CPUs. It turns out that x86-based chip architectures may have much more to offer
the growing high-performance crowd, and they represent a better fit for this more-for-less era.

Consider that processing speed is no longer only about fiddling with the balance of frequency
(GHz) and cycles per instruction (CPI). For the past 20 years, increased performance came
mainly from improving CPI and increasing GHz. But that lever is played out; no one is going to
crank out 100 GHz processors anytime soon. The laws of physics get in the way.





Wider Hardware
Now, computers aren't necessarily becoming faster; they are becoming wider. The width of
new computers is a third factor contributing to new levels of performance. So the benefits of
parallelism are accessible not only from multiple cores, but also, independently, from vector
execution within each core using SIMD (single instruction, multiple data) technology. (This
is the difference, by the way, between thread-level parallelism, or TLP, and data-level
parallelism, or DLP.)

Executing SIMD means a single instruction operating on numerous single-precision (SP) or
double-precision (DP) values at the same time. In this sense, processing speed depends
on the number of lanes, a number that has recently been expanding on the back of the
evolution from 128-bit to 256-bit and now up to 512-bit vector registers. (The figures 128, 256
and 512 refer to the number of bits in a vector register. The lanes are the number of data
items that can fit into those registers; a 512-bit register, for example, holds 16 SP or 8 DP lanes.)
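
The sketch below (our illustration) shows data-level parallelism in its rawest form: one 256-bit AVX instruction adds eight single-precision lanes at once, where scalar code would loop over them one at a time. In practice, compilers often generate such instructions automatically from ordinary loops.

```cpp
// Minimal sketch: 256-bit SIMD addition with AVX intrinsics (compile with -mavx).
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // load 8 single-precision lanes
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // one instruction performs 8 additions
    _mm256_store_ps(c, vc);

    for (int i = 0; i < 8; ++i)
        std::printf("%.0f ", c[i]);     // prints 11 22 33 44 55 66 77 88
    std::printf("\n");
}
```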

Another way to think of this is that supercomputers keep getting smaller: on the order of up
to 40 cores in a single 2U chassis. On a walk down a different (memory) lane, achieving that
much compute in a single machine used to require a lot more money, a lot more data center
space, and a lot more power than it does today.

Software Going Wider
With a somewhat broader list of hardware choices in mind, we can now turn to software
development, starting with this simple axiom: If your code isn't parallelized, it won't matter
how many cores you have, how wide the computer is, or which brand of processing
architecture it uses. Applications that have not been created or modified to utilize high degrees
of parallelism, via tasks, threads, vectors and so on, will be severely limited in the benefit
they derive from hardware that is designed to offer high degrees of parallelism.

Code optimization projects can range from minor work to major restructuring to expose and
exploit parallelism through multiple tasks and use of vectors. This is where the more nuanced
choices related to programmer skills, toolsets, hardware design, power utilization and other
factors come into play (see Exhibit 3, below).


Exhibit 3
Comparative Analysis: CPU vs. GPU


Source: TABB Group, Intel

Parallel programming is challenging and requires advanced expertise. You are going to pay the
programmer and software development costs whether you go the CPU or the GPU route. This is
due to increased complexity from the likes of task decomposition, mapping and
synchronization, challenges that don't exist in sequential programming. And because of the
mostly pioneering nature of new parallelization efforts, they can be error-prone. Currently, the
main weapon to combat this complexity is experience.
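
As one small illustration (ours, not the report's) of a synchronization challenge with no sequential equivalent, the sketch below shows two threads incrementing a shared counter: the unsynchronized update silently loses increments, while the atomic version does not.

```cpp
// Minimal sketch: a data race on a plain counter vs. a well-defined atomic counter.
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    long plain = 0;                 // unsynchronized: concurrent updates can be lost
    std::atomic<long> safe{0};      // atomic: every increment is observed

    auto work = [&] {
        for (int i = 0; i < 1'000'000; ++i) {
            ++plain;                // data race (undefined behavior)
            ++safe;                 // correct under concurrency
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();

    // 'plain' typically prints less than 2,000,000; 'safe' always prints exactly 2,000,000.
    std::printf("plain = %ld, atomic = %ld\n", plain, safe.load());
}
```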

But the analysis must also reflect the costs and benefits of multi-use hardware vs. specialized
hardware. If parallelizing your code is challenging to begin with, then parallelizing your code
on specialized hardware is going to be more challenging, more expensive, and simply more
risky from an operational perspective.

Furthermore, and perhaps even more important, there is a critical temporal component in this
decision process as well. It will take time no matter which course you choose. Hardware design,
and the evolution of hardware improvements, comes into play in a big way here: Leveraging the
benefits of GPUs will usually require a total re-write of existing code, deferring any production
benefit until the end of that re-writing process. Depending on numerous factors, but principally
the nature of the problem and the programmer skills at hand, this performance enhancement
process typically is measured in weeks or even several months.

This point is extremely important: Since the advertised 100x (or more) performance
enhancement of GPUs over base-case, un-optimized code will not happen overnight, it may
take more time than originally expected to achieve fully optimized performance gains on
various use cases and on the existing hardware. By comparison, the same or similar
parallel benefits on new CPUs can be achieved incrementally. Even this exercise comes at
some cost: Fully taking advantage of the parallel compute capabilities of modern x86 hardware
requires a solid understanding of low-level programming, machine architecture, and thread
concurrency.
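
A minimal sketch of that incremental path (our illustration) appears below: the same dot-product kernel is shown first in its serial form and then threaded and vectorized by adding a single OpenMP directive, so each step can reach production before the next is attempted. Compile with -fopenmp -O2; if OpenMP is disabled, the directive is ignored and the code remains correct.

```cpp
// Minimal sketch: incremental optimization of one kernel on x86.
#include <cstddef>
#include <cstdio>
#include <vector>

// Step 0: baseline serial code.
double dot_serial(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
    return sum;
}

// Step 1: the same loop, threaded across cores and vectorized within each core.
double dot_parallel(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
    return sum;
}

int main() {
    std::vector<double> x(1'000'000, 0.5), y(1'000'000, 2.0);
    std::printf("serial = %.1f, parallel = %.1f\n", dot_serial(x, y), dot_parallel(x, y));
}
```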

Consider an experiment in which the performance of parallelized and serial code is compared
across the latest six vintages of Intel Xeon platforms, including the version due later in 2014
(see Exhibit 4, below).


Exhibit 4
Incremental Performance Improvement of CPUs


Source: TABB Group, Intel

This experiment shows that the average performance improvement across these six platforms
and nine use cases (or kernels) is more than 86x, with the latest platform achieving a peak
improvement of 375x on the single-precision Monte Carlo kernel.

Now, when we place these results in the context of additional studies, including one from
2010 [1] that specifically compares the performance of GPUs and CPUs and finds an average GPU
performance advantage of 2.5x, the evidence for equivalence, given the CPU architecture
improvements since 2010, becomes even more compelling.

This improving performance trajectory of CPUs for parallel computing is further supported by a
very recent study (May 2014) conducted by the Securities Technology Analysis Center (STAC) [2]
that yielded the following highlights, among others. (STAC-A2 is a technology-neutral
benchmark suite developed by bank quants to represent a non-trivial calculation typical in
computational finance: calculating Greeks for multi-asset American options using modern
methods.)

• In the end-to-end Greeks benchmark (STAC-A2.2.GREEKS.TIME), this system was:

  o The fastest of any system published to date (cold, warm, and mean results);
  o 34% faster than the average speed of the next fastest system, which used GPUs (SUT ID: NVDA131118); and
  o More than 9x the average speed of the previous Intel 4-socket system tested (SUT ID: INTC130607b).

• In the capacity benchmarks (STAC-A2.2.GREEKS.MAX_ASSETS and STAC-A2.2.GREEKS.MAX_PATHS), this system handled:

  o The most assets of any system;
  o Over 63% more assets than the next best system, which used GPUs (SUT ID: NVDA131118);
  o The most paths of any system; and
  o More than 58% more paths than the next best system, which used GPUs (SUT ID: NVDA131118).

[1] "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," Intel, June 2010.

[2] STAC-A2 Benchmark tests on a stack consisting of the STAC-A2 Pack for Intel Composer XE (Rev C) with Intel MKL 11.1, Intel Compiler XE 14, and Intel Threading Building Blocks 4.2 on an Intel White Box using 4 x Intel Xeon E7-4890 v2 @ 2.80 GHz (Ivy Bridge EP) processors with 1TB RAM and Red Hat Enterprise Linux 6.5 (SUT ID: INTC140509).

Of course, there is plenty of art to go with this science, since the aggregate community
knowledge base that would normally support such efforts is still in a formative stage, and
there is no way of knowing when the maximum performance level has been reached (outside
of the growing archive of benchmarks such as those referenced above). It may take months of
trial and error to achieve an initial 10x performance enhancement over baseline sequential
code with either architecture.

In the meantime, there will be no production code with GPUs, particularly if you are just
getting started on the parallelism journey. However, by using the tools and building blocks you
are already more familiar with in x86 architectures (versus something completely unknown), you
can more easily start the optimization effort with the expectation that, within a reasonable
amount of time, an incremental increase in performance can be deployed in production while
you pursue additional incremental improvements on the research bench.


Conclusion
Parallel programming will not always be as challenging as it is today. It is early; support from
communities, libraries and tools will grow. As such, your firm's software parallelization strategy
(and it will need one if it does not already have one) is critical to future success. Most
trading firms and their solution partners will need to go down the path of parallelizing their code
for certain use cases sooner or later, particularly if you believe (as we do) that there will be
increasing demand for higher-performance computing applications along the road ahead.

Two factors support this claim. No. 1, the most relevant applications in capital markets, such
as option pricing and certain risk analytics, have not been structured to exploit parallelism.
This leaves an incredible wealth of capability lying untapped on nearly every computer
system. No. 2, as the spectrum of high-performance applications for known use cases expands,
and as the community, libraries and other knowledge base of tools accumulate as well, TABB
believes that new and previously inconceivable use cases will become conceivable. You can't
see these use cases from where you are today. Being on the HPC/parallelism journey is the
only way to unleash this new level of creativity.

With that in mind, parallelism is a choice that all capital markets firms need to explore.
Today, however, there is more than one way to get there. And even if you move to GPUs later, you
should explore on x86 first to test performance improvements on existing hardware.

Consider this: Applications that show positive results with GPUs should always benefit from
CPUs (stand-alone or with coprocessors), because the same fundamentals of vectorization
(another term for parallelization) must be present. However, the opposite is never true: the
flexibility of CPUs includes support for applications that cannot run on GPUs. This is the main
reason that a system built with CPUs will have broader applicability than a system using
GPUs.

Ask yourself: Is my code even taking advantage of the hardware I already have? Chances are your
answer is "no." This needs to change.


About
TABB Group
TABB Group is a financial markets research and strategic advisory firm focused exclusively on
capital markets. Founded in 2003 and based on the methodology of first-person knowledge,
TABB Group analyzes and quantifies the investing value chain, from the fiduciary and
investment manager, to the broker, exchange and custodian. Our goal is to help senior
business leaders gain a truer understanding of financial markets issues and trends so they can
grow their businesses. TABB Group members are regularly cited in the press and speak at
industry conferences. For more information about TABB Group, visit www.tabbgroup.com.

The Author
E. Paul Rowady, Jr.
Paul Rowady, Principal and Director of Data and Analytics Research, joined TABB Group in
2009. He has more than 24 years of capital markets, proprietary trading and hedge fund
experience, with a background in strategy research, risk analytics and technology
development. Paul also has specific expertise in derivatives, highly automated trading
systems, and numerous data management initiatives. He is a featured speaker at capital
markets, data and technology events; is regularly quoted in national, financial and industry
media; and has provided live and taped commentary for CNBC, National Public Radio, and
client media channels. With TABB, Paul's research and consulting focus ranges from market
data, risk analytics, high performance computing, social media impacts and data visualization
to OTC derivatives reform, clearing and collateral management; it includes authorship of
reports such as "Faster to Smarter: The Single Global Markets Megatrend," "Fixed Income
Market Data: Growth of Context and the Rate of Triangulation," "Patterns in the Words: The
Evolution of Machine-Readable Data," "Enhanced Risk Discovery: Exploration into the
Unknown," "The Risk Analytics Library: Time for a Single Source of Truth," "The New Global
Risk Transfer Market: Transformation and the Status Quo," "Real-Time Market Data: Circus of
the Absurd," and "Quantitative Research: The World After High-Speed Saturation." Paul earned
a Master of Management from the J. L. Kellogg Graduate School of Management at
Northwestern University and a B.S. in Business Administration from Valparaiso University. He
was also awarded a patent related to data visualization for trading applications in 2008.









www.tabbgroup.com

New York
+1.646.722.7800

Westborough, MA
+1.508.836.2031

London
+44 (0) 203 207 9027
