
Data-Level Parallelism in Vector, SIMD, and GPU Architectures

1 We call these algorithms data parallel algorithms because their parallelism comes from simultaneous
operations across large sets of data, rather than from multiple threads of control.

Introduction:

A question for the single instruction, multiple data (SIMD) architecture, which Chapter 1 introduced, has always been just how wide a set of applications has significant data-level parallelism (DLP). Fifty years later, the answer includes not only the matrix-oriented computations of scientific computing but also the media-oriented processing of images and sound. Moreover, since a single instruction can launch many data operations, SIMD is potentially more energy efficient than multiple instruction, multiple data (MIMD), which must fetch and execute one instruction per data operation. These two answers make SIMD attractive for personal mobile devices. Finally, perhaps the biggest advantage of SIMD over MIMD is that the programmer continues to think sequentially yet achieves parallel speedup by having parallel data operations.

This chapter covers three variations of SIMD: vector architectures, multimedia SIMD instruction set extensions, and graphics processing units (GPUs).1 The first variation, which predates the other two by more than 30 years, means essentially pipelined execution of many data operations. These vector architectures are easier to understand and to compile to than other SIMD variations, but they were considered too expensive for microprocessors until recently. Part of that expense was in transistors, and part was in the cost of sufficient DRAM bandwidth, given the widespread reliance on caches to meet memory performance demands on conventional microprocessors.

The second SIMD variation borrows the SIMD name to mean basically simultaneous parallel data operations, and it is found in most instruction set architectures today that support multimedia applications. For x86 architectures, the SIMD instruction extensions started with MMX (Multimedia Extensions) in 1996, were followed by several SSE (Streaming SIMD Extensions) versions in the next decade, and continue to this day with AVX (Advanced Vector Extensions). To get the highest computation rate from an x86 computer, you often need to use these SIMD instructions, especially for floating-point programs.

The third variation on SIMD comes from the GPU community, offering higher potential performance than is found in traditional multicore computers today. While GPUs share features with vector architectures, they have their own distinguishing characteristics, due in part to the ecosystem in which they evolved. This environment has a system processor and system memory in addition to the GPU and its graphics memory. In fact, to recognize those distinctions, the GPU community refers to this type of architecture as heterogeneous.

Thread-Level Parallelism
The turning away from the conventional organization came in the middle 1960s, when the law of diminishing returns began to take effect in the effort to increase the operational speed of a computer. . . . Electronic circuits are ultimately limited in their speed of operation by the speed of light . . . and many of the circuits were already operating in the nanosecond range.

Introduction:

As the quotations that open this chapter show, the view that advances in uniprocessor architecture
were nearing an end has been held by some researchers for many years. Clearly, these views were
premature; in fact, during the period of 1986–2003, uniprocessor performance growth, driven by the
microprocessor, was at its highest rate since the first transistorized computers in the late 1950s and
early 1960s. Nonetheless, the importance of multiprocessors was growing throughout the 1990s as
designers sought a way to build servers and supercomputers that achieved higher performance than a
single microprocessor, while exploiting the tremendous cost-performance advantages of commodity
microprocessors. As we discussed in Chapters 1 and 3, the slowdown in uniprocessor performance arising from diminishing returns in exploiting instruction-level parallelism (ILP), combined with growing concern over power, is leading to a new era in computer architecture, an era in which multiprocessors play a major role from the low end to the high end. The second quotation captures this clear inflection point. This increased importance of multiprocessing reflects several major factors:

■ The dramatically lower efficiencies in silicon and energy use encountered between 2000 and 2005 as designers attempted to find and exploit more ILP; power and silicon costs grew faster than performance. Other than ILP, the only scalable and general-purpose way we know to increase performance faster than the basic technology allows (from a switching perspective) is through multiprocessing.

■ A growing interest in high-end servers as cloud computing and software-as-a-service become more important.

■ A growth in data-intensive applications driven by the availability of massive amounts of data on the
Internet.

■ The insight that increasing performance on the desktop is less important (outside of graphics, at least),
either because current performance is acceptable or because highly compute- and data-intensive
applications are being done in the cloud.

■ An improved understanding of how to use multiprocessors effectively, especially in server environments where there is significant natural parallelism, arising from large datasets, from scientific codes, or from large numbers of independent requests (request-level parallelism).

■ The advantages of leveraging a design investment by replication rather than unique design; all
multiprocessor designs provide such leverage.
