
COMSATS University Islamabad

Parallel Processing (EC1713)

Lecture 08

Dept. of Electrical Engineering


Fall 2018

Instructor: Dr. Omair Inam

Outline
 Parallel Processor
 GPUs
 A brief history of GPUs
 Early limitations in GPGPU
 CUDA: a step towards GPGPU
 Applications of CUDA
 Introduction to CUDA

Parallel Processors
 Why multi-core processors?
» Since 2003, the energy consumption and heat-dissipation issues have limited the
increase of the clock frequency and the level of productive activities that can be
performed in each clock period within a single CPU.
» Virtually all microprocessor vendors have switched to models where multiple
processing units, referred to as processor cores, are used in each chip to increase
the processing power.
» This switch has exerted a tremendous impact on the software developer
community.
» Traditionally, the vast majority of software applications are written as sequential
programs, as described by von Neumann [1945] in his seminal report. The
execution of these programs can be understood by a human sequentially stepping
through the code.

Parallel Processors
 Concurrency revolution
» Instead, the applications that will continue to enjoy performance
improvement with each new generation of microprocessors will be parallel
programs, in which multiple threads of execution cooperate to complete the
work faster. This new, dramatically escalated incentive for parallel program
development has been referred to as the concurrency revolution [Sutter 2005].
» The practice of parallel programming is by no means new. The high-
performance computing community has been developing parallel programs for
decades. These programs run on large-scale, expensive computers. Only a few
elite applications can justify the use of these expensive computers, thus limiting
the practice of parallel programming to a small number of application
developers. Now that all new microprocessors are parallel computers, the
number of applications that must be developed as parallel programs has
increased dramatically. There is now a great need for software developers to
learn about parallel programming, which is the focus of this topic.

Parallel Processors
 GPUs as Parallel Computers
» Since 2003, the semiconductor industry has settled on two main
trajectories for designing microprocessors:
1. Multi-core
2. Many-core

 Multi-core: seeks to maintain the execution speed of sequential programs
while moving to multiple cores. Multi-core processors began as two-core
processors, with the number of cores approximately doubling with each
semiconductor process generation. A current example is the Intel Core i9
X-series, which has up to 16 processing cores and supports hyper-threading
to maximize the execution speed of sequential programs.

Parallel Processors
 Many-core: the many-core trajectory focuses more on the execution
throughput of parallel applications. For example, the NVIDIA GeForce GTX
TITAN X graphics processing unit (GPU) has 3072 cores, each of which is
heavily multithreaded.
 Many-core processors, especially GPUs, have led the race in floating-
point performance since 2003.

Parallel Processors
 Why is there such a large performance gap between many-core
GPUs and general-purpose multi-core CPUs?
» The answer lies in the fundamental design philosophies of the two types of
processors: CPU design devotes much of the chip area to control logic and
caches, whereas GPU design devotes it to arithmetic units.

Parallel Processors
 Constraints (Multi-core CPUs)
» The design of a CPU is optimized for sequential code performance.
» It makes use of sophisticated control logic to allow instructions from a
single thread of execution to execute in parallel or even out of their
sequential order while maintaining the appearance of sequential
execution.
» More importantly, large cache memories are provided to reduce the
instruction and data access latencies of large complex applications.
» Memory bandwidth is another important issue. Graphics chips have been
operating at approximately 10 times the bandwidth of
contemporaneously available CPU chips.

Parallel Processors
 GPU (the idea behind its evolution)
» A GPU is capable of moving data at about 336 gigabytes per second (GB/s)
in and out of its main dynamic random access memory (DRAM).
» Microprocessor system memory bandwidth will probably not grow beyond
50 GB/s for about three years, so CPUs will continue to be at a
disadvantage in terms of memory bandwidth for some time.

Why use the GPU for computing?

GPUs
 Architecture (GPU)



GPUs
 It is organized into an array of highly threaded streaming multiprocessors
(SMs).
 (Figure in previous slide) two SMs form a building block; however, the number
of SMs in a building block can vary from one generation of CUDA GPUs to
another generation.
 Also, each SM in Figure 1.3 has a number of streaming processors (SPs) that
share control logic and instruction cache.
 Each GPU currently comes with up to 8 gigabytes of graphics double data rate
(GDDR) DRAM, referred to as global memory in Figure 1.3. These GDDR
DRAMs differ from the system DRAMs on the CPU motherboard in that they
are essentially the frame buffer memory that is used for graphics.
 For graphics applications, they hold video images, and texture information for
three-dimensional (3D) rendering, but for computing they function as very-
high-bandwidth, off-chip memory, though with somewhat more latency than
typical system memory. For massively parallel applications, the higher
bandwidth makes up for the longer latency.
GPUs
 The massively parallel G80 chip has 128 SPs (16 SMs, each with 8 SPs).
Each SP has a multiply–add (MAD) unit and an additional multiply unit.
 With 128 SPs, that’s a total of over 500 gigaflops (a rough estimate follows this list).
 In addition, special function units perform floating-point functions such as
square root (SQRT).
 With 240 SPs, the GT200 exceeds 1 teraflops.
 Because each SP is massively threaded, it can run thousands of threads per
application. A good application typically runs 5000–12,000 threads
simultaneously on this chip. For those who are used to simultaneous
multithreading, note that Intel CPUs support 2 to 4 threads per core,
depending on the model.
 The G80 chip supports up to 768 threads per SM, which sums up to about
12,000 threads for this chip.
 GT200 supports 1024 threads per SM, which sums up to about 30,000 threads for the chip.
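
As a rough check of the gigaflops figure above (assuming the commonly cited
1.35 GHz G80 shader clock, and counting the MAD unit as 2 FLOPs per cycle
plus 1 FLOP for the extra multiply unit):

128 SPs × 3 FLOPs/cycle × 1.35 GHz ≈ 518 GFLOPS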

A brief history of GPUs
 In the late 1980s and early 1990s, the growth in popularity of graphically
driven operating systems such as Microsoft Windows helped create a market
for a new type of processor.
 In the early 1990s, users began purchasing 2D display accelerators for their
personal computers.
 In 1992, Silicon Graphics opened the programming interface to its hardware
by releasing the OpenGL library.
 Silicon Graphics intended OpenGL to be used as a standardized, platform-
independent method for writing 3D graphics applications.
 In 1999, the release of NVIDIA’s GeForce 256 further pushed the capabilities
of consumer graphics hardware. For the first time, transform and lighting
computations could be performed directly on the graphics processor, thereby
enhancing the potential for even more visually interesting applications.

Early GPU Computing
 The release of GPUs that possessed programmable pipelines attracted many
researchers to the possibility of using graphics hardware for more than
simply OpenGL- or DirectX-based rendering.
 The GPUs of the early 2000s were designed to produce a color for every
pixel on the screen using programmable arithmetic units known as pixel
shaders.
 In general, a pixel shader uses its (x,y) position on the screen as well as
some additional information to combine various inputs in computing a final
color.
 The additional information could be input colors, texture coordinates, or
other attributes that would be passed to the shader when it ran. But because
the arithmetic being performed on the input colors and textures was
completely controlled by the programmer, researchers observed that these
input “colors” could actually be any data.

Early limitations in GPGPU
 The programming model was still far too restrictive for any critical mass of
developers to form.
 There were tight resource constraints, since programs could receive input
data only from a handful of input colors and a handful of texture units.
 There were serious limitations on how and where the programmer could
write results to memory, so algorithms requiring the ability to write to
arbitrary locations in memory (scatter) could not run on a GPU.
 Moreover, it was nearly impossible to predict how a particular GPU
would deal with floating-point data.
 Finally, when a program inevitably computed incorrect results, failed
to terminate, or simply hung the machine, there existed no reasonably good
method to debug code being executed on the GPU.

CUDA: A Step Toward GPGPU
 CUDA
» In November 2006, NVIDIA unveiled the GeForce 8800 GTX. The GeForce
8800 GTX was the first GPU to be built with NVIDIA’s CUDA Architecture.
This architecture included several new components designed strictly for GPU
computing and aimed to alleviate many of the limitations that prevented
previous graphics processors from being legitimately useful for general-purpose
computation.
» The CUDA Architecture included a unified shader pipeline, allowing each and
every arithmetic logic unit (ALU) on the chip to be marshaled by a program
intending to perform general-purpose computations.
» ALUs were built to comply with IEEE requirements for single-precision
floating-point arithmetic and were designed to use an instruction set tailored for
general computation rather than specifically for graphics.
» The execution units on the GPU were allowed arbitrary read and write access
to memory, as well as access to a software-managed cache known as shared
memory (a brief sketch follows).
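
A minimal sketch of the shared-memory idea (hypothetical kernel, assuming a
launch with 256 threads per block; not NVIDIA's own example):

// __shared__ declares an on-chip, software-managed buffer visible to all
// threads in the same block: the programmer-controlled cache noted above.
__global__ void blockSumKernel(const float *in, float *out) {
    __shared__ float tile[256];              // software-managed cache
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];               // stage data in shared memory
    __syncthreads();                         // wait until the tile is filled
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int k = 0; k < blockDim.x; ++k)
            s += tile[k];
        out[blockIdx.x] = s;                 // one partial sum per block
    }
}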

Applications of CUDA
 Medical Imaging
 Computational Fluid Dynamics
 Environmental Sciences

Getting Started

Introduction to CUDA
 NVIDIA introduced CUDA in 2006
 A programming model for GPUs (C/C++)
 To a CUDA programmer, the computing system consists of a host, which is
a traditional central processing unit (CPU), such as an Intel architecture
microprocessor in personal computers today, and one or more devices,
which are massively parallel processors equipped with a large number of
arithmetic execution units.
 In modern software applications, program sections often exhibit a rich
amount of data parallelism, a property allowing many arithmetic operations
to be safely performed on program data structures in a simultaneous manner.
 The CUDA devices accelerate the execution of these applications by
harvesting a large amount of data parallelism, as the sketch below illustrates.
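
A hedged sketch of what such data parallelism looks like (hypothetical vecAdd
function): each loop iteration below touches only element i, so all n additions
are independent and could safely run at the same time.

// Sequential C version of vector addition. Every iteration is independent
// of the others; this is the data parallelism a CUDA device can exploit.
void vecAdd(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; ++i)
        C[i] = A[i] + B[i];
}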

Introduction to CUDA
 Host refers to the CPU and its memory.
 Device refers to the GPU and its memory.
 A CUDA program consists of one or more phases that are executed on either
the host (CPU) or a device such as a GPU.
 The phases that exhibit little or no data parallelism are implemented in host
code.
 The phases that exhibit a rich amount of data parallelism are implemented in
the device code.
 The host code is straight ANSI C code; it is further compiled with the host’s
standard C compilers and runs as an ordinary CPU process.
 The device code is written using ANSI C extended with keywords for
labeling data-parallel functions, called kernels, and their associated data
structures (see the sketch below).
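
A minimal sketch of such a kernel (hypothetical names; __global__ is the CUDA
keyword that labels a data-parallel function):

// Device code: __global__ marks this function as a kernel that runs on the
// GPU; each thread computes one element of the output array.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard extra threads
        C[i] = A[i] + B[i];
}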

Introduction to CUDA
 CUDA assumes that the device is a co-processor to the host, where both
the host and the device have their own separate, dedicated memories in the
form of DRAM.
 In this relationship, the host runs the main program and sends instructions
to the device by invoking programs, called kernels, that run parallel tasks
on the GPU. The GPU can respond to CPU requests to send or receive data from
host to device or from device to host. For a given heterogeneous system with
a host and a device, a typical sequence of operations for a CUDA program is:
1. The host allocates storage on the device using cudaMalloc().
2. The host copies the input data to the device using cudaMemcpy().
3. The host launches kernel(s) on the device to process the data (transferred
to the device).
4. The host copies the results back from the device using cudaMemcpy().
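
A minimal end-to-end sketch of these four steps (hypothetical squareKernel;
error checking omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread squares one element in place.
__global__ void squareKernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= d_data[i];
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);                        // 1. allocate device storage
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // 2. copy input host -> device
    squareKernel<<<(n + 255) / 256, 256>>>(d_data, n);          // 3. launch kernel on device
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // 4. copy results device -> host
    cudaFree(d_data);

    printf("h_data[3] = %.1f\n", h_data[3]);                    // expect 9.0
    return 0;
}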

Introduction to CUDA
 In the CUDA programming model, the specific hardware architecture is
identified by its compute capability. The compute capability of a device is
represented by a version number that identifies the features supported by
the GPU hardware.
 The compute capability version consists of a major and minor revision
number.
 The major revision number is 5 for devices based on the Maxwell
architecture, 3 for Kepler, 2 for Fermi, and 1 for devices based on the
Tesla architecture.
 The minor revision numbers correspond to incremental improvements in the
core architecture.
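
A small sketch of how a program can read a device's compute capability at run
time (standard CUDA runtime API calls):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // prop.major / prop.minor hold the compute capability version,
        // e.g. 5.x for Maxwell, 3.x for Kepler, 2.x for Fermi, 1.x for Tesla.
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}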



Questions?
