
KU LEUVEN

FACULTY OF ENGINEERING TECHNOLOGY

CAMPUS GEEL

GPU

Khoruzhenko Olha

RADMEP

21.05.2023
GPU

A GPU, as the name suggests, is a chip specialized for graphics: it handles the processing of everything that appears on the screen. Since it has to operate on a large amount of data in parallel (each operation is applied to every pixel of the image), a GPU has many more cores than a CPU (Fig. 1 (a)). The goal of a CPU is to finish an individual task as quickly as possible while retaining the ability to switch between different operations, whereas a GPU is built to push through the maximum number of tasks at once. In other words, a CPU is latency optimized and a GPU is throughput optimized [1].

Fig. 1 a) Comparison of a CPU and a GPU architecture [2]; b) GPU architecture [1]

Let’s dive into the GPU’s architecture. It contains multiple processor clusters made up of streaming multiprocessors (SMs), each of which has a cache layer and a set of associated cores (Fig. 1 (b)) [1]. Compared to CPUs, GPUs have smaller and fewer memory cache layers, because more of their transistors are devoted specifically to computation, so the time needed to retrieve data from memory is less critical. This latency is masked as long as the GPU is kept busy with enough computations. To give an idea of the level of parallelism possible, we can look at the number of cores: NVIDIA’s Tesla V100, for example, consists of 80 streaming multiprocessors with 64 cores each, which makes 5120 cores in total.
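
As a small hedged illustration (not part of the cited sources), the SM count of the installed device can be queried through the CUDA runtime; the 64-cores-per-SM figure below is the Volta/V100 value mentioned above and would differ on other architectures.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // properties of GPU 0

    // 64 CUDA cores per SM is the Volta (V100) figure from the text;
    // other GPU generations have different per-SM core counts.
    const int coresPerSM = 64;

    printf("Streaming multiprocessors : %d\n", prop.multiProcessorCount);
    printf("Approximate CUDA cores    : %d\n", prop.multiProcessorCount * coresPerSM);
    return 0;
}
```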

To see where a GPU fits in the usual computer architecture categorization, let’s summarize its goal: for graphics, the same mathematical function is run again and again on all the pixels of an image. A single instruction is therefore applied to multiple data elements, which makes the GPU a SIMD machine [3]. This programming model enables serious acceleration of a large number of applications. For example, if we want to scale an image, each core takes care of one pixel and scales it completely in parallel with the others. Whereas a sequential machine would need n clock cycles to process n pixels, SIMD needs only one (assuming there are enough cores to cover the whole computational load). In fact, the task does not even have to be inherently parallel for the GPU; it only has to match the SIMD scheme of computation, i.e. it must be decomposable into the same operation repeated on different data at each moment in time.
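
To make the SIMD picture concrete, here is a minimal CUDA sketch of the image-scaling example, assuming the pixel values are stored as floats on the device and one thread handles one pixel (the kernel and function names are illustrative):

```cuda
#include <cuda_runtime.h>

// One thread scales one pixel: the same instruction (a multiply)
// is applied to many data elements in parallel (SIMD/SIMT style).
__global__ void scalePixels(float* pixels, float factor, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global pixel index
    if (i < numPixels)                               // guard for the last block
        pixels[i] *= factor;
}

// Host-side launch: enough blocks of 256 threads to cover every pixel.
// d_pixels is assumed to be a device pointer to the image data.
void scaleImage(float* d_pixels, float factor, int numPixels) {
    int threads = 256;
    int blocks  = (numPixels + threads - 1) / threads;
    scalePixels<<<blocks, threads>>>(d_pixels, factor, numPixels);
}
```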

NVIDIA’s Compute Unified Device Architecture (CUDA) parallel computing platform provides an application programming interface that allows the GPU’s resources to be used efficiently without requiring special graphics programming knowledge [3]. The basic components of the CUDA hierarchy are threads, thread blocks and kernel grids. A thread, executed on a CUDA core, is a parallel worker whose main task is computing floating-point math operations. All the data processed by a GPU goes through threads, and each thread has its own registers that are unavailable to the other CUDA cores. A thread block is a set of threads that are executed together; this logical grouping allows efficient mapping of data, and a typical block size is 1024 threads. All threads in the same block have access to the same shared memory. A kernel grid is the next level of abstraction: it is the grouping of all the blocks launched for the same kernel, but since different thread blocks do not share the same shared memory, synchronization works differently than at the block level. CUDA abstractions thus enable fine-grained data and thread parallelism nested within coarse-grained parallelism, so the developer can divide the problem into coarse sub-problems to be solved independently, and those into fine ones to be solved cooperatively in parallel [4].
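
The hierarchy described above maps directly onto the built-in index variables visible inside a kernel. The following small sketch (arbitrary example launch sizes, not from the cited sources) shows how each thread combines its block index and thread index into a unique global index:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoAmI() {
    // threadIdx : position of this thread inside its block
    // blockIdx  : position of this block inside the kernel grid
    // blockDim  : number of threads per block (here 1024)
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId < 5)    // print only a few threads to keep the output readable
        printf("block %d, thread %d -> global id %d\n",
               blockIdx.x, threadIdx.x, globalId);
}

int main() {
    whoAmI<<<4, 1024>>>();      // kernel grid of 4 blocks, 1024 threads each
    cudaDeviceSynchronize();    // wait for the GPU and flush its printf output
    return 0;
}
```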

Let’s take a look at the memory hierarchy, which follows logically from the architectural hierarchy of CUDA (Fig. 1 (b)) [1]. The lowest level is the registers, which are allocated to individual cores; since this is private on-chip memory, register data can be accessed faster than any other. Read-only memory is on-chip memory of the streaming multiprocessors used for particular tasks, such as texture memory, which can be accessed via the texture functions of CUDA [3]. Within a thread block the on-chip memory consists of the layer 1 (L1) cache and the shared memory, where the latter is controlled by software and the former by hardware. The next level is the layer 2 (L2) cache, which can be accessed by all threads in all thread blocks and caches both global and local memory. At the top there is the global memory, which resides in the DRAM of the device and is comparable to a CPU’s RAM. Retrieving data from each successive level of the hierarchy naturally gets slower.
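
As a hedged illustration of the fast on-chip levels, the sketch below sums the elements assigned to one thread block: per-thread values live in registers, the per-block staging area is declared with __shared__, and __syncthreads() provides the block-level synchronization mentioned earlier (illustrative kernel, assumes a launch with 256 threads per block):

```cuda
#include <cuda_runtime.h>

// Each block reduces 256 consecutive input elements into one partial sum.
// Launch with exactly 256 threads per block to match the tile size.
__global__ void blockSum(const float* in, float* partialSums, int n) {
    __shared__ float tile[256];                     // shared memory: visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // local variables live in registers
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // stage data from global memory
    __syncthreads();                                // block-level barrier

    // Tree reduction entirely inside shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)                           // thread 0 writes the block's result
        partialSums[blockIdx.x] = tile[0];
}
```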

The memory bandwidth of a GPU is the maximum amount of data that can be handled by its memory bus per unit of time, so it characterizes how quickly the GPU’s framebuffer can be read and used. Modern GPUs can transfer on the order of hundreds of gigabytes per second [5]. If it is too low, memory bandwidth becomes a system bottleneck, because the GPU’s numerous cores sit idle while waiting for the memory to respond. For example, if the GPU reuses each data block n times, the external peripheral component interconnect only needs about 1/n of the GPU’s internal bandwidth [6].
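
The theoretical peak can be estimated from the device properties reported by the CUDA runtime; the sketch below assumes double-data-rate memory (two transfers per clock), which holds for the GDDR and HBM memories used on current GPUs:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is reported in kHz, memoryBusWidth in bits.
    // Factor 2: double data rate (two transfers per memory clock).
    double peakGBps = 2.0 * prop.memoryClockRate * 1e3       // transfers per second
                          * (prop.memoryBusWidth / 8.0)      // bytes per transfer
                          / 1e9;                             // -> GB/s

    printf("Theoretical peak memory bandwidth: %.1f GB/s\n", peakGBps);
    return 0;
}
```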

Scatter/gather engines (part of the memory management unit) play a crucial role in GPUs for efficient memory access and data movement [7]. For instance, they translate virtual memory addresses into physical ones, ensuring correct memory access; they coalesce the memory accesses of neighboring threads into single transactions, reducing latency and maximizing throughput; they enable efficient movement of data to and from scattered memory locations, since GPUs quite frequently process data in irregular memory regions; and they additionally support caching mechanisms, which store frequently accessed memory regions, and prefetching techniques, which anticipate future accesses and fetch data in advance.
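
The coalescing behaviour described above is easy to see at the kernel level. The following illustrative sketch (not taken from the cited source) contrasts an access pattern the hardware can merge into a few transactions with a strided pattern that it cannot:

```cuda
#include <cuda_runtime.h>

// Coalesced: neighbouring threads read neighbouring addresses, so the
// scatter/gather hardware can merge a warp's loads into few transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads touch addresses far apart, forcing many
// separate memory transactions and wasting bandwidth.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```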

GPUs were specifically designed to complement CPUs, so their collaboration increases data throughput and the amount of simultaneous computation within an application [8]. In short, the CPU runs the main program and the GPU handles the repetitive computations: the CPU coordinates the overall computation, while the GPU performs the more specialized, fine-grained work. When it comes to GPU programming itself, the choice depends on the specific problem to be solved, because CPUs and GPUs each have their own areas of excellence and their own limitations. If one works with a large amount of data that can be processed in parallel, GPU computing will save a lot of time and resources. To make graphics hardware usable for general computational problems, several programming frameworks were developed (such as CUDA, OpenCL and OpenACC), so the programmer can focus on high-level computing concepts. OpenCL and OpenACC both support C/C++ and aim to simplify parallel programming of heterogeneous platforms combining GPUs and CPUs with much less programming effort than a low-level model would require [9]. The limitations of the GPU concern problems that are either too small or too unpredictable: the former lack the parallelism needed to use the GPU effectively, and the latter have too many significant branches, which prevent efficient data streaming from GPU memory to the cores and can break the SIMD principle (examples include sparse linear algebra, small signal-processing tasks, sorting and search).
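
As a hedged end-to-end sketch of this division of labour (an illustrative vector addition, not tied to any of the cited code): the CPU allocates the data, copies it to the device and chooses the launch configuration, while the GPU performs the repetitive per-element work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addVectors(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];        // the GPU does the repetitive work
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // CPU side: prepare the data and coordinate the overall computation.
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // GPU side: one thread per element, launched by the CPU.
    int threads = 256, blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f\n", h_c[0]);      // expected 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```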
REFERENCES

1) Hagoort, Niels. "Exploring the GPU Architecture." VMware, 29 Oct. 2020, core.vmware.com/resource/exploring-GPU-architecture. Accessed 20 May 2023.
2) Vitality Learning. "Understanding the Architecture of a GPU." Medium, 25 Mar. 2021, medium.com/codex/understanding-the-architecture-of-a-gpu-d5d2d2e8978b. Accessed 20 May 2023.
3) Levinas, Mantas. "Everything You Need to Know About GPU Architecture and How It Has Evolved." Cherry Servers, 23 Mar. 2021, www.cherryservers.com/blog/everything-you-need-to-know-about-gpu-architecture. Accessed 20 May 2023.
4) NYU. "Introduction to GPUs. CUDA." GitHub, 1 Jan. 2017, nyu-cds.github.io/python-gpu/02-cuda/. Accessed 20 May 2023.
5) Lheureux, Adil. "Computing GPU Memory Bandwidth with Deep Learning Benchmarks." Paperspace, 1 May 2022, blog.paperspace.com/understanding-memory-bandwidth-benchmarks/. Accessed 20 May 2023.
6) Lheureux, Adil. "GPU Memory Bandwidth." Paperspace, 10 May 2022, blog.paperspace.com/understanding-memory-bandwidth-benchmarks/. Accessed 20 May 2023.
7) Buck, Ian. "Chapter 32. Taking the Plunge into GPU Computing." NVIDIA Developer, 1 Apr. 2005, developer.nvidia.com/gpugems/gpugems2/part-iv-general-purpose-computation-gpus-primer/chapter-32-taking-plunge-gpu. Accessed 20 May 2023.
8) Howard. "Deep Comparison Between Server CPU and GPU." FS Community, 1 Jun. 2022, community.fs.com/blog/deep-comparison-between-server-cpu-and-gpu.html. Accessed 20 May 2023.
9) Levinas, Mantas. "A Complete Introduction to GPU Programming With Practical Examples in CUDA and Python." Cherry Servers, 30 Sept. 2021, www.cherryservers.com/blog/introduction-to-gpu-programming-with-cuda-and-python. Accessed 20 May 2023.
