
Optimizing Memory Usage and Accesses on CUDA-Based Recurrent Pattern Matching Image Compression

Patricio Domingues¹, João Silva¹, Tiago Ribeiro¹, Nuno M.M. Rodrigues¹,²,
Murilo B. De Carvalho³, and Sérgio M.M. De Faria¹,²

¹ School of Management and Technology, Polytechnic Institute of Leiria, Leiria, Portugal
² Instituto de Telecomunicações, Portugal
³ TET/CTC, Universidade Federal Fluminense, Niterói - RJ, Brazil

Abstract. This paper reports the adaptation of the Multidimensional
Multiscale Parser (MMP) algorithm to CUDA. Specifically, we focus on
memory optimization issues, such as the layout of data structures in
memory, the type of GPU memory – shared, constant and global – and
on achieving coalesced accesses. MMP is a demanding lossy compression
algorithm for images. For example, MMP requires nearly 9000 seconds
to encode the 512 × 512 Lenna image on a 2013 Intel Xeon. One of
the main challenges in adapting MMP to manycore is related to the
dependency on a pattern codebook which is built during the execution.
This forces the input image to be processed sequentially. Nonetheless,
CUDA-MMP achieves a 12× speedup over the sequential version when
run on an NVIDIA GTX 680. By further optimizing memory operations,
the speedup is pushed to 17.1×.

Keywords: CUDA, memory optimization, manycore computing, image compression.

1 Introduction
Creation and usage of multimedia content is growing exponentially [1]. On one
hand, the exploding growth of cheap commodity digital capture devices such as
recorders, cameras, and more recently smartphones, tablets and other mobile
gadgets enables not only professionals but also individuals to record and produce
massive amounts of multimedia material. On the other hand, multimedia de-
vices and standards are evolving toward delivering higher quality, namely higher
definition. Indeed, some of the newest smartphones are able to capture video in
so-called Full HD (1080p), with some high end models even delivering so-called
4K video [2]. These applications impose high demands on storage and bandwidth.
For instance, streaming bandwidth for video content is 8 Mb/s for the standard
television format (SD), 19.3 Mb/s for HD and 195 Mb/s for Ultra-HDTV (4K),
this last one considering the H.265 format [3]. All this contributes to a massive
increase in the size of multimedia content, stressing not only storage but also
networks, used to transfer content. Compression mechanisms targeting multi-
media content are thus essential to reduce storage and bandwidth demands to
a manageable level. Examples include lossless mechanisms such as the Portable
Network Graphic (PNG) and JPEG XR formats for images, Windows Media
Lossless (WMA Lossless) and Free Lossless Audio Codec (FLAC) for sounds,
just to name a few. Lossy methodologies, which sacrifice part of the digital data
to achieve higher compression ratios, make an important contribution to reducing
the demands of multimedia content. Examples include MP3 and Ogg Vorbis
for sound, JPEG for images and MPEG-4 for video. Even so, current networks
struggle to deliver HD content, let alone 4K content [3].
The Multidimensional Multiscale Parser (MMP) is a lossy pattern-matching-
based algorithm for the compression of multimedia content, namely images [4,5].
It relies on a dynamically generated set of codebook patterns, which are used to
approximate each block of the input image. It uses a multiscale approach, which
enables the approximation of image blocks with different sizes.
MMP achieves high compression/quality ratio as measured through peak sig-
nal to noise ratio (PSNR), surpassing standards such as H.264 and JPEG2000 [6].
However, the compression/quality ratio achieved by MMP requires massive sin-
gle precision floating point computation to deal with the pattern matching pro-
cess. Indeed, MMP is computationally very complex, with the sequential CPU
version requiring lengthy execution times when compressing images. For in-
stance, the 512 × 512 Lenna 8-bit gray image requires more than two hours
of wall clock time on a state-of-the-art 2013 Intel Xeon machine.
MMP resorts to a dynamic codebook of blocks that are used to replace the
original blocks of the image to compress. In fact, for each original block of the
input image, many operations are performed in order to identify and, if needed,
to create the replacement block(s) that yield the lowest distortion. Moreover,
the best-fitted blocks are added to the codebook after the encoding operation
of each input block. The use of an adaptive codebook constitutes an important
challenge to the parallelization of MMP and hence to successfully porting the
algorithm to parallel accelerators such as Graphics Processing Units (GPUs).
In this work, we analyze the techniques used to achieve meaningful speedups
on commodity GPUs, resorting to the CUDA framework. We essentially focus
on the memory optimization issues.
GPUs have revolutionized high performance computing, with affordable com-
modity hardware hosting thousands of execution cores, yielding massive speedups
for certain types of applications. For instance, Lee et al. [7] report that a commod-
ity GPU is, on average, 14 times faster than a state of the art 6-core CPU over a
set of CPU- and GPU-optimized kernels. However, important adaptations need
to be made to the CPU-based applications to achieve proper speedup on GPUs.
The popularity of GPUs within the gamer community brings large scale volumes
that not only help to reduce prices but also push performance forward by way of in-
tense competition among manufacturers of GPUs, namely AMD and NVIDIA [8].

Paramount to the success of GPUs within the high performance community is the
availability of programming platforms such as the CUDA framework [9] and the
OpenCL standard [10], which, fitted with the appropriate development tools, bring
simplicity and efficiency to programmers.
In this paper, we depict the adaptation of the MMP algorithm to CUDA,
focusing on the main challenges and reporting the main performance results.
We believe that the main contributions of this paper lie in i) the way that a
pattern matching-based algorithm, with a major sequential dependency over its
codebook, can be successfully ported to CUDA manycore computing and ii) the
memory optimization strategies that pay off in CUDA.
The remainder of this paper is organized as follows. Section 2 focuses on
related work. Section 3 briefly presents CUDA. Section 4 describes the MMP
algorithm. Section 5 presents the main memory adaptations that were performed
to MMP-CUDA. In section 6, we assess the performance of the MMP-CUDA
version. Finally, section 7 concludes the paper and outlines some avenues for
future work.

2 Related Work
We review related work. We first focus on i) image encoding algorithms adapted
to GPUs and then on ii) studies regarding optimal data layout and placement
for GPUs.
Victor Lee et al. analyze the performance of several applications comparing
optimized versions of CPU vs. GPU, reporting a speedup of 14×, on average, for
the GPU versions [7]. However, the execution of the JPEG2000 image encoding
algorithm on GPUs – NVIDIA GeForce 9800 and NVIDIA GTX 280 – is slower
than the CPU versions run on an Intel Quad Core Q9450.
Ozsoy et al. present several optimizations, including memory-based ones, for
the CUDA-based version of the Lempel-Ziv-Storer-Szymanski (LZSS) lossless
data compression algorithm [11]. Regarding memory, the usage of texture mem-
ory brought, on average, a 16% speedup increase. With all optimizations com-
bined, the performance increase was 5.63 times relative to a previous CUDA-
based version. The optimized CUDA-LZSS is 34× and 2.21× faster than the
serial and the parallel (non-CUDA) versions, respectively.
Sodsong et al. introduce a JPEG decoding scheme optimized for systems comprising
a multicore CPU and an OpenCL-programmable GPU [12]. Under their
approach, the non-parallel tasks of the JPEG decoder are kept on the CPU, while
the parallel tasks are split between the CPU and the GPU. The GPU tasks are
additionally optimized for the OpenCL memory hierarchies. This approach yields a
speedup of 8.5× and 4.2× over the sequential decoder and the SIMD-version of
libjpeg-turbo [12], respectively.
Sung et al. present a formulation and language extension that enables auto-
matic data layout transformation for CUDA code [13]. This approach applies
a transformation tool, providing a data layout optimized for the underlying
CUDA memory hierarchy. The authors report a speedup comparable to hand-
tuned code. Nevertheless, this approach is targeted solely for structured grid
applications, where each output point is computed as a function of itself and its
nearest neighbors [13]. Additionally, the code needs to be adapted to the tool,
namely using FORTRAN-style subscripted array access. Likewise, Jaeger et al.
also focus on optimizing layouts for stencil-oriented multithreaded programs over
CPU and GPUs [14].
Sung, Liu and Hwu propose DL, a runtime OpenCL library for optimal data
layout placement on heterogeneous architectures [15]. DL implements Array of
Structure of Tiled Arrays (ASTA). Additionally, DL supports so-called in-
place data marshalling, in order to counteract the size bloat that frequently
occurs when data are laid out for performance reasons. For certain problem types,
DL yields speedups higher than the so-called discrete arrays layout – programmers
break the structure by hand – but only when the overhead from the runtime
marshalling is not taken into account.
Mei and Tian study the impact on CUDA performance of several data lay-
outs, namely the Structure of Arrays (SoA) and the Array of Structures (AoS)
approaches, for the Inverse Distance Weighting (IDW) interpolation [16]. The former
approach is based on a structure which holds all the needed data within its arrays,
with one array per type of data. The AoS approach is the one commonly
used in sequential programming. It comprises a structure which holds the basic
data, which is replicated in an array. They conclude that, due to the CUDA de-
pendence on coalesced memory accesses, SoA-based layouts deliver the highest
performance. Our approach confirms this dependency and reinforces the need
to properly lay out data in memory to maximize coalesced accesses and conse-
quently performance.

3 CUDA

Modern GPUs from NVIDIA and AMD have thousands of cores, bringing af-
fordable parallelism to the desktop. However, hardware without proper APIs and
tools to harness the massive parallelism is of limited use and thus of reduced
attractiveness. NVIDIA’s Compute Unified Device Architecture (CUDA) is a
framework that offers a comprehensive software stack to facilitate the use and
programmability of manycore GPUs. The CUDA software stack includes com-
pilers, profilers, libraries and a vast set of examples and samples.
CUDA extends the C++ programming language adding a few keywords, iden-
tifiers, modifiers and functions. For instance, functions such as cudaMalloc and
cudaMemcpy have names that relate their functionality to the well-known C
malloc and memcpy functions. This allows C++ and C programmers to rapidly
get acquainted with CUDA, once the parallel paradigm is mastered, and con-
tributes to the popularity of the framework. Indeed, although CUDA is propri-
etary, running solely on NVIDIA’s GPUs, Stamatopoulos et al. report that, so
far, CUDA is the only manycore technology which has achieved wide adoption
and usage [17]. In this paper, we focus solely on the CUDA runtime API, which
is simpler and hence attracts more users than the more verbose, but low level,
CUDA driver level API [9].

At the hardware level, a CUDA GPU is comprised of multiple multiprocessors
(MP), each one holding a given number of CUDA cores. The number of CUDA
cores per multiprocessor depends on the so-called compute capability of
the device. For instance, CUDA devices with compute capability 3.0 have 192
CUDA cores per multiprocessor. For example, the NVIDIA GTX 680 GPU is
comprised of 8 multiprocessors, holding a total of 1536 CUDA cores.
At the software level, a CUDA program is made up of code to run on the CPU
and code to explicitly run on the GPU. The CPU code controls the main func-
tions such as allocating memory on the GPU, controlling data transfers between
the host main memory and the diverse types of memory of the GPU (global,
constant, shared, etc.), and launching GPU code. The GPU code is organized into
so-called kernels and device functions. A kernel represents an entry point into
the GPU, with its code being executed by the CUDA cores of the GPU. Specif-
ically, the CUDA framework exposes the cores to the programmer through a
so-called execution grid which comprises blocks organized in a layout that
can have up to three dimensions. Furthermore, a block holds a given number
of threads, with threads being the active execution entities. Like blocks, threads
within a block are organized in a layout that can have up to three dimensions.
The number of blocks at each dimension is specified by the programmer when
calling a kernel. Within the same call, the programmer also specifies the di-
mensionality and number of threads per block. Within the GPU code, special
CUDA identifiers such as threadIdx (available in the x, y and z dimensions) and
blockIdx allow the programmer to identify each thread’s position in the grid and thereby
distribute work among threads and blocks. Although programming in CUDA is
rather trivial, extracting good performance is much more difficult, since many
performance details are tied to the underlying architecture [18,19].
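As a brief illustration of these concepts, consider the following hypothetical kernel (not part of MMP; names are purely illustrative) that adds two vectors. Each thread combines blockIdx, blockDim and threadIdx into a global index to select the element it processes:

// Hypothetical example (not part of MMP): element-wise vector addition.
// Each thread computes one output element, identified by its global index.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the array bounds
        c[i] = a[i] + b[i];
}

// Host side: launch a one-dimensional grid with 256 threads per block.
// int threadsPerBlock = 256;
// int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);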
Examples of factors that decisively influence the performance of CUDA pro-
grams include adherence to the underlying memory model, namely coalesced
accesses to memory [20,21], resorting to shared memory in order to take ad-
vantage of its lower memory latency [19] and using constant memory to offload
read-only data in order to speed up the simultaneous access of a warp of threads
to these data. We study the effectiveness of these techniques in this paper.

4 The MMP Algorithm

The MMP algorithm is based on multiscale pattern matching with exhaustive
search through an adaptive multiscale codebook, also referred to as dictionary.
Specifically, MMP parses the input image into blocks, which are compared with
every element of the codebook. Unlike conventional matching pattern algorithms,
which use fixed-sized blocks, MMP segments the input block to search for a
better representation of the signal as a concatenation of different sized patterns.
The use of different block sizes, or scales, accounts for the important multiscale
feature of the algorithm. The codebook is comprised of elements of all possible
scales. The adaptive nature of the codebook means that while the algorithm is
coding a signal, MMP uses information gathered in the parts already coded to
create new codebook entries (code-vectors), thus yielding a dynamic codebook.
These new code-vectors become available to approximate the blocks of the input
signal not yet encoded.
Although the MMP algorithm is suited for input signals with any dimension,
in this work we address standard grayscale images, i.e., two dimensional arrays
of bytes, which represent intensity values that range from 0 to 255. Furthermore,
the elementary input signal (block) corresponds to a 16 × 16 pixel block, with
the algorithm processing the input blocks of the image from left to right and top
to bottom.
To find the best element of the codebook that should be used to represent the
current block of the input signal, the MMP algorithm resorts to the Lagrangian
cost J, given by the equation J = D + λ·R, where D measures the distortion
and R represents the number of bits (rate) needed to encode the dictionary
element. The numerical parameter λ is given by the user and remains constant
throughout the encoding process. It allows tuning the balance between quality
(small λ, e.g. λ = 5) and bitrate (large λ, e.g. λ = 300). Note that a small λ
drives quality at the cost of computational complexity and a larger codebook.
The version of the MMP algorithm used here combines adaptive pattern matching
with a predictive coding step. MMP uses an intra-frame prediction algorithm
to determine prediction blocks, by using the values of the previously encoded
neighboring pixels [5]. In this case, MMP does not deal directly with the input
image. Instead, it operates on a residue block, which is the difference between
each pixel of the prediction block and the corresponding pixel of the input block.
The rationale for encoding the residue instead of the input signal lies in the fact
that the residue’s values have much less variability and hence can be more
efficiently compressed [5]. MMP currently employs 10 prediction modes. This
means that, for each input block, MMP has to analyze all the candidate residue
blocks that arise from the 10 prediction modes.
A major part of the computational complexity of MMP is associated with
determining the distortion between each residue block and every available code-
vector. The distortion, D, between an input block, X and a given code-vector
C is given by the mean-squared error (MSE), defined by:
D = \sum_{i=1}^{M} \sum_{j=1}^{N} \left( X(i,j) - C(i,j) \right)^2 ,    (1)

where M × N represents the size of X and C.
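As a sketch of equation (1), the following hypothetical device function (the name and data layout are ours, not the actual MMP-CUDA code) accumulates the squared differences between an M × N block and a code-vector, both stored row by row as linear arrays:

// Sketch of equation (1): sum of squared differences between an input
// block X and a code-vector C, both laid out row by row (M x N values).
// Hypothetical helper, not the actual MMP implementation.
__device__ int blockDistortion(const int *X, const int *C, int M, int N)
{
    int d = 0;
    for (int i = 0; i < M * N; i++) {
        int diff = X[i] - C[i];
        d += diff * diff;       // (X(i,j) - C(i,j))^2, accumulated
    }
    return d;
}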


For each residue block, the MMP algorithm computes the distortion for every
code-vector available in the codebook. This procedure allows MMP to determine
the code-vector that will be used to represent the input block. As was previously
mentioned, MMP does not stop its analysis at the current block scale. It also com-
putes the distortion for all the subscales of the block. For this purpose, the block is
first segmented in half, either vertically into two blocks of M/2 × N pixels (verti-
cal segmentation) or into two blocks of M ×N/2 pixels (horizontal segmentation).
This process goes on recursively until scale 0 (1 × 1 blocks) is reached. A 16 × 16
pixels block can yield 961 segments: two 8 × 16, two 16 × 8, four 8 × 8 and so on,
down to the 256 single-pixel blocks (scale 0). Thus a tree of blocks and subblocks is
built, with the distortion being computed for all the blocks that correspond to the
tree nodes. This requires a considerable amount of computation time to fully en-
code an input image. As we shall see later on, the MMP encoding of the 512 × 512
well-known Lenna image requires slightly less than 9000 seconds on a 2013 high-
end machine fitted with an Intel Xeon CPU.
In a first stage, MMP expands the entire tree, determining the cost for every
possible node (sub-block). After this, a recursive pruning strategy compares the
sum of the distortion of the children blocks against the distortion of their parent.
If the distortion sum of the two children blocks is lower than the distortion of
their parent block, the children blocks are kept, otherwise they are pruned. This
results in a pruned tree, which represents the optimum segmentation of the orig-
inal input block. After encoding the elements of the optimum tree, new blocks,
generated by applying scale transformation over the concatenation of every pair
of child nodes, are added to the codebook. The whole process – building the tree
and its subsequent prune operation – is repeated for every block of the input
image.
The codebook updating procedure, which takes place at the end of the encod-
ing operation of each block of the input image, creates a sequential dependency.
This impedes the traditional approach for parallelization of image-related algo-
rithms, where several input blocks are processed in parallel. Thus parallelism
needs to be exploited within the operations that are performed to encode an in-
dividual input block. For an individual block, parallelization opportunities arise
from the usage of multiple prediction modes – each prediction mode forms a task
– and from the computationally costly operations of computing the distortion, as
well as finding the best suited block and updating the codebook.

5 Porting MMP to CUDA

The port of the MMP algorithm was preceded by the profiling of the sequen-
tial version to identify the performance hot spots, whose adaptation to CUDA
could provide performance gain. The profiling pointed out that the most time
consuming operations of the sequential version are i) the computation of the
distortion of the current input block with all the blocks of the codebook and
ii) the update of the adaptive codebook which is performed at the end of the
encoding operation of every input block.

5.1 Naı̈ve Approach

Two CUDA kernels were created – kernelDistortion and kernelReduction – in
order to offload the computational hotspots to the GPU. The kernel kernelD-
istortion computes not only the distortion set of a given input block against
all existing blocks in the codebook, but also the Lagrangian, while kernelRe-
duction locates the block that has the lowest Lagrangian, effectively performing
a reduction operation, hence its name [18]. The kernels are chained together,
with the second one called right after the first one has computed the whole set
of distortions. Note that the final part of the reduction is performed by the CPU
as is usually the case in GPU-based reductions [22]. This chained pair of calls –
kernelDistortion and then kernelReduction – occurs many times for every block
of the input image due to the multiple paths that are exploited by MMP by way
of prediction modes and multiscale.
Regarding data, the set of blocks that forms the adaptive MMP codebook
is kept in the global memory, along with the distortion set, allowing the ker-
nel kernelReduction to use this data, thus avoiding costly CPU-GPU copies.
Nonetheless, in the naı̈ve approach, the whole codebook is laid out in global
memory in a similar way to the one adopted in the CPU version. This yields
inefficient accesses, especially non-coalesced ones.
The poor performance results of the naïve approach confirm that although
porting a program to CUDA is rather simple, achieving good performance re-
quires great efforts to accommodate the code to the CUDA architecture [9].
Indeed, the MMP-CUDA naı̈ve version performed as much as 10 times slower
than the CPU sequential version.

5.2 Optimizations
Non-memory Related. The performance of the naı̈ve approach was analyzed
with the NVVP profiling tool to pinpoint the performance bottlenecks and to
correct them. Then, we applied several non-memory related optimizations. For
instance, the reduction kernel was optimized following the practices laid out by
Mark Harris [23]. Additionally, both the kernelDistortion and kernelReduction
were revised with the goal of reducing the usage of registers in order to increase
the occupancy of the GPU.
Specifically, the updated kernelDistortion has each CUDA thread computing
the distortion between one partition of the residue block and one element of the
codebook of the same level/scale. This way, a single call of the kernel computes
all the distortion values for every partition of the given block, sweeping across all
the elements of the codebook. For instance, when dealing with a block of scale
24 (a block with 16 × 16 pixels), the kernelDistortion computes the distortion
between the block and all the blocks of the codebook, repeating the process for
every partition of scale 24 down to scale 0 (individual pixels). The final output
of kernelDistortion is an array holding the distortion of the input block and its
partitions for every element of the codebook. Regarding the CUDA execution
grid, the updated kernelDistortion has no limitations or restrictions, and can be
executed under any grid.
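A heavily simplified sketch of this organization is given below. It assumes a single scale with fixed-size partitions and the simple, pre-optimization codebook layout; identifiers are illustrative and the grid-stride loop is what makes the kernel independent of the launch configuration:

// Simplified sketch of the distortion kernel: each thread pairs one
// codebook element with one partition of the residue block and writes
// the resulting distortion to a flat output array. Illustrative only.
__global__ void kernelDistortionSketch(const int *residue,   // residue block pixels
                                       const int *codebook,  // code-vectors, linearized
                                       int *distortion,      // output: [partition][element]
                                       int numElements, int numPartitions,
                                       int pixelsPerPartition)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // grid-stride loop: sweep all (partition, element) pairs regardless of grid size
    for (int k = tid; k < numPartitions * numElements; k += gridDim.x * blockDim.x) {
        int p = k / numElements;   // which partition of the residue block
        int e = k % numElements;   // which codebook element
        int d = 0;
        for (int px = 0; px < pixelsPerPartition; px++) {
            int diff = residue[p * pixelsPerPartition + px]
                     - codebook[e * pixelsPerPartition + px];
            d += diff * diff;
        }
        distortion[p * numElements + e] = d;
    }
}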
The kernel kernelReduction, which aims to select the best codebook’s entry for
each possible partition of the block being coded, was also heavily changed. It now
returns, for each partition, the block that yields the lowest Lagrangian cost. For
instance, for a coding block of scale 24 (16 × 16), the kernel returns an array with
705 elements, that is, one per partition excluding the 256 single-pixel partitions.
Indeed, a 16 × 16 block has 961 possible partitions, 256 of which correspond to
single-pixel partitions. To achieve its purpose, the kernel acts in two main steps.
First, each thread loads the previously computed distortion and computes the
Lagrangian cost, looping over a set of values (e.g., thread 0 handles the values
from index 0, index 128, etc.). Having finished its loop, each thread saves the
lowest Lagrangian cost into shared memory, yielding an array which holds the
lowest costs. The second step actually reduces the array built in the previous
step, through a parallel GPU reduction [22]. Note that each block of CUDA
threads deals with a given partition, hence there are as many blocks of CUDA
threads as there are partitions.
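The two steps can be sketched as follows (a minimal sketch, assuming 256 threads per block, with the rate() term and all identifiers illustrative rather than the real MMP-CUDA code):

#include <cfloat>   // FLT_MAX

// Simplified sketch of the reduction kernel: one CUDA block per partition.
// Step 1: each thread scans a strided subset of the codebook entries, keeping
// the lowest Lagrangian cost J = D + lambda * R it encounters.
// Step 2: a shared-memory tree reduction selects the block-wide minimum.
__global__ void kernelReductionSketch(const int *distortion,  // [partition][element]
                                      const float *rate,      // bits to encode each element
                                      float lambda, int numElements,
                                      float *bestCost, int *bestIndex)
{
    __shared__ float sCost[256];
    __shared__ int   sIdx[256];

    int part = blockIdx.x;               // this CUDA block handles one partition
    int tid  = threadIdx.x;

    float best = FLT_MAX;
    int   idx  = -1;
    for (int e = tid; e < numElements; e += blockDim.x) {        // strided scan
        float J = distortion[part * numElements + e] + lambda * rate[e];
        if (J < best) { best = J; idx = e; }
    }
    sCost[tid] = best;
    sIdx[tid]  = idx;
    __syncthreads();

    // classic shared-memory reduction (assumes blockDim.x is a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sCost[tid + s] < sCost[tid]) {
            sCost[tid] = sCost[tid + s];
            sIdx[tid]  = sIdx[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) {                       // thread 0 writes the winner for this partition
        bestCost[part]  = sCost[0];
        bestIndex[part] = sIdx[0];
    }
}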
A change that affected both kernels was the Lagrangian computation, which
was moved from the kernelDistortion to the kernelReduction. The change was per-
formed to save variables and hence registers, thus allowing for a larger execution
grid. The revision also allowed us to reduce the number of calls to each ker-
nel, thus reducing overhead. As we shall see later on, these non-memory related
optimizations yielded a meaningful 11.96× speedup relative to MMP-CPU.
We then focused on memory-related optimizations to pursue further perfor-
mance improvement. Indeed, an important area of CUDA performance is the
proper usage of the CUDA memory hierarchy [22,21]. CUDA presents sev-
eral levels of memory: the relatively slow global memory, the fast but block-
limited and size-limited shared memory and the specific-purpose constant mem-
ory [20]. Next, we report the efforts and results achieved through the optimiza-
tion of memory related aspects of MMP-CUDA.
Using the Constant Memory. CUDA constant memory is a read-only mem-
ory from the point of view of the GPU, with values being loaded through CPU
code. It has a limited size of 64 KB. Although the constant memory is not the
fastest CUDA memory – it comes third after the register file (fastest) and the
shared memory – it is cached [21]. More importantly, the constant memory is
specially adapted for broadcasting. In fact, it excels when all the threads of
the warp simultaneously access the same value held in constant memory, yield-
ing performance comparable to accessing the register file [21]. This broadcast
scenario is exploited by MMP through the kernelDistortion and it is the main
reason for the layout adopted to represent the codebook in GPU memory as de-
scribed above. Indeed, as explained earlier, all the 32 threads of a warp running
the kernelDistortion compute the distortion of the same pixel of the input block.
Therefore, by storing the residue of the input block on the constant memory, the
value of the same pixel is broadcast to the warp that also benefits from coalesced
accesses to the blocks of the codebook. Since the value of a pixel is represented
by an integer, the 64 KB of the constant memory are more than enough to
hold the needed values. Likewise, some control parameters were moved to the
constant memory to benefit from the broadcast effect.
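A minimal sketch of this usage follows (the symbol name and the 16 × 16 size are illustrative; only the declaration and the host-side copy are shown):

// Sketch of the constant-memory usage: the residue of the current input
// block is copied to constant memory so that, when all 32 threads of a warp
// read the same pixel, the value is broadcast to the whole warp.
__constant__ int c_residue[16 * 16];   // residue of the block being encoded

// Host side, before launching the distortion kernel:
// cudaMemcpyToSymbol(c_residue, h_residue, 16 * 16 * sizeof(int));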
Non-pageable Memory. Usage of non-pageable memory is a simple optimiza-
tion technique since it only involves minor changes, namely switching some API
calls [21]. The non-pageable memory is a set of memory pages from the host ma-
chine that cannot be swapped out by the operating system. In CUDA, usage of
non-pageable memory speeds up the memory transfer between the CPU and the
GPU, allowing the CUDA memory engine to copy data through Direct Memory
Access (DMA) between the CPU and the GPU without the need to resort to
extra internal copies, as is necessary when dealing with pageable memory. One
of the problems of using non-pageable memory is the strain that it can cause
on the memory management system, since it reduces the available quantity of
pageable memory. However, since the codebook for each scale is limited to 50000
elements, the maximum size of the whole codebook is 183 MB, an amount that
should cause no problem, since 4 GB or more of RAM are common nowadays.
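As a sketch (buffer names and sizes are illustrative, not the actual MMP-CUDA code), only the host allocation and deallocation calls change when switching to pinned memory:

#include <cuda_runtime.h>

// Sketch: allocating the host-side codebook buffer as pinned (non-pageable)
// memory so that host-device copies can be served directly through DMA.
void copyCodebookPinned(size_t codebookBytes, int *d_codebook)
{
    int *h_codebook = NULL;
    cudaMallocHost((void **)&h_codebook, codebookBytes);  // pinned, instead of malloc()
    // ... fill h_codebook with the current codebook ...
    cudaMemcpy(d_codebook, h_codebook, codebookBytes, cudaMemcpyHostToDevice);
    cudaFreeHost(h_codebook);                             // instead of free()
}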
Coalescing Memory Accesses to the Codebook. An important issue re-
garding performance on CUDA is properly accessing the global memory in order
to reduce the number of transactions with the memory controller. The recom-
mendation is to access memory in a so-called coalesced form, with neighbor
threads of a warp accessing neighbor elements. For instance, thread 0 of a given
CUDA block of threads accesses index 0 of the array, thread 1 accesses index 1
of the array and so forth. This way, the access of the 32 threads that comprise
a warp only incurs the cost of a single memory transaction. This assumes
that the CUDA array starts on properly aligned memory, which is the
case whenever the CUDA memory allocation functions are used. On the other
hand, non-coalesced memory accesses result in multiple memory transactions
that might significantly slow down computation [21]. Note also that memory
accesses and the effects of improperly coalesced accesses depend on the compute
capability of the GPU. For instance, for devices with compute capability 2.x, and
as stated in [21], the concurrent accesses of the threads of a warp will coalesce
into a number of transactions equal to the number of cache lines necessary to
service all of the threads of the warp, with a cache line holding 128 bytes.
The MMP-CPU version organizes the codebook in memory as a two-dimensional
array, as follows. The first index represents the scale, with its values ranging from
0 to 24 which corresponds to blocks 1 × 1 (scale 0) to 16 × 16 (scale 24). The second
index holds the offset of the pixel where the entry starts. For instance, for the scale
24, each block is represented by 16 × 16 integers, meaning that the first block starts
at offset 0, the second at offset 256 and so forth.
The naı̈ve CUDA implementation followed the MMP-CPU memory layout
organization for the codebook. However, as easily seen, this yields non-coalesced
memory accesses, since within the kernel, consecutive threads work on consecu-
tive entries of the codebook. For instance, thread 0 accesses the first codebook
entry for the scale it is dealing with, thread 1 performs the same for the second
codebook entry and so forth. This way, serving the memory request for a warp
requires multiple memory transactions, since neighbor accesses are 256 bytes
apart for scale 24. To reduce the number of memory transactions, the mem-
ory layout used to keep the GPU’s copy of the codebook was deeply changed.
First, a single large block of GPU memory is used to hold the entire codebook.
The size of the GPU’s memory block corresponds to the maximum size that the
codebook can attain. This maximum size is easily computed, since the maximum
number of elements per scale that a codebook can hold is a parameter of the
MMP algorithm, fixed at startup. In our experimental setup, the codebook was
dimensioned for a maximum of 50000 elements per scale, corresponding roughly
to 183 MB as reported earlier. Therefore the codebook fits easily in CUDA
devices, since CUDA only runs on devices that have at least 512 MB of memory.
Moreover, modern GPUs all have 1 GB or more of memory.
The major changes with the codebook are related to its internal organization.
Specifically, the codebook is laid out as follows: the first value corresponds to
the value of the first pixel of codebook’s block 0, the second value corresponds
to the value of the first pixel of codebook’s block 1, and so forth up to the 32nd
value, which corresponds to the value of the first pixel of block 31. This means
that the first 32 values correspond to the first pixel of the first 32 blocks of the
codebook. The second set of 32 values (from the 33rd to the 64th value) holds
the second pixel for the first 32 blocks. This goes on until all the pixels of all the
first 32 blocks of the codebook are laid out. Then, the process is repeated for
the second set of blocks (from 33 to 64). To ensure that the codebook is properly
aligned in memory, padding is used when needed to reach 128-byte boundaries.
It is important to note that the codebook comprises the blocks for all the 25
supported scales, ranging from scale 0 to scale 24. For each of them, the above
layout is adopted.
The rationale for contiguously aligning 32 values per set stems from the fact
that it allows all the 32 threads of a warp to access, within a single memory
transaction, the value of the same order pixel. This way, thread 0 accesses the
ith order pixel for block 0, thread 1 accesses the ith order pixel for block 1 and
so forth. Under this memory layout, the codebook can be accessed efficiently not
only when the distortion is being computed by kernel kernelDistortion, but also
when the codebook is searched to check whether blocks to be inserted already
exist in the codebook, as we shall see in the next section.
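The indexing that results from this layout can be sketched as follows (scale offsets and the 128-byte padding are omitted for brevity; the helper name is ours):

// Sketch of the tiled, pixel-major codebook layout: blocks are grouped in
// sets of 32 and, inside a group, values are stored pixel by pixel, so the
// 32 threads of a warp read 32 consecutive words in one transaction.
__device__ __forceinline__ int codebookIndex(int block, int pixel, int pixelsPerBlock)
{
    int group = block / 32;                 // which set of 32 code-vectors
    int lane  = block % 32;                 // position inside the set
    return group * (32 * pixelsPerBlock)    // skip previous groups
         + pixel * 32                       // rows of 32 same-order pixels
         + lane;
}

// In the distortion kernel, thread t of a warp handling block (base + t) reads
// codebook[codebookIndex(base + t, pixel, pixelsPerBlock)]; the 32 accesses for
// the same pixel then fall in consecutive addresses and coalesce.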
Reengineering the Codebook Update. At the end of the encoding operation
of an input block, the codebook requires an update if new encoding block(s) have
been used. However, MMP first needs to determine whether these blocks are effectively
new. This means that a search is needed over the codebook. For this purpose,
a third CUDA kernel – kernelSearchCodebook – was devised for the memory
optimized version of MMP. In reality, MMP does not use an exact comparison
criterion to judge whether two blocks are equivalent or not. Instead, it resorts
to the MSE metric computed between the two blocks: if the distance is less
than a fixed threshold, then the blocks are considered similar, meaning that the
codebook is not updated with the block. Therefore, the lookup of the codebook
for update purposes requires computing the MSE between the new candidate block
to be inserted and all the blocks of the codebook that belong to the same scale.
This computation can be performed efficiently in parallel, precisely through the
kernelSearchCodebook.
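A simplified sketch of this search follows, reusing the codebookIndex helper sketched earlier; names, the threshold handling and the flag mechanism are illustrative rather than the actual MMP-CUDA code:

// Simplified sketch of the codebook search: each thread compares the candidate
// block against one or more existing code-vectors of the same scale and raises
// a flag when the distance falls below the similarity threshold.
__global__ void kernelSearchCodebookSketch(const int *candidate, const int *codebook,
                                           int numElements, int pixelsPerBlock,
                                           int threshold, int *found)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = tid; e < numElements; e += gridDim.x * blockDim.x) {
        int d = 0;
        for (int px = 0; px < pixelsPerBlock; px++) {
            int diff = candidate[px] - codebook[codebookIndex(e, px, pixelsPerBlock)];
            d += diff * diff;
        }
        if (d < threshold)
            atomicExch(found, 1);   // a similar block already exists: do not insert
    }
}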

6 Performance Evaluation

6.1 Experimental Environment

To assess the performance of the various versions of MMP-CUDA, we resorted
to a desktop machine fitted with two Intel Xeon E5-2630 CPUs, 64 GB
of DDR3 RAM memory and a NVIDIA GTX 680 connected through the PCI
Express 3.0. The GTX 680 is powered by the GK104 Kepler architecture chip,
delivering compute capability 3.0. It has 2 GB of DDR5 memory, 1536 CUDA
cores evenly distributed over 8 CUDA multiprocessors (192 CUDA cores per
multiprocessor). The bandwidth as measured by the bandwidth benchmark pro-
vided in NVIDIA SDK is 11.081 GB/s between host to device, 10.957 GB/s
between device to host and 145.865 GB/s between the device itself. We note
that the last value regarding the device-device bandwidth is considerably less
than the theoretical 192 GB/s. Additionally, although the GTX 680 has a dou-
ble precision floating point performance which is only 1/24 of its single precision
floating-point performance, this does not affect our tests, since MMP relies solely
on single precision floating-point. The machine uses a 64-bit version of the Linux operat-
ing system with kernel 3.2.0. Finally, the GPU software stack comprises CUDA
version 5.5 and version 304.88 of the Linux NVIDIA GPU driver.

6.2 Main Performance Results

The main results for the various CUDA versions of MMP are shown in table 1.
All tests were run five times. The λ parameter was set to 10, which corresponds
to a quality-oriented setup that is computationally demanding. The input was
the well-known 512 × 512 pixel Lenna image with 8-bit gray intensity.
The kernels were run with the CUDA execution grids that experimentally
yielded the best execution times. Specifically, the kernels kernelDistortion and
kernelSearchCodebook were both set with a one-dimensional execution grid, with
64 CUDA blocks, each block holding 256 threads. For the kernel kernelReduction
a bi-dimensional grid of blocks was selected. Specifically, the number of CUDA
blocks for this kernel corresponds on its x-dimension to the number of available
partitions and on the y-dimension to the number of available prediction modes.
Again, each CUDA block was set with 256 threads organized in a one-dimension
layout.
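For illustration only, these launch configurations can be written as follows (kernel arguments are omitted and the kernels refer to the simplified sketches of Section 5; this fragment is assumed to live inside the host-side encoding loop):

// Sketch of the launch configurations described above.
dim3 threads(256);                       // 256 threads per block, one dimension
dim3 gridDist(64);                       // 64 blocks for kernelDistortion / kernelSearchCodebook
// kernelDistortion<<<gridDist, threads>>>(/* ... */);
// kernelSearchCodebook<<<gridDist, threads>>>(/* ... */);

int numPartitions = 705;                 // partitions of a scale-24 block (see Section 5)
int numPredictionModes = 10;             // prediction modes used by MMP
dim3 gridRed(numPartitions, numPredictionModes);   // bi-dimensional grid
// kernelReduction<<<gridRed, threads>>>(/* ... */);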
The column Execution time displays the wall clock time in seconds, which
corresponds to the average value of the five executions. The column stdev reports
the standard deviation of the runs. The column Speedup vs. CPU holds the
speedup of the current version relatively to the CPU version of MMP. Finally, the
column Speedup vs. reference shows the speedup of the current version relatively
to the MMP-CUDA reference version, that is, the MMP-CUDA version that
holds all non-memory related optimizations.
It is important to note that the current CPU version is single-threaded, and
thus does not take advantage of manycore CPUs. We aim to address this issue in
the future.

Table 1. Main performance results

Version                        Execution time (s)   stdev    Speedup vs. CPU   Speedup vs. reference
MMP-CPU                              8990.584       14.797    1                 –
MMP-CUDA Naive                      23250.678       36.660    0.387             –
MMP-CUDA reference                    752.012       11.107   11.955             1
ConstMem                              732.440        5.590   12.275             1.027
ConstMemNP                            716.570        1.178   12.547             1.049
ConstMemNPCoal                        665.438        2.448   13.511             1.13
ConstMemNPCoalSearch                  529.142        0.573   16.991             1.421
ConstMemNPCoalSearchConst             522.974        1.921   17.191             1.438
ConstMemNPCoalSearchShared            527.578        1.156   17.041             1.425

The low performing results of the naïve version confirm that attaining
proper performance with CUDA is seldom feasible by just porting the sequential
code to CUDA. Indeed, the first set of optimizations and reengineering yielded
a nearly 12× performance improvement relative to MMP-CPU, as reported by
the MMP-CUDA reference version. The problems with the naïve version were
manifold. Firstly, the kernels were consuming many registers, thus reducing the
available parallelism. Secondly, the code was performing a very large number of
kernel calls and associated CPU/GPU memory copies, both situations contribut-
ing to a significant overhead. Finally, the kernelReduction was poorly balanced,
with one thread assigned per partition of MMP.
We now focus on the results related to memory optimization. The Con-
stMem version makes use of the constant memory features for the distortion
computation, achieving a mere 2.7% improvement. A similar improvement is
achieved when non-pageable memory is used on top of the constant memory op-
timization (version ConstMemNP ). Interestingly, the use of non-pageable mem-
ory seems to reduce the standard deviation of the tests. We aim to assess this
hypothesis in future work.
The coalesced version – ConstMemNPCoal – also contributes a small perfor-
mance improvement, around 8%. Altogether, the combined usage of constant
memory, non-pageable memory and coalesced accesses delivers a 13% performance
improvement.
A meaningful improvement was achieved through the creation of the kernel
kernelSearchCodebook. As stated earlier, this kernel filters out candidate blocks
that already exist in the codebook, thus preventing duplicate insertions and elim-
inating costly operations from the CPU. The performance benefits are shown by
the ConstMemNPCoalSearch version, which attains a 42.1% improvement rela-
tive to the MMP-CUDA reference version and a nearly 17× speedup over MMP-CPU.
Finally, the two last reported versions – ConstMemNPCoalSearchConst and
ConstMemNPCoalSearchShared – are variants of the ConstMemNPCoalSearch
version. In these versions, the kernel kernelSearchCodebook uses, respectively,
constant and shared memory to hold the candidate blocks that are to be looked
up over the codebook to assess whether they already exist or not. However, the
performance improvements for both versions are marginal.

7 Conclusion and Future Work

We study the performance improvements achieved by resorting to memory-
related optimizations of a CUDA-based version of the MMP algorithm. Although
a final 17× speedup was attained relative to the CPU sequential version, the
contribution of memory-based optimizations accounts for only a relatively small
part of the whole performance improvement. In fact, the main performance im-
provement derives from adapting costly operations of the CPU version to the
GPU through the creation of appropriately crafted kernels. Our work also ex-
emplifies that merely porting a relatively complex algorithm to CUDA is not
enough. Indeed, deep adaptations need to be made; otherwise, merely porting
the code to CUDA most probably results in a slowdown, as happened
with our MMP-CUDA Naïve version.
Taking into consideration the sequential restrictions of the MMP algorithm,
namely the causality that exists between the encoding of successive input blocks
– the encoding operation of the nth input block can only start after the comple-
tion of the encoding of the (n − 1)th input block – achieving a 17× performance
improvement is an important testimony to the performance that can be attained
with GPUs and CUDA.
As future work, we plan to assess the optimized version of MMP-CUDA in
newer GPU hardware. Additionally, we plan to study its adaptation to OpenCL
GPU-based platforms and to analyze the suitability of the sequential code to be
adapted to multicore CPUs. Finally, we aim to further revise the MMP algorithm
to identify optimizations at the algorithm level that can boost performance both
for the CPU and the GPU versions.

Acknowledgments. Financial support provided by FCT (Fundação para a
Ciência e Tecnologia, Portugal), under the grant PTDC/EIAEIA/122774/2010.

References

1. Seltzer, M.L., Zhang, L.: The data deluge: Challenges and opportunities of un-
limited data in statistical signal processing. In: IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP 2009, pp. 3701–3704. IEEE
(2009)
2. Murakami, T.: The development and standardization of ultra high definition video
technology. In: Mrak, M., Grgic, M., Kunt, M. (eds.) High-Quality Visual Expe-
rience. Signals and Communication Technology, pp. 81–135. Springer, Heidelberg
(2010)
3. Coughlin, T.: Evolving Storage Technology in Consumer Electronic Products (The
Art of Storage). IEEE Consumer Electronics Magazine 2(2), 59–63 (2013)

4. De Carvalho, M.B., Da Silva, E.A., Finamore, W.A.: Multidimensional signal com-
pression using multiscale recurrent patterns. Signal Processing 82(11), 1559–1580
(2002)
5. Rodrigues, N.M., da Silva, E.A., de Carvalho, M.B., de Faria, S.M., da Silva,
V.M.M.: On dictionary adaptation for recurrent pattern image coding. IEEE Trans-
actions on Image Processing 17(9), 1640–1653 (2008)
6. De Simone, F., Ouaret, M., Dufaux, F., Tescher, A.G., Ebrahimi, T.: A comparative
study of JPEG2000, AVC/H.264, and HD photo, vol. 6696, pp. 669602–669602–12
(2007)
7. Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish,
N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.:
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing
on CPU and GPU. SIGARCH Comput. Archit. News 38, 451–460 (2010)
8. de Verdière, G.C.: Introduction to GPGPU, a hardware and software background.
Comptes Rendus Mécanique 339(23), 78–89 (2011); High Performance Computing
Le Calcul Intensif
9. Farber, R.: CUDA Application Design and Development. Morgan Kaufmann (2011)
10. Stone, J.E., Gohara, D., Shi, G.: OpenCL: A parallel programming standard for
heterogeneous computing systems. Computing in Science and Engineering 12(3),
66–73 (2010)
11. Ozsoy, A., Swany, M., Chauhan, A.: Optimizing LZSS compression on GPGPUs.
Future Generation Computer Systems 30, 170–178 (2014); Special Issue on Extreme
Scale Parallel Architectures and Systems. In: Cryptography in Cloud Computing
and Recent Advances in Parallel and Distributed Systems, ICPADS 2012 Selected
Papers
12. Sodsong, W., Hong, J., Chung, S., Lim, Y., Kim, S.D., Burgstaller, B.: Dynamic
partitioning-based JPEG decompression on heterogeneous multicore architectures.
In: Proceedings of Programming Models and Applications on Multicores and Many-
cores, PMAM 2014, pp. 80:80–80:91. ACM, New York (2014)
13. Sung, I.J., Stratton, J.A., Hwu, W.M.W.: Data layout transformation exploiting
memory-level parallelism in structured grid many-core applications. In: Proceed-
ings of the 19th International Conference on Parallel Architectures and Compila-
tion Techniques, PACT 2010, pp. 513–522. ACM, New York (2010)
14. Jaeger, J., Barthou, D., et al.: Automatic efficient data layout for multithreaded
stencil codes on CPUs and GPUs. In: IEEE Proceedings of High Performance
Computing Conference, pp. 1–10 (2012)
15. Sung, I., Liu, G., Hwu, W.: DL: A data layout transformation system for hetero-
geneous computing. In: Innovative Parallel Computing (InPar), pp. 1–11. IEEE
(2012)
16. Mei, G., Tian, H.: Performance Impact of Data Layout on the GPU-accelerated
IDW Interpolation. ArXiv e-prints (February 2014)
17. Stamatopoulos, C., Chuang, T.Y., Fraser, C.S., Lu, Y.Y.: Fully automated im-
age orientation in the absence of targets. ISPRS - International Archives of the
Photogrammetry, Remote Sensing and Spatial Information Sciences XXXIX-B5,
303–308 (2012)
18. Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable Parallel Programming
with CUDA. Queue 6, 40–53 (2008)
19. Nvidia, C.: NVIDIA CUDA Programming Guide (version 5.5). NVIDIA Corpora-
tion (2013)
20. Wilt, N.: The CUDA Handbook: A Comprehensive Guide to GPU Programming.
Pearson Education (2013)

21. Nvidia, C.: NVIDIA CUDA C Best Practices Guide - CUDA Toolkit v5.5. NVIDIA
Corporation (2013)
22. Kirk, D.B., Wen-mei, W.H.: Programming Massively Parallel Processors: a Hands-
on Approach, 2nd edn. Newnes (2012)
23. Harris, M., et al.: Optimizing parallel reduction in CUDA. NVIDIA Developer
Technology 2, 45 (2007)
