Parallel Hashing, Compression and Encryption with OpenCL under OS X

Vasileios Xouris

Master of Science
Computer Science
School of Informatics
University of Edinburgh
2010

Abstract
In this dissertation we examine the efficiency of GPUs with a limited number of stream processors (up to 32), located in desktops and laptops, in the execution of algorithms such as hashing (MD5, SHA1), encryption (Salsa20) and compression (LZ78). For the implementation part, the OpenCL framework was used under OS X. The graphics cards tested were the NVIDIA GeForce 9400m and GeForce 9600m GT. We found an efficient block size for each algorithm that results in optimal GPU performance. The results show that encryption and hashing algorithms can be executed on these GPUs very efficiently and can replace or assist CPU computations. We achieved a throughput of 159 MB/s for Salsa20, 107.5 MB/s for MD5 and 123.5 MB/s for SHA1. Compression results showed a reduced compression ratio due to GPU memory limitations and reduced speed due to divergent code paths. The combined execution of encryption and compression on the GPU can improve execution times by reducing the latency caused by data transfers between the CPU and the GPU. In general, a GPU device with 32 stream processors can provide us with enough computation power to replace the CPU in the execution of data-parallel, computation-intensive algorithms.


Acknowledgements
I would like to thank my supervisor, Paul Anderson, for his invaluable help and guidance. I would also like to thank Dr. Zhang Le for his helpful remarks.

I would like to thank my family that always supports me in everything I do.

Finally, I would like to thank Stefania for being patient and supportive during this year.


Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Xouris Vasileios)


Table of Contents
Chapter 1  Introduction ...................................... 1
Chapter 2  GPU and OpenCL .................................... 3
  2.1  GPU architecture ...................................... 3
  2.2  Open Computing Language (OpenCL) ...................... 4
    2.2.1  Memory model ...................................... 5
    2.2.2  Memory access patterns ............................ 7
    2.2.3  OpenCL execution model ............................ 8
Chapter 3  Encryption on GPU ................................ 10
  3.1  Background ........................................... 10
  3.2  GPU advantages and disadvantages ..................... 12
  3.3  Relevant work ........................................ 13
  3.4  Implementation of Salsa20 & Results .................. 15
Chapter 4  Hashing on GPU ................................... 21
  4.1  Background ........................................... 21
  4.2  GPU advantages and disadvantages ..................... 23
  4.3  Relevant work ........................................ 23
  4.4  Implementation of MD5 and SHA1 & Results ............. 24
Chapter 5  Compression on GPU ............................... 29
  5.1  Background ........................................... 29
  5.2  GPU advantages and disadvantages ..................... 30
  5.3  Relevant Work ........................................ 32
  5.4  Implementation of LZ78 & Results ..................... 32
Chapter 6  Putting it all together .......................... 38
Chapter 7  Discussion ....................................... 41
  7.1  Project difficulties ................................. 42
  7.2  Future Work .......................................... 44
Chapter 8  Conclusion ....................................... 45
Bibliography ................................................ 46

Chapter 1 Introduction

During the last few years, there has been a lot of research focused on efficient implementations of well known algorithms optimized for execution on Graphics Processor Units (GPUs). GPUs offer an architecture that can take advantage of data parallelism very effectively. Until now, every published work related to these operations has used powerful GPUs with hundreds of stream processors. In this dissertation, we plan to research whether the GPUs located in desktops and laptops can be used to execute operations that involve heavy computations, such as hashing, encryption and compression. The motivation for this dissertation was recent research on a fast and secure backup system for Mac laptops [1]. We are planning to implement some specific algorithms for each field and see whether we can get a speedup over a CPU implementation, or whether we can get execution times that are acceptable and can assist the CPU where possible.

For the testing of our implementations, entry-level and mid-range GPUs with up to 32 streaming processors are going to be used, a number much smaller than that of high-end GPUs containing hundreds of streaming processors. Entry-level and mid-range GPUs can be found in the laptops and desktops that are used every day; GPUs of this kind are located in most laptops. A mid-range GPU device can have around 64 stream processors, a number that offers great computation power. Of course, there are also more specialized, high-end GPUs that offer a much bigger number of stream processors and enormous computation power. The framework that will be used for the programming of our implementations will be the OpenCL (Open Computing Language) framework [16]. The advantage of OpenCL is that it gives the programmer the ability to control all available processing units in a system, including CPUs and GPUs.

This dissertation is organized in several sections. We are going to start with a general background section on GPU architecture and a brief description of the OpenCL framework, its capabilities and restrictions. Then we will examine the operations mentioned above (hashing, encryption and compression) one by one. For each one of them, a brief background and relevant work on implementations for the GPU is given. We will also discuss how each one of them fits the GPU architecture and mention its advantages and disadvantages. An implementation and results section with the outputs of our research is also available for each case. After looking at each operation in isolation, we present some conclusions from the combined execution of encryption and compression on the GPU. In the final chapters there is a discussion where we describe the difficulties that we faced during our research and implementation, and we also propose some ideas for future work.

Chapter 2 GPU and OpenCL

2.1 GPU architecture

Graphics Processor Units (GPUs) are specialized processors originally implemented to render 3-dimensional graphics. They are designed to handle the complex computations of computer graphics fast and efficiently. Because of their nature, they can operate on vectors of data very fast. Until recently, the main problem of General Purpose Computing on Graphics Processor Units was that only floating point arithmetic computations were supported inside pixel shaders. Fortunately, with the introduction of the G80 architecture of NVIDIA, this was no longer a problem: integer data types and bitwise operations are now available [21]. With the introduction of frameworks such as OpenCL and CUDA, the development of GPU versions of general algorithms became easier, and programmers started to use GPUs in order to execute more general computation-intensive algorithms by taking advantage of data parallelism.

The main difference between a CPU core and a GPU is that a CPU is designed to execute a stream of instructions as fast as it can, while GPUs have the ability to execute the same stream of instructions over multiple data in parallel. GPUs have a parallel throughput architecture that allows many threads to be executed concurrently. GPUs contain a number of stream multiprocessors (SMs), and each SM contains 8 stream processor cores (SPs), 2 special function units, a multithreaded instruction unit, an instruction cache, a constant cache and a shared memory.

In figure 2.1.1 we can see why GPUs are so powerful. GPUs sacrifice sophisticated control flow in order to have a lot of stream processors on chip. Also, the size of their cache memories is much smaller, because GPUs hide memory latency by executing calculations while waiting for memory accesses instead of relying on large caches.

Figure 2.1.1 - CPU versus GPU design (source: [4])

2.2 Open Computing Language (OpenCL)

OpenCL was created originally by Apple Inc. and developed later by the Khronos Group. OpenCL is a framework that gives applications access to execution on the GPU device. In this section, we are going to discuss how OpenCL maps to the GPU architecture. In OpenCL, a data-parallel application (kernel) has to be written in a specific language similar to C99. To create parallelism, OpenCL divides the total amount of work into workgroups, and each workgroup is further divided into work items (threads). The total amount of work, called the N-D Range in the OpenCL world, is a collection of workgroups that will be executed in parallel. The distribution of workgroups among the available SMs takes place dynamically, by OpenCL itself. So workgroups are executed on SMs, and each work item is executed by an SP. Threads are further organized by the SM into groups of 32 that are called warps, and all threads within a warp are executed in parallel. Because of the SIMT (Single Instruction Multiple Thread) nature of the SPs, all threads within a warp must execute the same instruction in order to take full advantage of parallelism [4].

2.2.1 Memory model

There are several different memory spaces in the OpenCL architecture. The main and biggest memory of the GPU architecture is the global memory, which is off-chip and usually ranges from 128MB up to several GB for high-end GPUs. Global memory can be accessed by all work items of all workgroups. Accessing the global memory requires 400 to 600 cycles, and this is the reason why we must be careful when accessing it. GPUs handle memory latency by switching between workgroups: thousands of threads are ready for execution at any time, and every time a group of threads needs to read data from memory, or a warp is delayed for some reason, immediately another group takes its place. GPU threads are very lightweight; unlike CPUs, where the cost of switching between threads is hundreds of clock cycles, in the GPU there is no cost at all.

OpenCL's local memory (referred to as shared memory in the CUDA world) is on-chip, which makes it extremely fast. Accessing the shared memory usually takes 4 to 6 cycles, so it should be used to store data that are frequently used for computations and updates. The shared memory can be accessed by all work items of a workgroup, so it can also be used for communication between work items of the same workgroup. This feature is ideal in the case where data needs to be shared among work items. However, the size of shared memory is very small, usually 16KB.

A region of global memory is reserved in order to be used as constant read-only memory and is called constant memory. Another useful memory space is the constant cache, a read-only memory which is usually 64KB and is located on-chip. This memory space is used to speed up reads from the constant memory by caching frequently used data. When many threads within a workgroup try to read the same constant cache address space, it takes just one transaction; otherwise, reads of different addresses are serialized. A similar cache, the texture cache, is also available and is used in order to speed up reads of image objects.

Finally, the private memory (registers) is the fastest memory and is distributed privately among the work items of a workgroup by the SM. The total amount of register memory is limited for each multiprocessor, between 8192 and 16384 32-bit registers (32KB - 64KB), and is partitioned among the threads. In case more registers are needed by a workgroup, there will be a performance problem which is known as "register pressure". Registers are the best solution for storing small amounts of data that need to be used frequently [12]. A diagram of the locations of the different memory spaces described in this section can be found in figure 2.2.1.1.

Figure 2.2.1.1 - The different memory spaces of GPU (source: [4])

2.2.2 Memory access patterns

The way that a group of threads accesses the global GPU memory is very important. As mentioned before, each transaction with the global memory can take 400 to 600 cycles, so it is important to somehow group the memory transactions requested by different work items. Half warps (groups of 16 threads) that are executed in parallel can be programmed to read global memory in a coalesced way. GPU devices are capable of reading data of 4, 8 or 16 bytes in a single transaction. Coalescing can happen if all threads of the half warp access a different element in an aligned segment of global memory (4, 8 or 16-byte words), which can result in a single 64-byte transaction, a single 128-byte transaction or two 128-byte transactions. Another restriction is that data must be aligned to a multiple of the element size that we are reading; this means that data of type X must be stored at an address that is a multiple of sizeof(X) [4]. For NVIDIA GPU devices of compute capability 1.0 or 1.1, the access must also be further organized so that threads access the elements of the segment in order: the first thread of the half warp must access the first element of the segment, the second thread must access the second element, and so on. For GPU devices of compute capability 1.2 or higher this restriction does not apply, and threads within a half warp can access different addresses within a segment in any order and still result in a single transaction. At this point we should note that the graphics cards used for this dissertation (NVIDIA GeForce 9400m, NVIDIA GeForce 9600m GT) have a compute capability less than 1.2.

Accessing the shared memory requires a slightly different behavior in order to achieve high bandwidth. Shared memory is split into 16 memory banks, and in order to achieve a single transaction each thread of a half warp (a group of 16 threads) needs to access a different memory bank. In the case that two or more threads of a half warp request a transaction with the same memory bank, these accesses take place sequentially. Only in the case that all threads of a half warp request the same memory bank do we get a broadcast, which takes place in just one transaction.

2.2.3 OpenCL execution model

The OpenCL framework is responsible for the optimal execution of a data-parallel algorithm. The amount of work that needs to be executed is called the NDRange in the OpenCL language. The NDRange is a grid of thread blocks (workgroups), and each workgroup contains a number of work items (threads) which are executed in parallel. The OpenCL framework discovers how many SMs are available on the current GPU and assigns workgroups to all available SMs, where they execute in turns, so all algorithms can scale to a large number of SMs without problems. All workgroups are divided into warps for parallel execution; warps are threads with consecutive local and global Ids, and each SM has the ability to allow the parallel execution of a warp. The NDRange has to be large enough, because as its size gets bigger it becomes easier to hide memory latency. To keep track of the different workgroups and work items during execution, each workgroup has a unique group Id number and each work item has:

- a unique local Id that is used to separate the current work item from the other work items of the same workgroup
- a unique global Id that is used to separate the current work item from all other work items in the NDRange.

A representation of the NDRange appears in figure 2.2.3.1.

Figure 2.2.3.1 - Representation of the NDRange (grid) of OpenCL (source: [4])

Chapter 3 Encryption on GPU

Traditionally, GPUs were mostly used for algorithms with a lot of computations on floating point data structures. Until recently, there was no integer support on GPUs, which made encryption algorithms very bad candidates for execution on GPUs, due to the fact that these algorithms consist of complex operations on integer data types. In the last few years, since the appearance of General Purpose computing on Graphics Processor Units (GPGPU) and the introduction of the G80 architecture, this was no longer a problem, and encryption algorithms were ready to take a "crash test" on GPUs [21].

3.1 Background

Encryption algorithms are used when there is a need to transfer a message through an unsafe communication channel. There are two kinds of encryption: symmetric and asymmetric. In symmetric encryption, a secret key that both communicating sides possess is used for the encryption and decryption of the data. In asymmetric encryption, each user possesses a secret and a public key. If user A wants to transfer a message to user B, he uses B's public key to encrypt the message, and then user B can decrypt it with his own secret key, which is known only to him.

In general, encryption algorithms break the original message into blocks of equal size and process them through a function that applies some bitwise operations on them. This function is usually repeated several times (rounds) on each block of data. The output of the encryption process is an encrypted message, usually of the same size as the input, which is called the ciphertext. The problem here is that if each block is encrypted independently of the others with the same key, then the ciphertext of each block will always be the same, and this may be a serious security problem, since it can lead to replay attacks: someone might reuse the same encrypted message and claim to be someone he isn't, or request a valid operation using the same valid encrypted message. Malicious users can also collect a large number of blocks that were encrypted with the same key in order to find patterns that reveal information about the original message. For this reason, block ciphers have different modes of operation. A block cipher mode of operation has the responsibility of mixing each block's ciphertext with some kind of information in order to prevent replay attacks and keep the encrypted data consistent. For example, Cipher Block Chaining (CBC) XORs each block's plaintext with the ciphertext of the previous block (the first block is XOR'd with a nonce) and then starts the encryption process. The problem with CBC and similar modes of operation is that the original message must be processed sequentially.

Fortunately, there exists a mode of operation that allows us to take advantage of data parallelism in encryption algorithms, and it is called Counter Mode (CTR). CTR uses a nonce, which is an initialization variable that is different for each execution of the encryption algorithm and must be unique for each encryption process, and a counter. The counter is simply a variable that is guaranteed to be unique for a large number of blocks, so the most popular option is to use an actual counter that starts from 0 and increases by 1 for each block. CTR combines the nonce and the counter in some way (usually by using XOR) and then encrypts the result using the secret key. The nonce provides randomness in the output of the XOR operation with the counter and helps avoid replay attacks. The output of the encryption process (the keystream) is then XOR'd with the original message block, and the result of this operation is the ciphertext. So in this mode we are not actually encrypting the message; we add to it the noise that comes out of the encryption of the counter and nonce. For the decryption of the encrypted data, the key and the nonce must be known, so the information needed for parallel execution is the block number, the key, the nonce and the block of data. Every encryption algorithm that wants to operate in parallel on multiple blocks at the same time has to operate in CTR mode, and CTR is the mode that we will use for our implementation. A demonstration of CTR mode appears in figure 3.1.1.

Figure 3.1.1 - The CTR mode that can process blocks in parallel for encryption (source: [9])

3.2 GPU advantages and disadvantages

First of all, we need to present the advantages and disadvantages of GPU implementations of encryption algorithms when compared to CPU implementations. The main disadvantage of a GPU implementation is that keystream data need to be repeatedly transferred from the GPU device to the host device. The transfer operation is the bottleneck of many GPU algorithms because it is very costly. In order to have good results, we need to make sure that the communication bandwidth through the PCI Express bus between the two devices is big enough. The initialization latency of a transfer is usually small, and the general trend is that the transfer time grows linearly as the size of the data increases. Transferring data in very small amounts is inefficient because of this initialization latency, while moving data in very large amounts has no real benefit and is also not possible because of the limited memory on GPUs [15]. Fortunately, when the encryption algorithm is executed in CTR mode, the only things that we have to transfer from the host to the GPU are the secret key, the nonce and a counter offset. This is because in CTR mode we do not encrypt the original message but the combination of the nonce with the counter offset, so the time needed to transfer this kind of data is insignificant. We just need to transfer back to the host an encrypted sequence (keystream) for each block, which is then combined with the original text on the CPU, usually by using XOR.

The computation power of GPUs is by far better than that of CPUs, and this is their biggest advantage. For example, the NVIDIA GeForce 9400 graphics card, which is located in most Mac Minis, has 54 GFLOPS (Floating Point Operations per Second), which is an extremely big number. The Intel Core Duo processor, which is also located in Mac Minis with the GeForce 9400m, has 25 GFLOPS. GPUs are designed for fast, parallel operations on vectors of floating point data, and this is where they are really unbeatable.

Another advantage is that encryption algorithms are very straightforward algorithms. They do not contain branches, and this makes them ideal for execution on GPU devices. As mentioned in previous chapters, all threads that are executed in parallel within a GPU workgroup must execute the same instruction in order to take full advantage of the parallelism offered. Since encryption algorithms do not contain branches in the code, we can rest assured that at any given time all threads execute the same instruction, but on different data, and as a result no thread will need to wait for other threads to finish the execution of a different code path.

3.3 Relevant work

Several encryption algorithms have been tested on GPUs with various speedups during the last years. The results of these studies are very encouraging, and the GPU seems to be an ideal platform for the execution of encryption algorithms. Before the appearance of OpenCL and CUDA, the traditional OpenGL graphics pipeline was used to take advantage of GPU computation power. With the introduction of GPGPU-capable GPUs, the benefits of graphics operations could be used in more general operations, including integer support. Fortunately, with the introduction of these frameworks things became easier: now the GPU can be seen as a device similar to the CPU, and through frameworks such as OpenCL and CUDA developers can distribute the encryption process without the need to know very low-level graphics details. We will look briefly at some traditional graphics pipeline implementations, and at some CUDA/OpenCL implementations in more detail, because this is our approach for this dissertation.

Most implementations of encryption on the GPU choose the AES algorithm. In [10], the Advanced Encryption Standard (AES) is implemented and tested on a GeForce 7900 GT, which results in a 5-6x speedup over a CPU implementation running on an Intel Core 2 Duo (1.86 GHz). This implementation follows the traditional way of programming the graphics pipeline and uses the vertex and fragment processors for parallel computations. Data are passed as texture elements to each fragment processor for independent execution, and they are stored to the screen frame buffer or to other textures. The OpenGL API is used for these operations. The encryption rate achieved was 12 Mb/s. In [14], an AES encryption implementation is created by using the graphics pipeline and the Raster Operations Unit (ROP), which results in 108.86 Mb/s. Because of the lack of XOR support in fragment processors on hardware prior to DirectX10, the XOR operation in this case takes place in the ROP.

In [11], the Compute Unified Device Architecture (CUDA) of NVIDIA is used to create an implementation of AES-256 that gives a peak performance of 8.28 Gbit/s on a GeForce 8800 GTX, which contains 128 stream processors. They chose a block size of 1024 bytes, and each processed block is loaded into shared memory for further parallel processing. They identified the bottleneck of their implementation to be the transfer of data between the GPU and the host device, due to the limited bandwidth of PCI Express. A very similar AES implementation approach appears in [17], but this time the OpenCL framework is used. In this work a GeForce 8600 GT and an ATI Firestream 9270 (800 stream cores) are used. The results show a speedup by a factor of 11 over a sequential implementation on a Dual Core Intel E8500. Their implementation also appears to be faster when a large number of blocks is transferred to the GPU each time.

In this dissertation, Salsa20, an algorithm with a similar purpose to AES, is going to be implemented. Unfortunately there are no relevant academic papers on a GPU implementation of the Salsa20 algorithm, but the work presented in this section can be used as a starting point.

3.4 Implementation of Salsa20 & Results

For the purposes of this dissertation, we decided to use the Salsa20 encryption algorithm in CTR mode, optimized for execution on the GPU [5]. Salsa20 is a stream cipher developed by Daniel Bernstein. It consists of 20 rounds of mixing operations, and in order to operate it needs a 128 or 256-bit key, a 64-bit initialization vector and a 64-bit counter. Salsa20's basic operations are XOR, rotation and 32-bit addition. It has the ability to operate on different 64-byte blocks in parallel while running in the CTR mode that we described in the previous section.

The reason why we decided to implement the Salsa20 algorithm is that it seems to be faster than AES: Salsa20 requires 3.93 cycles/byte, while AES requires 9.2 cycles/byte at its best reported performance [27]. In fact, the 20 rounds of the Salsa20 algorithm are faster than the 14 rounds of AES. Also, Salsa20 is a stream cipher, which means that it has the ability to produce encrypted output of equal size to the input. AES, in contrast, is a block cipher, meaning that the input size must be a multiple of the block size; to satisfy this condition, we need to add padding to the last block on most occasions. This can be a problem, because the encrypted output will be slightly bigger than the input, which causes complications in systems where we need to process a large number of files (like backup systems). These properties make Salsa20 ideal for systems that require a high throughput, and this, in addition to the fact that it contains a lot of arithmetic and bitwise operations and no branches, makes it ideal for execution on the GPU.

The first step that we need to take is to split the Salsa20 code into two parts. The first part is the encryption process that creates a block keystream, and the second part is the actual mixing of the keystream with the original data (by using XOR). The keystream is independent of the original data, so we can calculate just the keystream on the GPU and then transfer it back to the host in order to XOR it there. In this way we can keep the original data in CPU memory and reduce the transferring time by transferring only the generated keystream back to the host device.

Because each work item needs to have knowledge of its counter number, the designed kernel, apart from the nonce and the secret key, also takes as parameters the following:

- A bytes offset, which contains information about how many bytes have been processed until now.
- A block size, which gives information about how many bytes each work item is responsible for, in order to create an appropriately sized keystream.
- The total number of bytes transferred to the GPU this time. This information is used in the case that the total work isn't divided exactly by the block size, so the last work item will need to produce a smaller sized keystream.

When a work item wants to calculate its block number, it first has to calculate its position among all work items of all workgroups; then, taking the block size into account and adding the bytes offset, it can calculate its block number. A demonstration of this method appears below:

    uint myGroupId = get_group_id(0);
    uint myLocalId = get_local_id(0);
    uint lsize = get_local_size(0);
    uint gsize = get_global_size(0);
    uint groupBlockSize = lsize * blocksize;

    /* first and last byte this work item is responsible for */
    uint from = myGroupId * groupBlockSize + myLocalId * blocksize;
    uint to = from + blocksize - 1;
    if (from >= totalbytes) return;
    if (to >= totalbytes) to = totalbytes - 1;

    /* counter value of this work item's first 64-byte keystream block */
    ulong myBlock = (bytesOffset / 64) + ((myGroupId * groupBlockSize) / 64)
                  + ((myLocalId * pr_block_size) / 64);

Figure 3.4.1 - The calculation of the counter offset ("myBlock") for each work item

The nonce, bytes offset and block size are passed in global memory and are used by all work items. All work items of a workgroup can read the nonce from the same memory address, which results in just one transaction. Based on previous work on other encryption algorithms, which found a relatively small optimal block size of 1024 bytes [11], we chose to process data through registers and not shared memory, for better performance. The keystream is generated in blocks of fixed size (64 bytes); each generated block is written to global memory, and the same private memory is then reused to generate the next keystream block. The results are written to the global memory in chunks of 16 bytes (128 bit).

A very important issue is that we need to define an optimal block size. By saying "block size" we mean the amount of data that is distributed to each work item; we need to try different block sizes and find the optimal one for this method. A large block size will not cause private memory problems, since the keystream is generated in blocks of fixed size (64 bytes). A large block size, however, will cause global memory problems because of the amount of data of the generated keystream. For example, let us suppose that our GPU device supports a total of X parallel threads and we decide on a block size of Y. This is a data size that can be executed in parallel. To hide latency, we need to pass to the GPU a multiple Z of this size. The product XYZ must be less than the total amount of supported GPU memory, or we will get a compiler error. Here we should note that the optimal block size also depends on the hardware, so it has to be decided at runtime, after acquiring the GPU device's information about the maximum number of work items within a workgroup. So for different GPUs this value may vary, but not significantly.

Finally, for the transfer of data between the host device and GPU global memory, pinned memory was used. Pinned memory can provide higher transfer rates between the two devices, which can reach 5GB/s on PCIe x16 Gen2 cards [12].

To test our implementation of Salsa20 we used two different graphics cards, the NVIDIA GeForce 9400m and the GeForce 9600m GT. We should note that these cards can be found in the laptops and desktops of normal users. The specifications of the GeForce 9400m and 9600m GT appear below:
The results from these two cards were compared to a single-threaded and a multithreaded implementation on an Intel Core 2 Duo at 2.26 GHz. As mentioned before, OpenCL can be used to handle parallel execution on heterogeneous devices, including CPUs, by distributing work to available cores. So, by using OpenCL for CPU execution we can take advantage of all available CPU cores and gain maximum performance on the CPU.

Model              Streaming Processors   Memory   Clock
GeForce 9400m      16                     256 MB   1100 MHz
GeForce 9600m GT   32                     256 MB   1250 MHz

Table 3.4.2 - Technical specifications of the graphic cards used for testing

In the next figure we present the resulting times of the encryption of a 200 MB file. Execution times were measured by using the average of 10 executions, and the times that appear include tests that used different block sizes. The block size refers to the amount of data given to each work item for processing. We tried different block sizes in the range of 64 bytes to 16 KB; bigger block sizes were not tested because of the limited GPU memory. In the next figure, and generally in all figures from now on, we will use the following abbreviations:

- GF 9600m GT - for the execution on the NVIDIA GeForce 9600m GT using OpenCL.
- GF 9400m - for the execution on the NVIDIA GeForce 9400m using OpenCL.
- CPU OpenCL - for the multi-threaded parallel execution on the CPU (Intel Core 2 Duo) by using the OpenCL framework.
- CPU 1-thread - for the sequential execution of the algorithm on CPU (Intel Core 2 Duo).

Figure 3.4.3 - Execution times of Salsa20 for all devices using different block sizes

The results we got are very interesting. We can see that the performance of both GPU devices is better than that of the single-threaded CPU implementation. The important point in this graph is that GPU performance is maximized for relatively small block sizes between 64 and 256 bytes, but it is also acceptable for sizes up to 2048 bytes. By using smaller block sizes we can take advantage of data parallelism between work items more easily and make sure that we are not losing performance due to memory latency. For very large block sizes the performance drops considerably. The main reason for this is that with large block sizes, each thread has to write more data to global memory and it is more difficult to hide memory latency. The CPU implementations don't seem to be much affected by the block size.

The execution times of the 32 streaming processors of the 9600m GT can compete with the times of the multithreaded CPU implementation; the 9600m GT's best execution time is very close to the respective multithreaded CPU time. The throughput is almost double for the 9600m GT when compared to the 9400m; this is because the 9600m GT contains double the number of streaming processors. Finally, the best throughput achieved by a GPU implementation was that of the 9600m GT, which was equal to 159 Mbytes/s. The respective value for the multithreaded CPU implementation was 180 Mbytes/s, for the single-threaded CPU implementation 49 Mbytes/s, and for the 9400m 93 Mbytes/s.

Device             Throughput (Mbytes/s)
GeForce 9400m      93
GeForce 9600m GT   159
CPU single thread  49
CPU OpenCL         180

Table 3.4.4 - Throughput measurements of execution of Salsa20 on different devices

In general, the results that we got show that GPUs with a small number of streaming processors can be used effectively in order to achieve a high throughput for the Salsa20 algorithm. The more stream processors we have available, the better the throughput we can achieve; the results suggest that with a number of stream processors greater than 32, we can achieve even better times on the GPU. Finally, the results cannot be compared to related work for two reasons: the first one is that there isn't published relevant work for the Salsa20 algorithm on GPU, and the second is that the related work in this field uses GPUs with huge computation power and hundreds of cores. For example, in [11] a throughput of 1035 MB/s is achieved for AES-256 using a GPU with 128 stream processors, while our best GPU used 32 stream processors for Salsa20 and achieved 159 MB/s. We used GPUs with up to 32 stream processors, so we cannot make a direct comparison.
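The throughput figures in the table are simply data size divided by wall-clock time; for instance, 200 MB in roughly 1.26 s gives the 159 Mbytes/s reported for the 9600m GT. A trivial helper (ours, for illustration only):

```c
/* Throughput in MB/s given megabytes processed and elapsed milliseconds. */
static double throughput_mbs(double megabytes, double ms) {
    return megabytes / (ms / 1000.0);
}
```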

Chapter 4

Hashing on GPU

4.1 Background

Hashing algorithms have the ability to create a fixed-sized data sequence from a variable-sized data sequence. In this section, we are going to deal with hashing algorithms that are used to compute a message digest (fingerprint) of data sequences. The main characteristics of hashing algorithms are that they can compute a fingerprint from a large data sequence pretty fast, but the reverse procedure is impossible, and it is also unlikely, with a very low probability, to get the same fingerprint from 2 different inputs. These algorithms can help us to identify if there was a transmission error or some other malfunction that resulted in the alteration of some of the original data. For example, the digest of a file can be generated at some point; when someone else wants to copy or download this file, he can check if the downloaded file has the same checksum as the original file. If not, then he knows that there was an error during transmission and he can try again.

Another important point that we need to mention is that this kind of algorithm is not parallelizable. The reason for this is that in order to compute the message digest of a file, we need to process all data of this file through the hashing algorithm sequentially. So we are not allowed to split the file into blocks and process them independently in parallel. The digest doesn't have to be generated for a whole file; we can instead create and check the digests of different blocks of data transmitted. This would only work if we kept the digest of each processed block of data, which could result in a lot of disk space occupied by checksums of different blocks of the same file instead of a single fixed-size digest for the whole file.

So how can we take advantage of the parallel nature of GPUs in order to compute digests of data faster? There are 2 main approaches. The first one is to give each GPU thread a different block of the same data stream in parallel and keep a digest for each of these blocks for later reference. Of course, this would need a lot of extra disk space to store all computed digests. The second one is to use many independent data streams and let the GPU process one block at a time from each data stream in parallel. At the next step, another block from each data stream is processed. In this way, all blocks that depend on each other will be processed sequentially, but at the same time we can take advantage of GPU parallelism. This approach, however, requires a large number of different data streams that can be processed in parallel and, in fact, this number must be much larger than the maximum number of concurrent threads within a GPU device to help hide memory latency.

In general, the high level structure of hashing algorithms has this form:

1. Initialize digest variables
2. Process next block of data stream (fixed size, usually 512 bits)
3. Apply the hashing function on this block (which modifies the digest variables)
4. If there are more blocks to process from the same stream, go to step 2
5. Output digest variables (fixed size)

It is easy to understand that we cannot parallelize this algorithm by processing different blocks, because each new block must use the modified variables of the previous block. So every block of data processed must take into account the output of previous blocks of the same data stream. Of course, this attribute of hashing algorithms is desirable, because files with the same content but in different order must generate different digests.
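The five steps above can be sketched with a toy chained "digest". The mixing function below is a simple polynomial rolling hash standing in for the MD5/SHA1 compression function (it is ours and in no way cryptographic); it shows that the state is threaded through every block, so blocks of one stream cannot be processed out of order, while separate streams remain fully independent.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a hash compression function: the digest state is
   chained through every byte, so block order matters (not a real hash). */
static uint32_t toy_absorb(uint32_t state, const uint8_t *block, size_t n) {
    for (size_t i = 0; i < n; i++)
        state = state * 31u + block[i];
    return state;
}

static uint32_t toy_digest(const uint8_t *data, size_t len, size_t blocksize) {
    uint32_t state = 17u;                          /* step 1: initialize   */
    for (size_t off = 0; off < len; off += blocksize) {
        size_t n = (len - off < blocksize) ? len - off : blocksize;
        state = toy_absorb(state, data + off, n);  /* steps 2-4: per block */
    }
    return state;                                  /* step 5: output       */
}
```

The result is independent of how the stream is cut into blocks, but not of the byte order; this is exactly the property that forbids hashing blocks of one stream in parallel.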

4.2 GPU advantages and disadvantages

For the purposes of this project, the MD5 and SHA1 algorithms [7][8] were chosen for testing on the GPU. The MD5 algorithm produces a 128-bit digest and the SHA1 algorithm a 160-bit digest. The structure of these algorithms is similar to the one described in the encryption section. In fact, MD5 and SHA1 do not contain branches, and are based on arithmetic and bitwise operations such as XOR, OR, AND, NOT, left bit rotation, right shifting and addition modulo 2^32.

The biggest disadvantage of hashing algorithms is their sequential nature, which does not allow us to operate on different blocks of the same data stream in parallel. We can, however, operate in parallel on different data streams; this is the approach that we discussed in the previous section. Another very important advantage of hashing algorithms is that the output of a large data sequence is very small (128 to 512 bits depending on the algorithm). This minimizes the time needed to transfer the results from the GPU device back to the host. We already know that data transfer to and from the GPU device can be a bottleneck, but in the case of hashing we do not have to worry a lot about moving data back to the host, because the output of each block has a small fixed size. Although we do not need to transfer back a lot of information, another disadvantage is the large number of blocks that we need to transfer to the GPU; the amount of data transferred to the GPU can still be a bottleneck.

4.3 Relevant work

A lot of the background work on GPU hashing in industry was focused on cracking digests. A digest cracker tries to find a data sequence that can result in a given digest when processed through a specific hashing function. The way to do this is to calculate the digests of many relatively small data sequences until the result matches the given digest. At this moment, many programs available on the Internet are able to use the available GPU devices of a system in order to crack MD5 and SHA1 password digests. The most well-known program available is the Lightning Hash Cracker by Elcomsoft, reaching a brute-force peak performance of 608 million passwords per second on a GeForce 9800 GX2 (2 x 128 stream processors) [19].

In academic literature, there are a limited number of published papers on MD5 or SHA1 hashing on the GPU; most academic works so far for algorithms such as MD5 and SHA1 have followed an FPGA-based approach. In [20], there is a detailed implementation of the MD5 algorithm on GPU which computes MD5 digests of small blocks of data of the same size in parallel. Each thread is assigned a 512-bit space of shared memory that it uses to store each processed chunk of data for further processing. The main limitation of this approach is that, due to the limited shared memory (16 KB), the implementation can only be tested for thread workgroups with fewer than 256 work items; a bigger number of work items would require a bigger shared memory size. Other implementations [18] use the constant memory, which can be fast because of the constant memory cache that is located on-chip. The results of this work show a peak performance of 1400 Mbps for a large input size using a NVIDIA GeForce 9800 GTX+ (128 stream processors). The SHA-1 algorithm has also been implemented on the GPU, achieving a rate of 2.5 GB/s on a NVIDIA GeForce 9800 GTX+ (128 stream processors). Again, the main bottleneck of the implementation appears to be the small bandwidth of PCI express compared to the computation power of the GPU device.

4.4 Implementation of MD5 and SHA1 & Results

For the MD5 algorithm, the "RSA Data Security, Inc. MD5 Message Digest Algorithm" [7] was used as a starting point. Some modifications were needed in the code in order to compile for execution on the GPU. These modifications included the removal of unsupported code and also a duplication of the "Md5Update" function so that it can support pointer parameters that refer to different address spaces (vector variables in registers and GPU global memory). The modified code was compiled as an OpenCL kernel. For the SHA1 algorithm, a simple implementation was used that can be found in [26]. For both algorithms a similar approach is used. The digest of MD5 is exactly 16 bytes (128 bits); the digest of SHA1 is 20 bytes (160 bits).

Data are passed to the GPU's global memory in large blocks. Then the hardware scheduler of the GPU creates workgroups according to the given parameters. Each work item of a workgroup can identify its position in a similar way as in the encryption implementation described in the previous chapter. We decided to use registers for the processing of our data; by using registers we are sure that we will have very small latency when reading our data. We knew from the beginning that this would force us to use small block sizes, but the parallel nature of the GPU can support this decision. Each thread reads its assigned block in little pieces that it processes sequentially. The size we chose for these pieces was 16 bytes. The reason for this is that by using this size we can use the built-in vector type of OpenCL, "char16", and we can achieve aligned access to global memory. The same vector type was used when storing the calculated digest back to the global memory. Again, for the transfer of data between the host device and GPU global memory, pinned memory was used, just like in the encryption implementation; pinned memory can provide higher bandwidth.

For the testing procedure, we used the same graphic cards and CPU as in the encryption chapter (GeForce 9400m, GeForce 9600m GT, Intel Core 2 Duo at 2.26 GHz); for specifications please refer to table 3.4.2. A large file of 200 MB was used to run the tests and to simulate a multiple data streams parallel operation. Please note that in both cases, the GPU and the CPU implementation, each block of data was treated as a separate data stream in order to simulate an environment with multiple independent data streams. In the next figures we present the resulting times of the MD5 and SHA1 hashing of a 200 MB file. The times that appear include tests that used different block sizes. The block size refers to the amount of data given to each work item for the calculation of an independent MD5 hash. The execution times were acquired using the average of 10 executions.
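The 16-byte piece size maps naturally onto one vector-wide read per step. The host-side sketch below is our own code, with a simple byte sum standing in for the real hash round; it walks a buffer the way each work item walks its block, one char16-sized piece at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Consume a buffer in aligned 16-byte pieces, mirroring one char16 read
   per step; the byte sum is a placeholder for the real hash round. */
static uint32_t process_in_pieces(const uint8_t *buf, size_t len) {
    uint32_t sum = 0;
    for (size_t off = 0; off + 16 <= len; off += 16)  /* one piece per read */
        for (int i = 0; i < 16; i++)
            sum += buf[off + i];
    return sum;
}
```

Only complete 16-byte pieces are consumed here; a real kernel would pad the final partial piece as the hashing algorithm requires.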

Figure 4.4.1 - Execution times of MD5 for all devices using different block sizes

In figure 4.4.1, the results of the MD5 algorithm are presented. Very small block sizes are not good for the GPU implementations. In this case, hiding latency is not very efficient because of the small number of computations that each work item has to do when compared to the amount of data that is read and written back to global memory. This is due to the fact that more and more work items require transactions with the global memory in order to read data. We can see that for small block sizes, the single-threaded CPU implementation appears to be faster than the 9400m GPU. As the block size grows, the 9400m GPU takes a significant lead over the single-threaded CPU implementation, and both GPU implementations are faster than the single-threaded CPU one. After the 4 KByte block size, there is no significant increase for the 9400m GPU. As a result of figure 4.4.1, we can say that an optimal block size for each work item in the MD5 GPU implementation is between 1024 and 4096 bytes.

The 9600m GT performance is almost 2 times better than that of the 9400m. The big difference in execution times between the 9400m and the 9600m GT comes from their difference in the number of stream processors (16 vs 32) and in their clock frequency. The CPU execution times appear to be almost independent of the block size: both the single-threaded and the OpenCL CPU implementations are not affected much by it. The multithreaded CPU implementation seems to be the fastest, but by using a more powerful GPU with more stream processors we can get a speedup. For example, a block size of 8192 bytes requires a write transaction of 128 bits every 4096 bytes, while a block size of 64 bytes requires the same transaction to be executed 128 times more often.

The difference between the hashing algorithm and the encryption algorithm that we discussed in the previous chapter is that in this implementation each work item also needs to read data from the global memory, and this appears to be the bottleneck here.

In table 4.4.2 the throughput achieved appears, measured in Mbytes/s. The maximum GPU throughput of 107.5 Mbytes/s was observed with the GeForce 9600m GT.

Device             Throughput (Mbytes/s)
GeForce 9400m      57.2
GeForce 9600m GT   107.5
CPU single thread  48.8
CPU OpenCL         190.2

Table 4.4.2 - Throughput measurements of execution of MD5 on different devices

To conclude the MD5 section, we can say with certainty that GPU devices with a small number of stream processors, available in most desktops and laptops, can be used for MD5 computations efficiently and also in co-operation with CPUs for maximum results. A number of at least 32 stream processors or more is desired in order to achieve a good performance.

In figure 4.4.3 and table 4.4.4, which can be found below, we present the results of the SHA1 implementation. We can see that the results are pretty similar to those of MD5; this is natural, since SHA1 is based on the principles of MD5, and the analysis of the results is also similar. The general trend is that as the block size grows the execution times improve, but after a block size of 512 bytes there is no significant improvement. Again the multithreaded CPU implementation seems to be the fastest, but execution times on GPU devices improve as the number of multiprocessors grows (16 vs 32 stream processors for the 9400m and 9600m GT respectively). The 32 stream processors of the 9600m GT seem to be enough to replace the CPU in the calculation of SHA1 digests. So a GPU device with 32 or more stream processors can really assist or replace the CPU in SHA1 hashing computations.

Figure 4.4.3 - Execution times of SHA1 for all devices using different block sizes

Device             Throughput (Mbytes/s)
GeForce 9400m      51.9
GeForce 9600m GT   123.5
CPU single thread  30.4
CPU OpenCL         155

Table 4.4.4 - Throughput measurements of execution of SHA1 on different devices

Chapter 5

Compression on GPU

5.1 Background

Compression is an essential operation. A lot of data are compressed every day in order to reduce their size and make them more suitable for transfer over the Internet. There are two different types of compression: lossy and lossless. Lossy compression refers to compression algorithms that try to reduce the size of a file at a cost to its quality. Lossy compression is used on photos, videos, sounds and, more generally, on files where the main characteristics are still recognizable when their quality is not so good. On the other hand, lossless compression refers to compression algorithms that can compress a file by reducing its size, but after decompression we are able to get back the file that was originally compressed. This kind of compression is mostly used on files such as text files, executable files etc. In this section, we are going to research a little further the prospects of lossless data compression on GPU.

There are many different compression algorithms that take advantage of the fact that data sequences contain large identical sub-sequences that we can encode with smaller representations. Dictionary-based algorithms are often used because of their simplicity, and simple algorithms operate better on the GPU. We are going to implement the dictionary-based Lempel-Ziv 78 (LZ78) algorithm [13] for execution on GPU, so this is a good place for a brief description of this algorithm. The LZ78 algorithm uses a dictionary that is updated while traversing the available data, and it also keeps a copy of the largest sequence found in the dictionary (called the prefix). Input is processed byte by byte.

Each time a new character is read, a search takes place to find out if the sequence {prefix + new character} is present in the dictionary. If it is present, we update the prefix with the new character and we keep reading characters, following the same procedure until a match in the dictionary cannot be found. At that point, we update the dictionary with a new entry that contains the sequence {prefix + new character}, we reset the prefix, and we output the sequence {position of the prefix in the dictionary + new character}. This is a compressed sequence. This procedure continues by constantly updating the dictionary with new sequences and by outputting references to it until there is no more input. The opposite operation, decompression, follows the same technique by constructing a similar dictionary and by following the references. The main idea behind compression algorithms is to find repeated sequences of characters in a file and replace them with a shorter representation depending on their frequency in the file.

5.2 GPU advantages and disadvantages

After getting a clearer understanding of how lossless compression algorithms work, we will present which of their characteristics prevent the full exploitation of the GPU's computation power and how we can deal with these problems. Here we should note that the problems of moving data to and from the GPU discussed in the sections about encryption and hashing also apply here.

As mentioned in previous chapters, the GPU likes to execute a lot of threads in parallel, meaning that these threads must operate on independent data. This also means that each thread cannot make use of information gathered by other threads unless there is some kind of synchronization between them, which would slow down the whole procedure. Compression is optimized when we can have a central dictionary structure that controls the execution of the algorithm and optimizes the compression ratio by keeping as much information as possible; synchronization restrictions, however, would make the algorithm even more complex. The only efficient way to implement a lossless compression algorithm on GPU is to sacrifice compression ratio in order to get the wanted parallelization. This can be done by compressing different blocks of data independently, treating them as different streams of data. Those data should then be moved back and forth between the GPU and the host device in order to feed the next blocks. This will reduce our compression ratio a little, but will speed up the whole procedure.

Compression algorithms contain a lot of branches in their code: a lot of "if" and "while" statements that force different threads to follow different paths of execution at times. As a result, different threads execute different instructions, which results in sequential execution of some parts of the code between threads. There is not much that we can do to avoid this in a GPU implementation, so this is an important disadvantage. Also, compression algorithms do not contain many arithmetic operations; compression is all about searching for patterns. So we cannot take full advantage of the computation power of GPUs.

Another important issue when dealing with GPUs is the limited memory supplied and the restrictions on memory allocation of current parallel programming frameworks for GPUs like OpenCL and CUDA. Compression algorithms need to allocate memory for a number of suboperations, and when dealing with compression and decompression, the amount of memory needed for the compressed/decompressed data is not always known in advance. Dynamic memory allocation is not supported in running kernels, so we need to know in advance information about the size of the current block. A way to overcome this problem is to make some conventions that will help us deal with it. For example, the compressed size of a block of data can have a maximum size equal to the original size plus some header information about the compressed block. To decompress a block, we will need to know in advance the size of the original block by reading the appropriate header information, so that we can easily allocate the memory required for decompression. A successful GPU implementation must supply enough pre-allocated memory to the (de)compression kernel in order to successfully (de)compress all blocks without running out of memory resources. The limited GPU global memory and the large number of concurrent threads that deal with different blocks is an important problem that needs to be solved.

All the problems discussed above, plus the complex nature of compression algorithms, must be taken into account. This requires a re-implementation of the compression algorithm in order to follow the GPU framework standards. The main structure of the algorithm needs to be optimized and modified in order to satisfy all GPU restrictions and to take advantage of all GPU benefits.
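The LZ78 procedure described in section 5.1, with a fixed per-work-item dictionary, can be sketched in a few lines of C. This is our own toy version, not the dissertation's kernel: it only counts the emitted {prefix index + character} tokens (a real implementation would also write them out), it uses a sequential dictionary search, and it simply stops adding entries once the dictionary is full, whereas the dissertation's version evicts the oldest entry instead.

```c
#include <stddef.h>

#define DICT_MAX 256            /* the 256-entry limit chosen later on */

typedef struct { int prefix; int ch; } Entry;   /* {prefix + character} */

/* Toy LZ78: returns the number of output tokens for `in`. Index 0 is the
   empty prefix; dictionary entries are referenced as 1..n. */
static size_t lz78_token_count(const unsigned char *in, size_t len) {
    Entry dict[DICT_MAX];
    int ndict = 0, prefix = 0;
    size_t tokens = 0;
    for (size_t i = 0; i < len; i++) {
        int c = in[i], found = 0;
        for (int j = 0; j < ndict; j++)      /* sequential search */
            if (dict[j].prefix == prefix && dict[j].ch == c) {
                prefix = j + 1;
                found = 1;
                break;
            }
        if (!found) {
            if (ndict < DICT_MAX) {          /* when full: stop adding (toy) */
                dict[ndict].prefix = prefix;
                dict[ndict].ch = c;
                ndict++;
            }
            tokens++;                        /* emit {prefix, c} */
            prefix = 0;
        }
    }
    if (prefix != 0)
        tokens++;                            /* flush trailing prefix */
    return tokens;
}
```

On "abababab" this emits 5 tokens for 8 input bytes; longer repetitive inputs compress further as dictionary entries accumulate longer sequences.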

5.3 Relevant Work

There are no relevant academic papers on lossless data compression on GPU. In contrast, there are a lot of research papers on lossy compression, and especially on lossy image compression algorithms on GPU, because GPUs are optimized for handling image files. The fact that there are no relevant academic papers can be explained by the nature of lossless compression algorithms: as described in the previous section, these algorithms do not fit well on the GPU architecture. Nevertheless, there is relevant work on algorithms for parallel block compression in general. In [22] a parallel block compression approach is used in order to achieve a speedup for dictionary-based compression algorithms. Because the parallel processing of blocks with independent dictionaries may result in a reduced compression ratio, a joint dictionary construction is proposed where different compression processes reference a shared dictionary. A very famous block compression program is bzip2 [23], which uses a combination of some famous compression algorithms including the Burrows-Wheeler transform [24] and the Huffman coding algorithm [25]. bzip2 works on blocks and compresses each block independently, which is the method that we will use in the implementation part. The problem is that it operates on large blocks, usually between 100 and 900 Kbytes, which makes it a bad candidate for GPUs due to limited memory.

5.4 Implementation of LZ78 & Results

For the compression algorithm, the LZ78 algorithm was chosen. Before this choice, many other zip libraries were examined, such as bzip, gzip and others, but these libraries were too complicated for the GPU architecture: too large code with a lot of branches and heavy memory operations. This is the reason why we decided to implement an LZ78 version that can fit well on the GPU and then test it in practice.

From the beginning, it was clear that our bottleneck would be the transferring process to and from the GPU. Our main concern was to find ways to speed up the compression process as much as possible. For this reason, we have to choose a relatively large global size of data to be compressed each time, with respect to the total available memory of the GPU device. Due to the large number of threads that the GPU platform needs, we need to make sure that the full bandwidth is used. Of course, we must keep in mind that these parameters depend on our hardware and the PCI express bandwidth, and may differ on different systems. We must note that this implementation was created for the GPU architecture; other CPU implementations can be a lot faster than this because of their large memory and freedom of memory allocations. For the purposes of this dissertation, we decided to create an implementation that can fit the GPU architecture and test it on several devices.

The main idea for the compression on GPU is to split the data and process blocks in parallel that will be compressed independently. We can follow 2 approaches here: either give a block of data to a workgroup, or give a block of data to each work item. The first approach can lead to a better compression ratio, but it needs some kind of synchronization between work threads. The idea is to create a shared dictionary for each workgroup that all work items within it will be able to update and to reference. The problem with this approach is that synchronization will lead to delays and will reduce the efficient use of parallelism. This approach will not be implemented, but can be considered as potential future work of this dissertation. The second approach, assigning an independent small block to each work item, seems faster but will result in a reduced compression ratio. For the implementation part, we will use this approach.

Another issue is the dictionary size of each work item. Dictionary-based compression algorithms achieve better results with more entries: the bigger the dictionary size, the better the compression ratio we will achieve. LZ78 uses a dynamic dictionary that is created during the compression process, but because of memory issues on the GPU we need to set a limit to its size. Each work item must have a small dictionary if we want to guarantee that we are not going to have memory problems, so the dictionary size has to be small. When the dictionary is full and we want to add a new entry, we do so by replacing the oldest entry of the dictionary with the new sequence.
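The "replace the oldest entry" policy amounts to a FIFO write cursor over the fixed-size table. Below is a sketch with our own names; `slot_after` is only a demo harness that performs n insertions and reports the last slot written.

```c
#define DICT_CAP 256   /* small enough for an 8-bit dictionary reference */

typedef struct { int prefix; int ch; } DictEntry;
typedef struct { DictEntry e[DICT_CAP]; int count; int oldest; } Dict;

/* Add a {prefix + char} entry; once the table is full, overwrite the
   oldest entry in FIFO order. Returns the slot written. */
static int dict_add(Dict *d, int prefix, int ch) {
    int slot;
    if (d->count < DICT_CAP) {
        slot = d->count++;
    } else {
        slot = d->oldest;
        d->oldest = (d->oldest + 1) % DICT_CAP;
    }
    d->e[slot].prefix = prefix;
    d->e[slot].ch = ch;
    return slot;
}

/* Demo helper: perform n insertions and return the slot of the last one. */
static int slot_after(int n) {
    Dict d = {{{0, 0}}, 0, 0};
    int s = 0;
    for (int i = 0; i < n; i++)
        s = dict_add(&d, 0, i);
    return s;
}
```

After the 256th insertion the cursor wraps to slot 0, evicting the oldest sequence first.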

For the current implementation, we chose to use a small dictionary size of 256 entries for a number of reasons:

1. The first and most important reason is the limited GPU memory.
2. The second reason is that we need a small number of bits to represent a reference in the dictionary; a 256-entry dictionary can be referenced with 8 bits.
3. Another reason is that our implementation uses a sequential search to find a match in the dictionary, so a large dictionary size would result in more search time.

Instead of using registers to store the dictionary, we can also use the shared workgroup memory that has a bigger capacity, usually up to 16 KB. Shared memory, unlike global memory, can serve multiple transactions by different work items in parallel, and can be as fast as accessing registers when there are no memory bank conflicts between threads asking for a transaction. For our implementation, however, we chose to bypass the shared memory and copy small chunks of data each time into registers for faster execution.

As we said before, the OpenCL framework does not support dynamic memory allocation, and this may be a problem in compression/decompression functions because we cannot be sure of the compressed and decompressed size. To bypass memory allocation issues we will make some conventions. When each work item completes the compression of a block of data, it also needs to save the compressed data size. We will make a convention that if the compressed data results in a bigger size than the input, then the input will be stored unchanged. In this way, at the time of decompression the decompression function will know that the next compressed data size read will result in an unzipped sequence of a fixed block size. So we can pre-allocate the memory needed. For the encryption part, buffers of the same size as the input data size were preallocated to store the encrypted data.
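The store-unchanged convention reduces to a size comparison plus a small per-block header. The layout below is our own illustration; the dissertation does not specify field names. The fixed original block size is what lets the decompressor preallocate its output.

```c
#include <stddef.h>

/* Per-block header for the convention above (field layout is ours).
   orig_size is the fixed block size, so the decompression output buffer
   can always be preallocated; stored == 1 means the block expanded and
   was kept verbatim. */
typedef struct {
    size_t orig_size;
    size_t comp_size;   /* bytes of payload that follow the header */
    int    stored;
} BlockHeader;

static BlockHeader make_header(size_t orig_size, size_t comp_size) {
    BlockHeader h;
    h.stored = comp_size >= orig_size;  /* expansion: keep input as-is */
    h.orig_size = orig_size;
    h.comp_size = h.stored ? orig_size : comp_size;
    return h;
}
```

With this convention the worst-case output of a block is its input size plus one header, which is exactly the buffer size a kernel can preallocate.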

For the testing procedure, we used the same graphic cards and CPU as in the encryption chapter (GeForce 9400m, GeForce 9600m GT, Intel Core 2 Duo at 2.26 GHz); for specifications please refer to table 3.4.2. In this section we present the resulting times of compressing a 9.3 MB file by using our LZ78 implementation. The times that appear include tests that used different block sizes. The block size refers to the amount of data given to each work item for the compression of a different block of data. For all tests, a dictionary size of 256 entries was used.

Figure 5.4.1 - Execution times of LZ78 for all devices using different block sizes

In figure 5.4.1, we can see the results of the implemented LZ78 algorithm. The execution time is reduced on the GPUs when using a relatively small block size between 128 and 1024 bytes. This is because small block sizes don't contain a lot of information to take full advantage of the dictionary; as a result there are fewer replacements, and this causes fewer threads to follow different paths. As the block size grows, more and more threads follow different paths. The 9400m performance is always slower than the sequential CPU implementation. The 9600m GT execution seems improved: execution times are reduced by 50% when compared to those of the 9400m; again, this can be explained by their difference in stream processors (16 vs 32). The performance of the 9600m GT is always better than the sequential CPU implementation, but for large block sizes the performance drops.

3 MB file used after parallel block compression with different block sizes. nearly unaffected. the LZ78 algorithm performs better when running as a multithreaded CPU program (OpenCL CPU). 512 bytes cannot take full advantage of it because each entry can contain several bytes. The chosen dictionary size was 256 entries so data of 64.5 8 7.5 7 6.sequential CPU implementation but for large block sizes the performance drops. Before making any assumptions. Fewer dictionary entries mean fewer possible compressed sequences. The next figure presents the compressed size achieved for the 9. 256. 9 Compressed Size (MB) 8. 128. the compressed size is too large.5 6 64 512 Block Size (bytes) 4096 Compressed size Figure 5. This is because small block sizes don’t give the capability to the algorithm to fill all available positions of the dictionary. In general.4. we have to look how these block sizes behave when it comes to the compression ratio achieved. That is why we see an improvement after a block 36 .Compressed size achieved with different block sizes by using our specific LZ78 implementation with a small fixed sized dictionary We can see that for very small block sizes.2 .

an optimal block size for each work item would be between 512 and 1024 bytes because these sizes give good execution times and a relatively good compression ratio. 37 . the results show that GPU memory limitations can be very harmful for the resulting compressed size. Also. As a result from figures 5. the nature of compression algorithms doesn’t allow the exploitation of GPU computation power. GPUs are not yet ready for this task.4.1 and 5.2 we can state that for the specific parameters we selected.size 512 bytes.4. A CPU implementation with an infinite (or very large) dictionary size would give much improved compressed sizes. To conclude.

1. We could also say that a compressed stream of data results in a reduced amount of data for encryption.Chapter 6 Putting it all together In this chapter. In figure 6. Hashing algorithms. compress them. 38 .1b it is clear that by combining encryption and compression we can reduce the total time needed to move data between the two devices. are strictly sequential and have to operate on each block in order. on the other hand. Heavy transfers through the PCI express are reduced from 3 to 2. then encrypt them and finally transfer them back to the host. The idea is to move blocks on the GPU. we can reduce the time required to transfer data from the host device to the GPU and back when compared to executing encryption and compression independently (figure 6. we will examine how we can combine some algorithms discussed earlier for execution on the GPU in order to process a single stream of data more efficiently. so buffers must be allocated and data need to be transferred back for the worst possible scenario. Unfortunately the exact size of compressed data cannot be known in advance. the red arrows represent operations that need recurrent transfer of data and make heavy usage of the PCI express bandwidth.1a and 6. So from figures 6.1). We already know that a stream of data can be divided in little blocks for parallel encryption and compression. So in this section we will discuss how compression and encryption can be combined on the GPU in order to get the maximum performance. By combining these two operations on the GPU. Combining a sequential algorithm with parallel algorithms is not optimal on GPU. green arrows indicate operations that happen immediately. On the other hand.

larger block sizes result in an improved compression ratio but if we get into account the incapability of GPUs to supply enough memory we soon realize that we cannot use very large block sizes. (b) Combined execution Our goal at this moment is to decide on an efficient block size that will fit both to encryption and compression. An efficient block size for both encryption and compression seems to be between 1024 and 2048 bytes. and each one will be assigned to a block of data. the results from the compression indicate that block sizes smaller than 1024 bytes suffer from reduced compression ratio. Each work item is responsible for a block of data equal to the chosen block size. On the other hand. small block sizes up to 2048 bytes have the best performance. This information is also needed for the decryption/decompression operation.Figure 6. After this.(a) Encryption and compression executed separately.1 . The size information is needed because the host needs to know how many bytes the output size of each block was in order to recover it. The procedure of this combined operation appears in figure 6. We need to have a very large number of threads on the fly. According to the results that we presented in the encryption chapter. 39 .2. It compresses the block and then it encrypts the output of compression. So the limited GPU memory available prevents us from satisfying both conditions. In general. it stores the final output size of the compressed block and the compressed/encrypted block (C/E) in the appropriate place in the global memory.

Figure 6.2 - Each work item (Wn) compresses a block and then encrypts the compressed output.
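The per-work-item data flow of figure 6.2 can be sketched in host-side C as follows. This is only an illustration of the layout, not our actual kernels: a toy run-length encoder and a XOR keystream stand in for LZ78 and Salsa20, and the slot layout (one size byte followed by the C/E data, preallocated for the worst case) is an assumption of this sketch.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK 16             /* chosen block size (bytes)            */
#define SLOT  (BLOCK + 1)    /* worst case: raw block + 1 size byte  */

/* Toy RLE stand-in for compression: (count, byte) pairs.
 * Returns n when there is no gain, so the caller stores raw data. */
static size_t toy_compress(const uint8_t *in, size_t n, uint8_t *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255) run++;
        if (o + 2 > n) return n;   /* would not shrink: keep raw */
        out[o++] = (uint8_t)run;
        out[o++] = in[i];
        i += run;
    }
    return o;
}

/* Toy keystream "encryption" (stand-in for Salsa20). */
static void toy_encrypt(uint8_t *buf, size_t n, uint8_t key) {
    for (size_t i = 0; i < n; i++) buf[i] ^= (uint8_t)(key + i);
}

/* One "work item": compress its block, encrypt the result, then store
 * the final size plus the C/E data at its fixed slot in out[], which
 * plays the role of preallocated global memory. */
void work_item(const uint8_t *input, uint8_t *out, size_t wid, uint8_t key) {
    uint8_t tmp[BLOCK];
    const uint8_t *block = input + wid * BLOCK;
    size_t csize = toy_compress(block, BLOCK, tmp);
    if (csize == BLOCK)            /* no shrink: store the block raw */
        for (size_t i = 0; i < BLOCK; i++) tmp[i] = block[i];
    toy_encrypt(tmp, csize, key);
    uint8_t *slot = out + wid * SLOT;
    slot[0] = (uint8_t)csize;      /* host reads this to recover data */
    for (size_t i = 0; i < csize; i++) slot[1 + i] = tmp[i];
}
```

Because every slot is sized for the worst case, the host can launch one transfer for the whole output buffer and then walk the size bytes to extract each variable-length C/E block.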

Chapter 7

Discussion

In previous sections we examined algorithms and developed GPU implementations for them. We did research on three different categories of algorithms: hashing, encryption and compression. In this section we will discuss the results in detail and make a critical evaluation.

The results of hashing and encryption are very straightforward: the GPU implementations are much more effective than a single-threaded CPU version, and the results also show that more powerful GPUs can easily overtake a multithreaded CPU implementation. Mid-range GPUs can also be very efficient in these tasks and assist or replace CPUs. For the hashing and encryption part, all available algorithms are very similar, so the results can be generalized somewhat rather than being specific to Salsa20 or to the MD5 and SHA1 algorithms. The results of MD5 and SHA1 gave us peak performance for block sizes of 1024 or 2048 bytes, with acceptable performance in the range of 512 to 4096 bytes. For Salsa20, we achieved acceptable results for small block sizes between 64 and 2048 bytes; in fact, block sizes of 64 and 128 bytes seem to be optimal for our implementation. We also examined how encryption and compression can be executed on the GPU with just one call by determining an optimal block size for both of them.

On the other hand, the compression implementation was the trickiest. The reason is that there exist many compression algorithms, each based on a different approach; this results in very different algorithms which may or may not fit the GPU. The algorithms used were of different natures, and some of them (i.e. the compression part) had to be re-implemented from scratch in order to fit the GPU architecture. We tried to choose an algorithm that was relatively simple and could be parallelized easily, at the cost of sacrificing speed and compression ratio.

The results for the compression part are not very encouraging. Some characteristics of compression algorithms, such as the compression ratio, are reduced, as we explained in the relevant section. Bigger block sizes resulted in an improved compression ratio but had an impact on speed.

In general, taking into account all the results of this dissertation, we can state that the performance of each algorithm improves by nearly 50% when the number of available stream processors is doubled from 16 to 32. According to the results, efficient systems need to use GPU devices with 32 or more stream processors.

At this point, we would also like to discuss the results in light of our primary motivation, which was the use of GPU computation power to assist the CPU in the operations required by a backup system (hashing, encryption, compression). Hashing and encryption results were very promising on the GPU, but compression had many problems, including speed and compression ratio. A large block of data can be divided into multiple sub-blocks which can then be encrypted and compressed independently; however, this cannot happen for the calculation of a digest using a hashing algorithm. Finally, the combined execution of the encryption and compression operations improves performance; the block sizes that resulted in efficient performance were 1024 and 2048 bytes. This is a natural result, because every block of data stays on the GPU for a longer time and is used for more computations, so the ratio of computations to amount of data is increased; this is the whole point of parallelism on the GPU: use less data for more computations in parallel. It would be good if the hashing part could be combined with the other two operations for execution on the GPU, but as mentioned before, its sequential nature prevents this. So an efficient backup system could use the CPU to compress files and then send them to the GPU for the encryption and hashing part, in a pipelined way.

7.1 Project difficulties

During the implementation phase of this project we ran into a number of difficulties. In this section we are going to discuss those that appeared to be the most important.

For the purposes of this project we had to implement a number of algorithms of different kinds. We found existing implementations of these algorithms and tried to modify them to fit the GPU architecture. The problem was that they were designed for optimal execution on the CPU, and the GPU compiler does not support the entire C language. Compression algorithms were the most difficult to modify to fit the GPU, because of their complex memory operations and their size. For this reason, a simple implementation of the LZ78 compression algorithm was created from scratch.

As in most parallel and distributed systems, debugging many instances executing in parallel was difficult. We had to coordinate the execution of hundreds of threads, which was hard at the beginning, but only until we had our first algorithm running; the same method of coordination and debugging was then used for all algorithms. The debugging process also proved much more difficult than we expected. GPU devices do not currently support output functions such as printf, so checking the contents of variables at runtime was not easy. We had to create an extra buffer in GPU global memory where we stored any information that we needed for the debugging process, and then examined those values by transferring them back and printing them on the host device. The problem with this approach is that when a bug forced the kernel to crash, we could not reach the point where the data was sent back to the host for examination. In such cases, we had to execute small parts of the kernel at a time until we reached the point of the problem.

Another difficulty was that the GPU has many different address spaces (described in previous chapters), so for optimal execution we had to transfer data between these address spaces. Furthermore, functions related to memory operations, such as memcpy, are not supported in GPU kernels, so whenever there was a need for copying memory we had to do it manually, replacing these functions with explicit loops.
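The debug-buffer technique can be emulated on the CPU as in the sketch below (all names are illustrative; the real version writes to an OpenCL __global buffer that the host reads back with clEnqueueReadBuffer). Each work item owns a fixed slice of the buffer and writes checkpoint values into it; after the kernel finishes, or crashes, the host inspects the slices to see how far each item got.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define DBG_SLOTS_PER_ITEM 4   /* checkpoints per work item (our choice) */

/* Stand-in for the kernel body of one work item: it records the value
 * it loaded, the value after transformation, and a "reached the end"
 * flag into its own slice of the debug buffer. */
static void kernel_body(size_t gid, const int *input, int *output,
                        int *debug) {
    int *dbg = debug + gid * DBG_SLOTS_PER_ITEM;
    int x = input[gid];
    dbg[0] = x;                /* checkpoint 1: value as loaded      */
    x = x * 2 + 1;
    dbg[1] = x;                /* checkpoint 2: after transformation */
    output[gid] = x;
    dbg[2] = 1;                /* checkpoint 3: kernel body finished */
}

/* Host side: zero the debug buffer, run all work items, then inspect
 * the checkpoints. If dbg[2] is 0 for some item, that item stopped
 * before finishing, and its last written checkpoint localizes the
 * fault. Returns the first unfinished item, or -1 if all completed. */
int run_and_check(const int *input, int *output, int *debug, size_t n) {
    memset(debug, 0, n * DBG_SLOTS_PER_ITEM * sizeof(int));
    for (size_t gid = 0; gid < n; gid++)
        kernel_body(gid, input, output, debug);
    for (size_t gid = 0; gid < n; gid++)
        if (debug[gid * DBG_SLOTS_PER_ITEM + 2] != 1)
            return (int)gid;
    return -1;
}
```

On the real GPU, the loop over gid is replaced by the parallel launch, and the host transfers the debug buffer back once after the kernel returns.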

7.2 Future Work

The subject of this dissertation included many different areas of study, such as hashing, encryption and compression. Because of the limited time given for this project, we did not have enough time to go very deep into each of them, but we think that this research can be used as a starting point for future implementations. We did our best to create algorithms that execute efficiently on the GPU, but there is always room for improvement.

During our study of the behavior of such algorithms on the GPU, we realized that there was very limited information and academic reference on data compression on the GPU. The proposed approach for the LZ78 algorithm with a shared/synchronized dictionary between work items of the same workgroup can be examined as future work of this dissertation; the problem of limited GPU memory and the lack of dynamic memory allocation prevented us from following this approach here. Other, more efficient techniques for searching the dictionary, such as hash tables, could also be tried instead of a sequential search. Research into how to create an efficient fixed-size hash table for the GPU platform would be very helpful for the LZ78 algorithm and would speed up the process by a large factor.

The GPUs that we used for our testing (NVIDIA GeForce 9400m, NVIDIA GeForce 9600m GT) were entry-level and mid-range GPUs, but they served well the purpose of this dissertation, which was to examine whether laptop and desktop GPUs can be used to speed up these operations. As future work, these algorithms could be tested on more powerful, high-end GPUs. Another possible extension of this dissertation is to research in detail the different ways in which the CPU and the GPU can cooperate to achieve maximum performance for hashing, encryption and compression in a pipelined fashion: how can these operations be synchronized, and what speedups can be achieved over a pure CPU implementation?
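As a sketch of the hash-table idea, the dictionary search for a (prefix, byte) pair could use a fixed-capacity open-addressing table, which needs no dynamic allocation and so fits the GPU restrictions discussed above. The layout and names below are our own illustration, not an implemented design.

```c
#include <stdint.h>
#include <string.h>

#define HT_CAP 512   /* fixed capacity, power of two, > 2x max entries */

/* Open-addressing table mapping (prefix index, byte) -> dictionary
 * index, replacing the O(n) sequential scan with an O(1) expected-
 * time probe. Fixed arrays only: no dynamic allocation needed. */
typedef struct {
    uint16_t key[HT_CAP];    /* packed (prefix << 8) | ch */
    uint8_t  val[HT_CAP];    /* dictionary index (1..255) */
    uint8_t  used[HT_CAP];
} DictHash;

static uint32_t ht_hash(uint16_t k) {
    return (k * 2654435761u) & (HT_CAP - 1);   /* Knuth-style mix */
}

void ht_init(DictHash *h) { memset(h->used, 0, sizeof h->used); }

/* Linear probing; the table never fills because a 256-entry
 * dictionary inserts at most 255 pairs into 512 slots. Each pair
 * is inserted once, when its dictionary entry is created. */
void ht_put(DictHash *h, uint8_t prefix, uint8_t ch, uint8_t idx) {
    uint16_t k = (uint16_t)((prefix << 8) | ch);
    uint32_t i = ht_hash(k);
    while (h->used[i]) i = (i + 1) & (HT_CAP - 1);
    h->key[i] = k;
    h->val[i] = idx;
    h->used[i] = 1;
}

/* Returns the dictionary index for (prefix, ch), or 0 if absent. */
uint8_t ht_get(const DictHash *h, uint8_t prefix, uint8_t ch) {
    uint16_t k = (uint16_t)((prefix << 8) | ch);
    uint32_t i = ht_hash(k);
    while (h->used[i]) {
        if (h->key[i] == k) return h->val[i];
        i = (i + 1) & (HT_CAP - 1);
    }
    return 0;
}
```

Keeping the capacity well above the maximum number of entries keeps probe chains short, which matters on the GPU where divergent loop lengths between threads cost performance.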

Chapter 8

Conclusion

The computation power of GPU devices grows year by year. As this power grows, more and more computationally intensive fields start to use it in order to achieve greater speedups. Encryption and hashing algorithms have already been tried on the GPU architecture and have shown great speedups. Most of these speedups, however, were achieved using expensive high-end GPUs with a very large number of stream processors and high clock frequencies. There is too much unexploited computation power at this moment in most users' desktop and laptop GPUs, and as our results show, this power can be used to maximize the performance of many algorithms.

In this dissertation, we showed that even entry-level and mid-range GPUs can be used effectively for encryption and hashing. The results that we got from the Salsa20 and MD5 algorithms are very encouraging. In general, we can say that GPUs with 32 or more stream processors can be used as a powerful computation device in any algorithm that involves intensive computations. Unfortunately, there are fields, such as compression, that are not yet ready to take full advantage of GPU devices. In previous chapters we referred many times to the limited GPU memory; compression algorithms need to be implemented with many restrictions in mind in order to run on GPU devices, and these restrictions have a cost in speed and compression ratio. We believe that in a few years this will no longer be a problem and GPUs will have bigger and faster memories. We believe that, in the near future, GPUs will be an essential computation device in every user's computer, either assisting CPUs in computation-intensive problems or even replacing them.

Bibliography

[1] P. Anderson and L. Zhang, "Fast and Secure Laptop Backups with Encrypted De-duplication", under publication in the 24th Large Installation System Administration Conference (LISA 2010), San Jose, CA, November 7-12, 2010.
[2] Intel, "Intel microprocessor export compliance metrics", http://www.intel.com/support/processors/sb/cs-023143.htm
[3] GPU Gems 2, "Chapter 32. Taking the Plunge into GPU Computing", NVIDIA Corporation, http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter32.html
[4] "OpenCL Programming Guide for the CUDA Architecture, Version 3.1", NVIDIA Corporation.
[5] D. Bernstein, "The Salsa20 Family of Stream Ciphers", New Stream Cipher Designs: The eSTREAM Finalists, Springer-Verlag, pp. 84-97, 2008.
[6] T. Xie and D. Feng, "How to Find Weak Input Differences for MD5 Collision Attacks", Cryptology ePrint Archive, Report 2009/223, 2009.
[7] R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321, MIT and RSA Data Security, Inc., 1992.
[8] D. Eastlake and P. Jones, "US Secure Hash Algorithm 1 (SHA1)", RFC 3174, Motorola and Cisco Systems, 2001.
[9] Wikipedia, "Block cipher modes of operation", http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation
[10] N. Pilkington and B. Irwin, "A Canonical Implementation Of The Advanced Encryption Standard On The Graphics Processing Unit", in the Innovative Minds Conference, Johannesburg, South Africa, 7-9 July 2008.

[11] S. Manavski, "CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography", Signal Processing and Communications (ICSPC 2007), IEEE International Conference on, pp. 65-68, 2007.
[12] "NVIDIA OpenCL Best Practices Guide, Version 1.0", NVIDIA Corporation, 2009.
[13] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding", IEEE Transactions on Information Theory, vol. 24, pp. 530-536, 1978.
[14] O. Harrison and J. Waldron, "AES Encryption Implementation and Analysis on Commodity Graphics Processing Units", Proceedings of the 9th International Workshop on Cryptographic Hardware and Embedded Systems, Vienna, Austria: Springer-Verlag, pp. 209-226, 2007.
[15] Accelereyes, "GPU Memory Transfer", http://www.accelereyes.com/wiki/index.php/GPU_Memory_Transfer/
[16] OpenCL - The open standard for parallel programming of heterogeneous systems, Khronos Group, http://www.khronos.org/opencl/
[17] O. Gervasi, D. Russo, and F. Vella, "The AES Implantation Based on OpenCL for Multi/many Core Architecture", Computational Science and Its Applications (ICCSA), 2010 International Conference on, pp. 129-134, 2010.
[18] Lin Zhou and Wenbao Han, "A Brief Implementation Analysis of SHA-1 on FPGAs, GPUs and Cell Processors", Engineering Computation (ICEC '09), International Conference on, pp. 101-104, 2009.
[19] Lightning Hash Cracker, ElcomSoft Co. Ltd., http://www.elcomsoft.com/lhc.html
[20] Guang Hu, Jianhua Ma, and Benxiong Huang, "High Throughput Implementation of MD5 Algorithm on GPU", Ubiquitous Information Technologies & Applications (ICUT '09), Proceedings of the 4th International Conference on, pp. 1-5, 2009.
[21] NVIDIA, EXT_gpu_shader4 OpenGL extension, http://developer.download.nvidia.com/opengl/specs/GL_EXT_gpu_shader4.txt

[22] P. Franaszek, J. Robinson, and J. Thomas, "Parallel compression with cooperative dictionary construction", Data Compression Conference (DCC '96), Proceedings, pp. 200-209, 1996.
[23] Bzip2 compression algorithm, Julian Seward, http://www.bzip.org/
[24] M. Burrows and D.J. Wheeler, "A block-sorting lossless data compression algorithm", Technical Report 124, Digital Equipment Corporation, 1994.
[25] D.J. Bernstein, "Why switch from AES to a new stream cipher?", http://cr.yp.to/streamciphers/why.html
[26] Secure Hashing Algorithm (SHA-1) C implementation, Packetizer Inc., http://www.packetizer.com/security/sha1/
[27] D. Huffman, "A Method for the Construction of Minimum-Redundancy Codes", Proceedings of the IRE, vol. 40, pp. 1098-1101, 1952.