
COMPUTER ARCHITECTURE AND ORGANISATION

DIGITAL ASSIGNMENT – I

DATA STRUCTURES IMPLEMENTATION USING


CUDA PROGRAMMING

PROJECT MEMBERS
1. Sidharth PJ – 22BRS1086
2. Aravind N – 22BRS1099
3. Jesseman Devamirtham – 22BRS1112

UNDER THE GUIDANCE OF – Prof. Nivedita M.


COURSE CODE – BCSE205L
SLOT – F1

CONTENTS
1. Introduction to CUDA
2. CUDA Programming
3. CUDA Memory Management
4. CUDA Device Specific Properties
5. Description of Device Properties
6. Example CUDA code for simple addition
7. CUDA Project Objectives
8. Description for Project

INTRODUCTION:
CUDA (Compute Unified Device Architecture) is a parallel
computing platform and application programming interface (API)
model created by NVIDIA. It allows developers to harness the power
of NVIDIA GPUs (Graphics Processing Units) for general-purpose
processing tasks, including data structure implementations and
algorithmic computations. By leveraging CUDA, developers can
achieve significant speedup compared to traditional CPU-based
implementations.

CPU (Central Processing Unit):


• Designed for general-purpose computation.
• Typically has fewer cores (e.g., 4 to 64 cores).
• Optimized for low-latency task execution.
• Well-suited for tasks requiring complex decision-making and
branching logic.

Simple “Hello World” Code in CPU:


#include <stdio.h>

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}

GPU (Graphics Processing Unit):
• Originally designed for rendering graphics but can be used for
general-purpose computation.
• Has a large number of cores (e.g., hundreds to thousands of
cores).
• Optimized for parallel processing, ideal for tasks with high data
parallelism.
• Well-suited for tasks requiring high-throughput computation on
large datasets.

“Hello World” in GPU:


#include <stdio.h>

__global__ void kernel( void ) {
    printf( "Hello, World!\n" );
}

int main( void ) {
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();   // wait for the kernel to finish so its output is flushed
    return 0;
}

CUDA C adds the __global__ qualifier to standard C. This qualifier alerts the compiler that a function should be compiled to run on the device instead of the host. In this simple example, nvcc hands the function kernel() to the compiler that handles device code and feeds main() to the host compiler, just as it would for the plain CPU example above.
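For reference, CUDA source files are normally given a .cu extension and compiled with nvcc; assuming the file above is saved as hello.cu, a command such as nvcc hello.cu -o hello builds both the host and device parts.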

CUDA PROGRAMMING:
CUDA provides a set of functions and syntax for programming
NVIDIA GPUs. Here's an overview of some key concepts:
Kernel Function: A function that runs on the GPU. It is executed in
parallel by multiple threads and is defined using the __global__ keyword.
Kernel Call: Launching a kernel function from the CPU code. Done
using <<<...>>> syntax.
Threads: Individual units of execution within a GPU. Thousands of
threads can run in parallel.
Blocks: Threads are organized into blocks for better management.
Blocks can contain multiple threads.
Grid: Collection of blocks. A kernel is executed as a grid of blocks.
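Putting these terms together, a minimal sketch (the kernel name whoAmI and the launch configuration are illustrative): the kernel below is launched as a grid of two blocks, each containing four threads, and every thread prints its own coordinates.

#include <stdio.h>

// Kernel: every thread prints the block and thread it belongs to.
__global__ void whoAmI( void ) {
    printf( "block %d, thread %d\n", blockIdx.x, threadIdx.x );
}

int main( void ) {
    // Kernel call: a grid of 2 blocks, each block containing 4 threads.
    whoAmI<<<2,4>>>();
    cudaDeviceSynchronize();   // wait for the kernel so its output is flushed
    return 0;
}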

CUDA MEMORY MANAGEMENT:


A. cudaMalloc: Allocates memory on the GPU.
B. cudaMemcpy: Copies data between CPU and GPU memory.
C. cudaFree: Frees memory allocated on the GPU.
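A minimal sketch of the usual round trip with these three calls (the array h_data and its contents are illustrative; the file is assumed to be compiled as a .cu file with nvcc):

int main( void ) {
    int h_data[4] = {1, 2, 3, 4};   // host (CPU) array
    int *d_data;                    // device (GPU) pointer

    cudaMalloc( (void **)&d_data, sizeof(h_data) );                         // A. allocate GPU memory
    cudaMemcpy( d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice );   // B. copy host -> device
    // ... a kernel would operate on d_data here ...
    cudaMemcpy( h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost );   // B. copy device -> host
    cudaFree( d_data );                                                     // C. release GPU memory

    return 0;
}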
Thread Blocks: The GPU organizes threads into groups called "thread
blocks." Each thread block can contain at most a fixed number of threads,
and this limit varies between GPU models and architectures. For example,
for devices of compute capability 2.x, the maximum number of threads per
block is 1024.
Grids: Thread blocks are organized into a grid. The grid is the
collection of all thread blocks launched to execute the kernel.

Thread Indexing: Inside a kernel, each thread has a unique index that
identifies its position within the grid and block structure. These indices
are used to calculate which data element each thread should operate on.
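With one-dimensional blocks and grids, the usual index calculation looks like the kernel fragment below (the name fillWithIndex is illustrative); the addition example later in this report uses the same pattern.

// Each thread computes the global index of the element it is responsible for.
__global__ void fillWithIndex( int *out, int n ) {
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;   // position of this thread within the grid
    if (globalIdx < n) {                                      // guard: the last block may be only partly used
        out[globalIdx] = globalIdx;
    }
}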
Since the maximum number of threads per block differs from one GPU to
another, we need to know the configuration and architecture of the device
we are working with.

CUDA DEVICE SPECIFIC PROPERTIES:


There can be more than one CUDA-capable device interfaced to a system.
To count the number of devices, we use cudaGetDeviceCount().

int count;
HANDLE_ERROR( cudaGetDeviceCount( &count ) );   // HANDLE_ERROR is a helper macro (not shown here) that checks the returned cudaError_t

After calling the above function, we can iterate through the devices and
query the relevant properties of each one. All these properties are stored
in a C structure of type cudaDeviceProp.
Example structure:

struct cudaDeviceProp {
    char   name[256];                // ASCII name identifying the device
    size_t totalGlobalMem;           // total global memory on the device, in bytes
    size_t sharedMemPerBlock;        // shared memory available per block, in bytes
    int    regsPerBlock;             // 32-bit registers available per block
    int    warpSize;                 // number of threads in a warp
    size_t memPitch;                 // maximum pitch allowed by memory copies, in bytes
    int    maxThreadsPerBlock;       // maximum number of threads per block
    int    maxThreadsDim[3];         // maximum size of each dimension of a block
    int    maxGridSize[3];           // maximum size of each dimension of a grid
    size_t totalConstMem;            // constant memory available, in bytes
    int    major;                    // major compute capability
    int    minor;                    // minor compute capability
    int    clockRate;                // clock frequency, in kilohertz
    size_t textureAlignment;         // alignment requirement for textures
    int    deviceOverlap;            // device can copy memory and execute a kernel concurrently
    int    multiProcessorCount;      // number of multiprocessors on the device
    int    kernelExecTimeoutEnabled; // there is a run-time limit on kernels
    int    integrated;               // device is an integrated (motherboard) GPU
    int    canMapHostMemory;         // device can map host memory into its address space
    int    computeMode;              // compute mode (default, exclusive, prohibited)
    int    maxTexture1D;             // maximum 1D texture size
    int    maxTexture2D[2];          // maximum 2D texture dimensions
    int    maxTexture3D[3];          // maximum 3D texture dimensions
    int    maxTexture2DArray[3];     // maximum 2D texture array dimensions
    int    concurrentKernels;        // device can execute multiple kernels concurrently
};

DESCRIPTION OF DEVICE PROPERTIES
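Each field of cudaDeviceProp describes one aspect of the device (see the comments in the structure above). A minimal sketch of how these properties can be queried and printed for every device in the system is given below; the handful of fields shown is an illustrative selection.

#include <stdio.h>

int main( void ) {
    int count;
    cudaGetDeviceCount( &count );

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, dev );   // fill the structure for device 'dev'

        printf( "Device %d: %s\n", dev, prop.name );
        printf( "  Compute capability  : %d.%d\n", prop.major, prop.minor );
        printf( "  Global memory       : %zu bytes\n", prop.totalGlobalMem );
        printf( "  Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock );
        printf( "  Max threads / block : %d\n", prop.maxThreadsPerBlock );
        printf( "  Multiprocessors     : %d\n", prop.multiProcessorCount );
        printf( "  Warp size           : %d\n", prop.warpSize );
    }
    return 0;
}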

EXAMPLE CUDA CODE FOR SIMPLE ADDITION:
#include <stdio.h>
#include <stdlib.h>

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}

int main() {
    int n = 1000;
    int *a, *b, *c;          // Host copies of a, b, c
    int *d_a, *d_b, *d_c;    // Device copies of a, b, c
    int size = n * sizeof(int);

    // Allocate memory on the host
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);

    // Allocate memory on the GPU
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Initialize arrays a and b on the host
    // ...

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on the GPU with enough 256-thread blocks to cover n elements
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(a);
    free(b);
    free(c);
    return 0;
}

CUDA PROJECT OBJECTIVES:


• Implement Common Data Structures: Implement linear data
structures (arrays, linked lists) and non-linear data structures
(trees, graphs) in CUDA C.
• Implement Algorithms: Implement searching algorithms (e.g.,
linear search, binary search) and sorting algorithms (e.g., bubble
sort, quicksort) using CUDA C for these data structures; a minimal
parallel linear-search sketch follows this list.
• Conduct Comparative Study: Measure and compare the
runtime performance of these algorithms on CPU and GPU for
varying input sizes.

• Optimization Techniques: Explore and implement optimization
techniques specific to GPU programming (e.g., memory
coalescing, thread divergence reduction) to improve
performance.
• Analyzing Results: Analyze the performance results to
understand the impact of GPU parallelism on different data
structures and algorithms.
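As a rough illustration of the searching objective (the kernel, the data and the key below are illustrative and not the project's final implementation), a parallel linear search can assign one array element to each thread and combine matches with an atomic operation:

#include <stdio.h>
#include <stdlib.h>

// Each thread checks one element; any thread that finds the key records its index.
__global__ void linearSearch(const int *data, int n, int key, int *foundIndex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == key) {
        atomicMin(foundIndex, i);   // keep the smallest matching index
    }
}

int main( void ) {
    const int n = 1 << 20;
    const int key = 12345;
    int *h_data = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) h_data[i] = i;   // illustrative data

    int h_found = n;                // n means "not found"
    int *d_data, *d_found;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMalloc((void **)&d_found, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_found, &h_found, sizeof(int), cudaMemcpyHostToDevice);

    linearSearch<<<(n + 255) / 256, 256>>>(d_data, n, key, d_found);
    cudaMemcpy(&h_found, d_found, sizeof(int), cudaMemcpyDeviceToHost);

    if (h_found < n) printf("key %d found at index %d\n", key, h_found);
    else             printf("key %d not found\n", key);

    cudaFree(d_data);
    cudaFree(d_found);
    free(h_data);
    return 0;
}

A CPU version of the same search would visit the elements one by one; that sequential run time is the baseline against which the comparative study measures the GPU implementation.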

DESCRIPTION FOR PROJECT:


• In the project, we aim to harness the power of CUDA (Compute
Unified Device Architecture) programming to expedite output
processing tasks by employing a variety of data structures
optimised for parallel computing. CUDA, developed by
NVIDIA, allows for efficient utilisation of GPU (Graphics
Processing Unit) resources to accelerate the computational tasks.
• We will explore a diverse range of data structures such as arrays,
linked lists, trees, graphs, queues and stacks, adapting them to the
parallel processing paradigm offered by CUDA. By leveraging
the massive parallelism inherent in GPU architecture, we can
significantly enhance the speed and efficiency of output
processing tasks compared to traditional CPU-based approaches.
• The project will involve implementing these data structures in
CUDA, optimising memory access patterns, and utilising CUDA
threads efficiently to distribute workloads across GPU cores.
Through careful design and tuning, we aim to achieve substantial
performance gains in tasks like data manipulation, sorting,
searching and other output processing operations.
• By combining CUDA programming with a comprehensive set of
data structures, this project offers a compelling solution for
accelerating output processing tasks, making it suitable for a wide
range of applications including scientific computing, image
processing, data analytics and more.
