DIGITAL ASSIGNMENT – I
PROJECT MEMBERS
1. Sidharth PJ – 22BRS1086
2. Aravind N – 22BRS1099
3. Jesseman Devamirtham – 22BRS1112
CONTENTS
1. Introduction to CUDA
2. CUDA Programming
INTRODUCTION:
CUDA (Compute Unified Device Architecture) is a parallel
computing platform and application programming interface (API)
model created by NVIDIA. It allows developers to harness the power
of NVIDIA GPUs (Graphics Processing Units) for general-purpose
processing tasks, including data structure implementations and
algorithmic computations. By leveraging CUDA, developers can
achieve significant speedup compared to traditional CPU-based
implementations.
GPU (Graphics Processing Unit):
• Originally designed for rendering graphics but can be used for
general-purpose computation.
• Has a large number of cores (e.g., hundreds to thousands of
cores).
• Optimized for parallel processing, ideal for tasks with high data
parallelism.
• Well-suited for tasks requiring high-throughput computation on
large datasets.
CUDA PROGRAMMING:
CUDA provides a set of functions and syntax for programming
NVIDIA GPUs. Here's an overview of some key concepts:
Kernel Function: A function that runs on the GPU, executed in
parallel by many threads. Defined using the __global__ keyword.
Kernel Call: Launching a kernel function from the CPU code. Done
using <<<...>>> syntax.
Threads: Individual units of execution within a GPU. Thousands of
threads can run in parallel.
Blocks: Threads are organized into blocks for better management.
Blocks can contain multiple threads.
Grid: Collection of blocks. A kernel is executed as a grid of blocks.
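The relationship between kernels, the <<<...>>> launch syntax, threads, blocks, and grids can be sketched with a minimal program. This is an illustrative sketch, not part of the assignment code: the kernel name printIds and the 2-block-by-4-thread launch configuration are made up for the example.

```cuda
#include <stdio.h>

// Kernel function: runs on the GPU, once per thread.
__global__ void printIds(void) {
    // Each thread can identify itself by its block and thread index.
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    // Kernel call: a grid of 2 blocks, each with 4 threads (8 threads total).
    printIds<<<2, 4>>>();
    cudaDeviceSynchronize(); // Wait for the GPU to finish before exiting.
    return 0;
}
```

Compiled with nvcc, this prints one line per thread; the order of the lines is not guaranteed, since the threads run in parallel.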
Thread Indexing: Inside a kernel, each thread has a unique index that
identifies its position within the grid and block structure. These indices
are used to calculate which data element each thread should operate on.
Since the maximum number of threads differs from one GPU to
another, we need to know the configuration and architecture of the
device we are working with.
int count;
HANDLE_ERROR( cudaGetDeviceCount( &count ) );
After calling the above function (HANDLE_ERROR is an error-checking
wrapper macro that aborts if the CUDA call fails), we can iterate over
the available devices and query their properties. These properties are
stored in a C structure of type cudaDeviceProp.
Example structure:
struct cudaDeviceProp {
    char name[256];
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int regsPerBlock;
    int warpSize;
    size_t memPitch;
    int maxThreadsPerBlock;
    int maxThreadsDim[3];
    int maxGridSize[3];
    size_t totalConstMem;
    int major;
    int minor;
    int clockRate;
    size_t textureAlignment;
    int deviceOverlap;
    int multiProcessorCount;
    int kernelExecTimeoutEnabled;
    int integrated;
    int canMapHostMemory;
    int computeMode;
    int maxTexture1D;
    int maxTexture2D[2];
    int maxTexture3D[3];
    int maxTexture2DArray[3];
    int concurrentKernels;
};
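Putting cudaGetDeviceCount() and cudaDeviceProp together, a complete query program might look like the sketch below. The choice of which fields to print is illustrative; any field of the structure can be read the same way.

```cuda
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Found %d CUDA device(s)\n", count);

    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);  // Fill the struct for device i
        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:      %zu bytes\n", prop.totalGlobalMem);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Max threads/block:  %d\n", prop.maxThreadsPerBlock);
        printf("  Warp size:          %d\n", prop.warpSize);
    }
    return 0;
}
```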
DESCRIPTION OF DEVICE PROPERTIES
• name – an ASCII string identifying the device.
• totalGlobalMem – the amount of global memory on the device, in bytes.
• sharedMemPerBlock – the maximum shared memory a single block may use, in bytes.
• regsPerBlock – the number of 32-bit registers available per block.
• warpSize – the number of threads in a warp (32 on current NVIDIA hardware).
• maxThreadsPerBlock – the maximum number of threads a block may contain.
• maxThreadsDim – the maximum number of threads allowed along each dimension of a block.
• maxGridSize – the maximum number of blocks allowed along each dimension of a grid.
• major, minor – the major and minor revision of the device's compute capability.
• clockRate – the clock frequency of the device, in kilohertz.
• multiProcessorCount – the number of multiprocessors on the device.
EXAMPLE CUDA CODE FOR SIMPLE ADDITION:
#include <stdio.h>
#include <stdlib.h>

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}

int main() {
    int n = 1000;
    int *a, *b, *c;       // Host copies of a, b, c
    int *d_a, *d_b, *d_c; // Device copies of a, b, c
    int size = n * sizeof(int);

    // Allocate memory on the host
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);

    // Allocate memory on the GPU
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Initialize arrays a and b on the host
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() on the GPU with enough 256-thread blocks to cover n elements
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(a);
    free(b);
    free(c);
    return 0;
}
• Optimization Techniques: Explore and implement optimization
techniques specific to GPU programming (e.g., memory
coalescing, thread divergence reduction) to improve
performance.
• Analyzing Results: Analyze the performance results to
understand the impact of GPU parallelism on different data
structures and algorithms.
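As a hedged sketch of one such technique, memory coalescing: when consecutive threads access consecutive addresses, the hardware can merge a warp's loads into a few wide memory transactions, while a strided pattern scatters them across many transactions. The kernel names and the copy task below are illustrative, not part of the assignment code.

```cuda
#include <stdio.h>

// Coalesced access: thread k of a warp reads element base+k, so the 32
// threads of a warp touch 32 consecutive ints and their loads can be
// merged into a few wide memory transactions.
__global__ void copyCoalesced(const int *in, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];
}

// Strided access: adjacent threads read addresses `stride` elements apart,
// so a warp's loads are scattered across many transactions (slower).
__global__ void copyStrided(const int *in, int *out, int n, int stride) {
    int i = (threadIdx.x + blockIdx.x * blockDim.x) * stride;
    if (i < n) out[i] = in[i];
}

int main(void) {
    const int n = 1024;
    int h_in[1024], h_out[1024];
    for (int i = 0; i < n; i++) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // Run the coalesced version; the strided kernel is shown for contrast.
    copyCoalesced<<<n / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[100] = %d\n", h_out[100]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Both kernels copy the same data; the difference shows up only in memory throughput, which can be measured with CUDA events or a profiler.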
CONCLUSION:
CUDA excels at accelerating data-parallel processing tasks, making it
suitable for a wide range of applications including scientific computing,
image processing, data analytics, and more.