“A Technology that Makes the Supercomputer Personal”




• A supercomputer is a computer at the frontline of current processing capacity, particularly speed of calculation; performance is measured in petaflops.
• Supercomputers are used for highly calculation-intensive tasks, such as molecular modeling and climate research.

• Developed by companies such as Cray, IBM, and Hewlett-Packard.


• As of July 2009, the Cray Jaguar was the fastest supercomputer.
• In July 2010, “Nebulae” at NSCS overtook the Cray Jaguar in peak performance, at 2.98 PFlop/s.
• Nebulae used NVIDIA's Tesla C2050 GPUs with CUDA technology to boost its performance.

Nebulae Specification


GPU
• A graphics processing unit (GPU), also called a visual processing unit (VPU), is a specialized processor that offloads 3D or 2D graphics rendering from the microprocessor.
• Used in embedded systems, mobile phones, personal computers, workstations, and game consoles.

• Predominantly, the GPU was a supplement to the CPU. Now an NVIDIA Tesla GPU can compute up to 14 times faster than a CPU [NVIDIA GeForce GTX 280 GPU vs. Intel Core i7 960].


GPU Computing

• The excellent floating-point performance of GPUs led to the advent of General-Purpose Computing on GPUs (GPGPU).
• GPU computing is the use of a GPU to do general-purpose scientific and engineering computing.
• The model for GPU computing is to use a CPU and GPU together in a heterogeneous computing model.


GPGPU (General-Purpose GPU)
• Sequential part → CPU
• Computational part → GPU
• Computation and sequencing take place simultaneously.
• Huge performance boost compared with the traditional way of computing.
• Early GPGPU required graphics programming languages such as OpenGL and Cg.


Disadvantages of GPGPU

• Required graphics programming languages.
• Difficult for users to program in graphics languages.
• Developers had to make scientific applications look like graphics applications.
• Limited to the stream-processing model.


CUDA – Compute Unified Device Architecture
• A parallel computing architecture.
• Its parallel, or “many-core,” architecture runs thousands of threads simultaneously.
• A computing engine: scientific applications compile directly, with no graphics API required.
• Programmers use “C for CUDA” (C with NVIDIA extensions).
• Compiled through the PathScale Open64 C compiler.
• Third-party wrappers are available for Python, FORTRAN, Java, and MATLAB.



Advantages of CUDA

• CUDA with industry-standard C
  – Write a program for one thread
  – Instantiate it on many parallel threads
  – Familiar programming model and language
• CUDA is a scalable parallel programming model
  – Programs run on any number of processors without recompiling
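The “write one thread, instantiate it on many” model can be sketched with the standard SAXPY example (this illustrative kernel is not from the original slides, and data initialization is elided):

```cuda
// One thread computes one element: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in IDs
    if (i < n)                                      // guard stray threads
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    // ... initialize d_x and d_y with cudaMemcpy ...

    // 256 threads per block; enough blocks to cover all n elements.
    // The grid size scales with the problem, not the processor count,
    // so the same binary runs on any number of multiprocessors.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

The kernel is ordinary scalar C; the hardware, not the programmer, maps it onto however many cores are present.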

• Scattered reads (arbitrary addressing)
• Shared memory (16 KB per multiprocessor)
• Faster downloads and readbacks to and from the GPU
• Full support for integer and bitwise operations, including integer texture lookups

CUDA Programming Model
• Parallel code (a kernel) is launched and executed on a device by many threads.
• Threads are grouped into thread blocks.
• Parallel code is written for a single thread:
  – Each thread is free to execute a unique code path
  – Built-in thread and block ID variables

CUDA Architecture
• The CUDA architecture consists of several components:
  – Parallel compute engines
  – OS kernel-level support
  – User-mode driver (device API)
  – ISA (Instruction Set Architecture) for parallel computing

Tesla 10 Series

• CUDA computing with the Tesla T10
• 240 SP processors at 1.45 GHz: 1 TFLOPS peak (single precision)
• 30 DP processors at 1.44 GHz: 86 GFLOPS peak (double precision)
• 128 threads per processor: 30,720 threads total

Thread Hierarchy
• Threads launched for a parallel section are partitioned into thread blocks.
  – Grid = all blocks for a given launch
• A thread block is a group of threads that can:
  – Synchronize their execution
  – Communicate via shared memory
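Block-level cooperation can be sketched with a per-block sum reduction (an illustrative example, not from the original slides; it assumes a launch with 256 threads per block):

```cuda
#define BLOCK 256  // must match the threads-per-block at launch

// Threads in a block cooperate via shared memory and __syncthreads();
// blocks are independent, so each block produces one partial sum.
__global__ void block_sum(const float *in, float *block_out)
{
    __shared__ float buf[BLOCK];          // visible to the whole block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      // wait until all loads land

    // Tree reduction: halve the number of active threads each step.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                  // sync before the next step
    }
    if (tid == 0)
        block_out[blockIdx.x] = buf[0];   // one result per block
}
```

Because blocks cannot synchronize with each other, the per-block partial sums are combined on the host or by a second kernel launch.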

Warps and Half Warps

GPU Memory Allocation / Release
• The host (CPU) manages device (GPU) memory:
  – cudaMalloc(void **pointer, size_t nbytes)
  – cudaMemset(void *pointer, int value, size_t count)
  – cudaFree(void *pointer)
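The host-side calls listed above combine into a minimal allocate/zero/read-back/free round trip (a sketch; cudaMemcpy is assumed in addition to the calls listed):

```cuda
#include <stdlib.h>

int main(void)
{
    const size_t nbytes = 1024 * sizeof(int);
    int *d_buf = NULL;                       // device pointer, held by the host

    cudaMalloc((void **)&d_buf, nbytes);     // allocate on the GPU
    cudaMemset(d_buf, 0, nbytes);            // zero-fill device memory

    int *h_buf = (int *)malloc(nbytes);      // ordinary host allocation
    cudaMemcpy(h_buf, d_buf, nbytes,         // read back to the host
               cudaMemcpyDeviceToHost);

    free(h_buf);
    cudaFree(d_buf);                         // release device memory
    return 0;
}
```

Note that `d_buf` is only a handle on the host side; dereferencing it in host code is invalid, which is why all access goes through the cuda* calls.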

Why Should I Use a GPU as a Processor?
• Compared to the latest quad-core CPU, Tesla 20-series GPU computing processors deliver equivalent performance at 1/20th the power consumption and 1/10th the cost.
• When a computational fluid dynamics problem is solved, it takes:
  – 9 minutes on a Tesla S870 (4 GPUs)
  – 12 hours on one 2.5 GHz CPU core

Double Precision Performance
• Intel Core i7 980XE: 107.6 GFLOPS (CPU)
• AMD Hemlock 5970: 928 GFLOPS (GPU)
• NVIDIA Tesla S2050 & S2070: 2.1–2.5 TFLOPS (GPU)

• GeForce 8800 GTX: 346 GFLOPS (GPU)
• Core 2 Duo E6600: 38 GFLOPS
• Athlon 64 X2 4600+: 19 GFLOPS

Applications of CUDA
1. Accelerated rendering of 3D graphics
2. Video forensics
3. Molecular dynamics
4. Computational chemistry
5. Life sciences
6. Bioinformatics
7. Medical imaging
8. Gaming industry
9. Weather and ocean modeling
10. Electronic design automation
11. Real-time cloth simulation
12. Video imaging
13. Video acceleration

The G80 Architecture

• Supports C.
• Used scalar thread processors, eliminating the need to manage vector registers.
• SIMT (Single Instruction, Multiple Thread) execution.
• Shared memory.
• Next major version: GT200 (GeForce GTX 280, Quadro FX 5800, Tesla T10).
• Increased CUDA cores from 128 to 240.
• Introduced double-precision floating point.

Next Generation CUDA Architecture
• The next-generation CUDA architecture, codenamed Fermi, is the most advanced GPU architecture ever built. Its features include:
  – 512 CUDA cores
  – 3.2 billion transistors
  – NVIDIA Parallel DataCache™ technology
  – NVIDIA GigaThread™ engine
  – ECC support

Next Generation CUDA Compute Architecture
Fermi addresses features that users requested:
• Improved double-precision performance.
• ECC memory (triple memory redundancy).
• A true cache hierarchy.
• More than 16 KB of SM shared memory to speed up applications.
• Faster context switches between application programs, and faster graphics–computing interoperation.
• Faster read-modify-write atomic operations for parallel algorithms.


Architectural Improvements • Third Generation Streaming Multiprocessor (SM) • Second Generation Parallel Thread Execution ISA • Improved Memory Subsystem • NVIDIA GigaThread™ Engine

Third Generation Streaming Multiprocessor (SM)
• 32 CUDA cores per SM, 4x over GT200.
• 8x the peak double-precision floating-point performance of GT200.
• Dual warp scheduler that schedules and dispatches two warps of 32 threads per clock.
• 64 KB of RAM with a configurable partitioning of shared memory and L1 cache.
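The configurable shared-memory/L1 partition is selected per kernel from the host; a minimal sketch, assuming a hypothetical `stencil` kernel that benefits from the larger shared-memory split:

```cuda
__global__ void stencil(float *data)
{
    // ... kernel body that uses a large amount of shared memory ...
}

int main(void)
{
    // Ask the runtime to favor shared memory (48 KB shared / 16 KB L1)
    // for this kernel, instead of the default partitioning.
    cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferShared);
    // ... launch stencil<<<grid, block>>>(...) as usual ...
    return 0;
}
```

Kernels with irregular, pointer-chasing access patterns would instead use `cudaFuncCachePreferL1` to get the 48 KB L1 cache.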

Second Generation Parallel Thread Execution ISA
• Unified address space with full C++ support.
• Optimized for OpenCL and DirectCompute.
• Full IEEE 754-2008 32-bit and 64-bit precision.
• Full 32-bit integer path with 64-bit extensions.
• Memory access instructions to support the transition to 64-bit addressing.
• Improved performance through predication.

Improved Memory Subsystem
• NVIDIA Parallel DataCache™ hierarchy with configurable L1 and unified L2 caches.
• First GPU with ECC memory support.
• Greatly improved atomic memory operation performance.
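Atomic operations matter whenever many threads update the same location; a histogram is the classic case (an illustrative kernel, not from the original slides, assuming `bins` holds 256 zero-initialized counters):

```cuda
// Many threads may hit the same bin concurrently; atomicAdd makes each
// read-modify-write indivisible, so no increments are lost.
__global__ void histogram(const unsigned char *data, int n,
                          unsigned int *bins)  // 256 bins, zeroed beforehand
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}
```

Without the atomic, two threads reading the same counter before either writes it back would drop one increment; this is exactly the operation Fermi's memory subsystem accelerates.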

NVIDIA GigaThread™ Engine
• 10x faster application context switching
• Concurrent kernel execution
• Out-of-order thread block execution
• Dual overlapped memory transfer engines

• Nexus: the first development environment for massively parallel computing.
• Supports massively parallel CUDA C, OpenCL, and DirectCompute.
• Works within Visual Studio 2010.
• Manages massive parallelism.
• Real-time benchmarking.


CUDA computing represents a new direction in parallel computing. It is the result of a radical rethinking of the role, purpose, and capability of the GPU. With this technology, atomic operations can run up to twenty times faster. On the software side, it gives more power to computation and provides high performance, and it can be used in massively parallel GPU computing applications. In general, we can say CUDA is a technology that makes the “supercomputer personal.” With its combination of groundbreaking performance, functionality, and simplicity, CUDA computing represents a revolution in GPU computing.


Any Questions?
