Table Of Contents

Chapter 1 Introduction
1.1 From Graphics Processing to General-Purpose Parallel Computing
Figure 1-2. The GPU Devotes More Transistors to Data Processing
1.2 CUDA™: a General-Purpose Parallel Computing Architecture
1.3 A Scalable Programming Model
1.4 Document’s Structure
Chapter 2 Programming Model
2.1 Kernels
2.2 Thread Hierarchy
Figure 2-1. Grid of Thread Blocks
2.3 Memory Hierarchy
2.4 Heterogeneous Programming
2.5 Compute Capability
Chapter 3 Programming Interface
3.1 Compilation with NVCC
3.1.1 Compilation Workflow
Offline Compilation
Just-in-Time Compilation
3.1.2 Binary Compatibility
3.1.3 PTX Compatibility
3.1.4 Application Compatibility
3.1.5 C/C++ Compatibility
3.1.6 64-Bit Compatibility
3.2 CUDA C Runtime
3.2.1 Initialization
3.2.2 Device Memory
3.2.3 Shared Memory
Figure 3-1. Matrix Multiplication without Shared Memory
Mapped Memory
3.2.5 Asynchronous Concurrent Execution
Concurrent Execution between Host and Device
Overlap of Data Transfer and Kernel Execution
Concurrent Kernel Execution
Concurrent Data Transfers
Streams
Events
Synchronous Calls
3.2.6 Multi-Device System
Device Enumeration
Device Selection
Stream and Event Behavior
Peer-to-Peer Memory Access
Peer-to-Peer Memory Copy
3.2.7 Unified Virtual Address Space
3.2.8 Error Checking
3.2.9 Call Stack
3.2.10 Texture and Surface Memory
Texture Memory
Surface Memory
CUDA Arrays
Read/Write Coherency
3.2.11 Graphics Interoperability
OpenGL Interoperability
Direct3D Interoperability
SLI Interoperability
3.3 Versioning and Compatibility
3.4 Compute Modes
3.5 Mode Switches
3.6 Tesla Compute Cluster Mode for Windows
Chapter 4 Hardware Implementation
4.1 SIMT Architecture
4.2 Hardware Multithreading
Chapter 5 Performance Guidelines
5.1 Overall Performance Optimization Strategies
5.2 Maximize Utilization
5.2.1 Application Level
5.2.2 Device Level
5.2.3 Multiprocessor Level
5.3 Maximize Memory Throughput
5.3.1 Data Transfer between Host and Device
5.3.2 Device Memory Accesses
Global Memory
Local Memory
Shared Memory
Constant Memory
Texture and Surface Memory
5.4 Maximize Instruction Throughput
5.4.1 Arithmetic Instructions
5.4.2 Control Flow Instructions
5.4.3 Synchronization Instruction
Appendix A CUDA-Enabled GPUs
Appendix B C Language Extensions
B.1 Function Type Qualifiers
B.1.1 __device__
B.1.2 __global__
B.1.3 __host__
B.1.4 __noinline__ and __forceinline__
B.2 Variable Type Qualifiers
B.2.1 __device__
B.2.2 __constant__
B.2.3 __shared__
B.2.4 __restrict__
B.3 Built-in Vector Types
B.3.2 dim3
B.4 Built-in Variables
B.4.1 gridDim
B.4.2 blockIdx
B.4.3 blockDim
B.4.4 threadIdx
B.4.5 warpSize
B.5 Memory Fence Functions
B.6 Synchronization Functions
B.7 Mathematical Functions
B.9.1 surf1Dread()
B.9.2 surf1Dwrite()
B.9.3 surf2Dread()
B.9.4 surf2Dwrite()
B.9.5 surf3Dread()
B.9.6 surf3Dwrite()
B.9.7 surf1DLayeredread()
B.9.8 surf1DLayeredwrite()
B.9.9 surf2DLayeredread()
B.9.10 surf2DLayeredwrite()
B.9.11 surfCubemapread()
B.9.12 surfCubemapwrite()
B.9.13 surfCubemapLayeredread()
B.9.14 surfCubemapLayeredwrite()
B.10 Time Function
B.11 Atomic Functions
B.11.1 Arithmetic Functions
B.11.1.1 atomicAdd()
B.11.1.2 atomicSub()
B.11.1.3 atomicExch()
B.11.1.4 atomicMin()
B.11.1.5 atomicMax()
B.11.1.6 atomicInc()
B.11.1.7 atomicDec()
B.11.1.8 atomicCAS()
B.11.2 Bitwise Functions
B.11.2.1 atomicAnd()
B.11.2.2 atomicOr()
B.11.2.3 atomicXor()
B.12 Warp Vote Functions
B.13 Profiler Counter Function
B.14 Formatted Output
B.14.1 Format Specifiers
B.14.2 Limitations
B.14.3 Associated Host-Side API
B.14.4 Examples
B.15 Dynamic Global Memory Allocation
B.15.1 Heap Memory Allocation
B.15.2 Interoperability with Host Memory API
B.15.3 Examples
B.15.3.1 Per Thread Allocation
B.15.3.2 Per Thread Block Allocation
B.15.3.3 Allocation Persisting Between Kernel Launches
B.16 Execution Configuration
B.17 Launch Bounds
B.18 #pragma unroll
Appendix C Mathematical Functions
C.1 Standard Functions
C.1.1 Single-Precision Floating-Point Functions
C.1.2 Double-Precision Floating-Point Functions
C.2 Intrinsic Functions
C.2.1 Single-Precision Floating-Point Functions
C.2.2 Double-Precision Floating-Point Functions
Appendix D C/C++ Language Support
D.1 Code Samples
D.1.1 Data Aggregation Class
D.1.2 Derived Class
D.1.3 Class Template
D.1.4 Function Template
D.1.5 Functor Class
D.2 Restrictions
D.2.1 Qualifiers
D.2.1.1 Device Memory Qualifiers
D.2.1.2 Volatile Qualifier
D.2.2 Pointers
D.2.3 Operators
D.2.3.1 Assignment Operator
D.2.3.2 Address Operator
D.2.4 Functions
D.2.4.1 Function Parameters
D.2.4.2 Static Variables within Function
D.2.4.3 Function Pointers
D.2.4.4 Function Recursion
D.2.5 Classes
D.2.5.1 Data Members
D.2.5.2 Function Members
D.2.5.3 Constructors and Destructors
D.2.5.4 Virtual Functions
D.2.5.5 Virtual Base Classes
D.2.5.6 Windows-Specific
D.2.6 Templates
Appendix E Texture Fetching
E.1 Nearest-Point Sampling
E.2 Linear Filtering
E.3 Table Lookup
Appendix F Compute Capabilities
F.1 Features and Technical Specifications
F.2 Floating-Point Standard
F.3 Compute Capability 1.x
F.3.1 Architecture
F.3.2 Global Memory
F.3.2.1 Devices of Compute Capability 1.0 and 1.1
F.3.2.2 Devices of Compute Capability 1.2 and 1.3
F.3.3 Shared Memory
F.3.3.1 32-Bit Strided Access
F.3.3.2 32-Bit Broadcast Access
F.3.3.3 8-Bit and 16-Bit Access
F.3.3.4 Larger Than 32-Bit Access
F.4 Compute Capability 2.x
F.4.1 Architecture
F.4.2 Global Memory
F.4.3 Shared Memory
F.4.3.1 32-Bit Strided Access
F.4.3.2 Larger Than 32-Bit Access
F.4.4 Constant Memory
Appendix G Driver API
G.1 Context
G.2 Module
G.3 Kernel Execution
G.4 Interoperability between Runtime and Driver APIs
CUDA C Programming Guide

Published by: maurolw on Apr 06, 2012
Copyright: Attribution Non-commercial

