Table Of Contents

Chapter 1 Introduction
1.1 From Graphics Processing to General-Purpose Parallel Computing
Figure 1-2. The GPU Devotes More Transistors to Data Processing
1.2 CUDA™: a General-Purpose Parallel Computing Architecture
1.3 A Scalable Programming Model
1.4 Document’s Structure
Figure 2-1. Grid of Thread Blocks
2.3 Memory Hierarchy
2.4 Heterogeneous Programming
2.5 Compute Capability
Chapter 3 Programming Interface
3.1 Compilation with NVCC
3.1.1 Compilation Workflow
3.1.2 Binary Compatibility
3.1.3 PTX Compatibility
3.1.4 Application Compatibility
3.1.5 C/C++ Compatibility
3.1.6 64-Bit Compatibility
3.2 CUDA C
3.2.1 Device Memory
3.2.2 Shared Memory
Figure 3-1. Matrix Multiplication without Shared Memory
Figure 3-2. Matrix Multiplication with Shared Memory
3.2.3 Multiple Devices
3.2.4 Texture Memory
Texture Reference Declaration
Runtime Texture Reference Attributes
Texture Binding
3.2.5 Surface Memory
3.2.6 Page-Locked Host Memory
Portable Memory
Write-Combining Memory
Mapped Memory
3.2.7 Asynchronous Concurrent Execution
Concurrent Execution between Host and Device
Overlap of Data Transfer and Kernel Execution
Concurrent Kernel Execution
Concurrent Data Transfers
Stream
Event
Synchronous Calls
3.2.8 Graphics Interoperability
OpenGL Interoperability
Direct3D Interoperability
3.2.9 Error Handling
3.3 Driver API
3.3.1 Context
3.3.2 Module
3.3.3 Kernel Execution
3.3.4 Device Memory
3.3.5 Shared Memory
3.3.6 Multiple Devices
3.3.7 Texture Memory
3.3.8 Surface Memory
3.3.9 Page-Locked Host Memory
3.3.10 Asynchronous Concurrent Execution
Stream
Event Management
Synchronous Calls
3.3.11 Graphics Interoperability
OpenGL Interoperability
Direct3D Interoperability
3.3.12 Error Handling
3.4 Interoperability between Runtime and Driver APIs
3.5 Versioning and Compatibility
3.6 Compute Modes
3.7 Mode Switches
Chapter 4 Hardware Implementation
4.1 SIMT Architecture
4.2 Hardware Multithreading
4.3 Multiple Devices
Chapter 5 Performance Guidelines
5.1 Overall Performance Optimization Strategies
5.2 Maximize Utilization
5.2.1 Application Level
5.2.2 Device Level
5.2.3 Multiprocessor Level
5.3 Maximize Memory Throughput
5.3.1 Data Transfer between Host and Device
5.3.2 Device Memory Accesses
Global Memory
Local Memory
5.4.1 Arithmetic Instructions
5.4.2 Control Flow Instructions
5.4.3 Synchronization Instruction
Appendix A CUDA-Enabled GPUs
B.2.3 __shared__
B.2.4 volatile
B.2.5 Restrictions
B.3 Built-in Vector Types
B.3.2 dim3
B.4 Built-in Variables
B.4.1 gridDim
B.4.2 blockIdx
B.4.3 blockDim
B.4.4 threadIdx
B.4.5 warpSize
B.4.6 Restrictions
B.5 Memory Fence Functions
B.6 Synchronization Functions
B.7 Mathematical Functions
B.8 Texture Functions
B.8.1 tex1Dfetch()
B.8.2 tex1D()
B.8.3 tex2D()
B.8.4 tex3D()
B.9 Surface Functions
B.9.1 surf1Dread()
B.9.2 surf1Dwrite()
B.9.3 surf2Dread()
B.9.4 surf2Dwrite()
B.10 Time Function
B.11 Atomic Functions
B.11.1 Arithmetic Functions
B.11.1.1 atomicAdd()
B.11.1.2 atomicSub()
B.11.1.3 atomicExch()
B.11.1.4 atomicMin()
B.11.1.5 atomicMax()
B.11.1.6 atomicInc()
B.11.1.7 atomicDec()
B.11.1.8 atomicCAS()
B.11.2 Bitwise Functions
B.11.2.1 atomicAnd()
B.11.2.2 atomicOr()
B.11.2.3 atomicXor()
B.12 Warp Vote Functions
B.13 Profiler Counter Function
B.14 Formatted Output
B.14.1 Format Specifiers
B.14.2 Limitations
B.14.3 Associated Host-Side API
B.14.4 Examples
B.15 Execution Configuration
B.16 Launch Bounds
Appendix C Mathematical Functions
C.1 Standard Functions
C.1.1 Single-Precision Floating-Point Functions
C.1.2 Double-Precision Floating-Point Functions
C.1.3 Integer Functions
C.2 Intrinsic Functions
C.2.1 Single-Precision Floating-Point Functions
C.2.2 Double-Precision Floating-Point Functions
C.2.3 Integer Functions
Appendix D C++ Language Constructs
D.1 Polymorphism
D.2 Default Parameters
D.3 Operator Overloading
D.4 Namespaces
D.5 Function Templates
D.6 Classes
D.6.1 Example 1 Pixel Data Type
D.6.2 Example 2 Functor Class
Appendix E NVCC Specifics
E.1 __noinline__ and __forceinline__
E.2 #pragma unroll
E.3 __restrict__
Appendix F Texture Fetching
F.1 Nearest-Point Sampling
F.2 Linear Filtering
F.3 Table Lookup
Appendix G Compute Capabilities
G.1 Features and Technical Specifications
G.2 Floating-Point Standard
G.3 Compute Capability 1.x
G.3.1 Architecture
G.3.2 Global Memory
G.3.2.1 Devices of Compute Capability 1.0 and 1.1
G.3.2.2 Devices of Compute Capability 1.2 and 1.3
G.3.3 Shared Memory
G.3.3.1 32-Bit Strided Access
G.3.3.2 32-Bit Broadcast Access
G.3.3.3 8-Bit and 16-Bit Access
G.3.3.4 Larger Than 32-Bit Access
G.4 Compute Capability 2.0
G.4.1 Architecture
G.4.2 Global Memory
G.4.3.2 Larger Than 32-Bit Access
NVIDIA CUDA C Programming Guide 3.1

Published by: 邱吉震 on Sep 08, 2010
Copyright: Attribution Non-commercial

