Table Of Contents

Chapter 1 Parallel Computing with CUDA
1.1 Heterogeneous Computing with CUDA
1.1.1 Differences Between Host and Device
1.1.2 What Runs on a CUDA-Enabled Device?
1.1.3 Maximum Performance Benefit
1.2 Understanding the Programming Environment
1.2.1 CUDA Compute Capability
1.2.2 Additional Hardware Data
1.2.3 C Runtime for CUDA and Driver API Version
1.2.4 Which Version to Target
1.3 CUDA APIs
1.3.1 C Runtime for CUDA
1.3.2 CUDA Driver API
1.3.3 When to Use Which API
1.3.4 Comparing Code for Different APIs
Chapter 2 Performance Metrics
2.1 Timing
2.1.1 Using CPU Timers
2.1.2 Using CUDA GPU Timers
2.2 Bandwidth
2.2.1 Theoretical Bandwidth Calculation
2.2.2 Effective Bandwidth Calculation
2.2.3 Throughput Reported by cudaprof
Chapter 3 Memory Optimizations
3.1 Data Transfer Between Host and Device
3.1.1 Pinned Memory
3.1.2 Asynchronous Transfers and Overlapping Transfers with Computation
3.1.3 Zero Copy
3.2 Device Memory Spaces
3.2.1 Coalesced Access to Global Memory
3.2.1.1 A Simple Access Pattern
3.2.1.2 A Sequential but Misaligned Access Pattern
3.2.1.3 Effects of Misaligned Accesses
3.2.1.4 Strided Accesses
3.2.2 Shared Memory
3.2.2.1 Shared Memory and Memory Banks
3.2.2.2 Shared Memory in Matrix Multiplication (C = AB)
3.2.2.3 Shared Memory in Matrix Multiplication (C = AAᵀ)
3.2.2.4 Shared Memory Use by Kernel Arguments
3.2.3 Local Memory
3.2.4 Texture Memory
3.2.4.1 Textured Fetch vs. Global Memory Read
3.2.4.2 Additional Texture Capabilities
3.2.5 Constant Memory
3.2.6 Registers
3.2.6.1 Register Pressure
3.3 Allocation
Chapter 4 Execution Configuration Optimizations
4.1 Occupancy
4.2 Calculating Occupancy
4.3 Hiding Register Dependencies
4.4 Thread and Block Heuristics
4.5 Effects of Shared Memory
Chapter 5 Instruction Optimizations
5.1 Arithmetic Instructions
5.1.1 Division and Modulo Operations
5.1.2 Reciprocal Square Root
5.1.3 Other Arithmetic Instructions
5.1.4 Math Libraries
5.2 Memory Instructions
Chapter 6 Control Flow
6.1 Branching and Divergence
6.2 Branch Predication
6.3 Loop Counters Signed vs. Unsigned
Chapter 7 Getting the Right Answer
7.1 Debugging
7.2 Numerical Accuracy and Precision
7.2.1 Single vs. Double Precision
7.2.2 Floating-Point Math Is Not Associative
7.2.3 Promotions to Doubles and Truncations to Floats
7.2.4 IEEE 754 Compliance
7.2.5 x86 80-bit Computations
Chapter 8 Multi-GPU Programming
8.1 Introduction to Multi-GPU
8.2 Multi-GPU Programming
8.3 Selecting a GPU
8.4 Inter-GPU Communication
8.5 Compiling Multi-GPU Applications
8.6 InfiniBand
Appendix A Recommendations and Best Practices
A.1 Overall Performance Optimization Strategies
A.2 High-Priority Recommendations
A.3 Medium-Priority Recommendations
A.4 Low-Priority Recommendations
Appendix B Useful NVCC Compiler Switches
B.1 NVCC
Appendix C Revision History
C.1 Version 3.0
C.2 Version 3.1
C.3 Version 3.2
CUDA_C_Best_Practices_Guide
Published by mike_in_england on Feb 27, 2011
Copyright: Attribution Non-commercial