CUDA C Best Practices Guide

Table of Contents

PREFACE
WHAT IS THIS DOCUMENT?
WHO SHOULD READ THIS GUIDE?
ASSESS, PARALLELIZE, OPTIMIZE, DEPLOY
Assess
Parallelize
Optimize
Deploy
RECOMMENDATIONS AND BEST PRACTICES
ASSESSING YOUR APPLICATION
Chapter 1 HETEROGENEOUS COMPUTING
1.1 DIFFERENCES BETWEEN HOST AND DEVICE
1.2 WHAT RUNS ON A CUDA-ENABLED DEVICE?
Chapter 2 APPLICATION PROFILING
2.1 PROFILE
2.1.1 Creating the Profile
2.1.2 Identifying Hotspots
2.1.3 Understanding Scaling
PARALLELIZING YOUR APPLICATION
Chapter 3 GETTING STARTED
3.1 PARALLEL LIBRARIES
3.2 PARALLELIZING COMPILERS
3.3 CODING TO EXPOSE PARALLELISM
Chapter 4 GETTING THE RIGHT ANSWER
4.1 VERIFICATION
4.1.1 Reference Comparison
4.1.2 Unit Testing
4.2 DEBUGGING
4.3 NUMERICAL ACCURACY AND PRECISION
4.3.1 Single vs. Double Precision
4.3.2 Floating-Point Math Is Not Associative
4.3.3 Promotions to Doubles and Truncations to Floats
4.3.4 IEEE 754 Compliance
4.3.5 x86 80-bit Computations
OPTIMIZING CUDA APPLICATIONS
Chapter 5 PERFORMANCE METRICS
5.1 TIMING
5.1.1 Using CPU Timers
5.1.2 Using CUDA GPU Timers
5.2 BANDWIDTH
5.2.1 Theoretical Bandwidth Calculation
5.2.2 Effective Bandwidth Calculation
5.2.3 Throughput Reported by Visual Profiler
Chapter 6 MEMORY OPTIMIZATIONS
6.1 DATA TRANSFER BETWEEN HOST AND DEVICE
6.1.1 Pinned Memory
6.1.2 Asynchronous Transfers and Overlapping Transfers with Computation
6.1.3 Zero Copy
6.1.4 Unified Virtual Addressing
6.2 DEVICE MEMORY SPACES
6.2.1 Coalesced Access to Global Memory
6.2.2 Shared Memory
6.2.3 Local Memory
6.2.4 Texture Memory
6.2.5 Constant Memory
6.2.6 Registers
Chapter 7 EXECUTION CONFIGURATION OPTIMIZATIONS
7.1 OCCUPANCY
7.2 HIDING REGISTER DEPENDENCIES
7.3 THREAD AND BLOCK HEURISTICS
7.4 EFFECTS OF SHARED MEMORY
Chapter 8 INSTRUCTION OPTIMIZATIONS
8.1 ARITHMETIC INSTRUCTIONS
8.1.1 Division and Modulo Operations
8.1.2 Reciprocal Square Root
8.1.3 Other Arithmetic Instructions
8.1.4 Math Libraries
8.1.5 Precision-related Compiler Flags
8.2 MEMORY INSTRUCTIONS
Chapter 9 CONTROL FLOW
9.1 BRANCHING AND DIVERGENCE
9.2 BRANCH PREDICATION
9.3 LOOP COUNTERS SIGNED VS. UNSIGNED
DEPLOYING CUDA APPLICATIONS
Chapter 10 UNDERSTANDING THE PROGRAMMING ENVIRONMENT
10.1 CUDA COMPUTE CAPABILITY
10.2 ADDITIONAL HARDWARE DATA
10.3 CUDA RUNTIME AND DRIVER API VERSION
10.4 WHICH COMPUTE CAPABILITY TO TARGET
10.5 CUDA RUNTIME
Chapter 11 PREPARING THE APPLICATION FOR DEPLOYMENT
11.1 ERROR HANDLING
11.2 DISTRIBUTING THE CUDA RUNTIME AND LIBRARIES
Chapter 12 DEPLOYMENT INFRASTRUCTURE TOOLS
12.1 NVIDIA-SMI
12.2 NVML
12.3 CLUSTER MANAGEMENT TOOLS
12.4 COMPILER JIT CACHE MANAGEMENT
Appendix A RECOMMENDATIONS AND BEST PRACTICES
A.1 Overall Performance Optimization Strategies
Appendix B NVCC COMPILER SWITCHES
B.1 NVCC
Appendix C REVISION HISTORY
C.1 Version 4.1