GeForce 8 and 9 Series GPU Programming Guide
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND
OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA
MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY,
AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no
responsibility for the consequences of use of such information or for any infringement of patents or other rights of
third parties that may result from its use. No license is granted by implication or otherwise under any patent or
patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without
notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products
are not authorized for use as critical components in life support devices or systems without express written
approval of NVIDIA Corporation.
Trademarks
NVIDIA, the NVIDIA logo, GeForce, and NVIDIA Quadro are registered trademarks of NVIDIA Corporation. Other
company and product names may be trademarks of the respective companies with which they are associated.
Copyright
© 2008 by NVIDIA Corporation. All rights reserved.
Chapter 1.
About This Document
1.1. Introduction
This guide will help you to get the highest graphics performance out of your
application, graphics API, and graphics processing unit (GPU). Understanding
the information in this guide will help you to write better graphical applications.
This document focuses specifically on the GeForce 8 and 9 Series GPUs;
however, many of the concepts and techniques apply to graphics
programming in general. For specific advice and programming guidance for earlier
GPUs such as the GeForce 6 and 7 Series, please consult the earlier “GPU
Programming Guide: GeForce 6 & 7 Series (and earlier)”.
For information about all things graphics and GPU related,
please check http://developer.nvidia.com/page/documentation.html
This section reviews the typical steps to find and remove performance
bottlenecks in a graphics application.
Also refer to http://developer.nvidia.com/object/practical_perf_analysis.html
for more information on performance analysis.
How to Optimize Your Application
Disable vertical sync. This ensures that your frame rate is not limited by
your monitor’s refresh rate. This can be overridden in the NVIDIA Control
Panel’s 3D settings: look for “Vertical Sync” in the “Manage 3D settings”
section and switch it to “Force OFF”.
Run on the target hardware. If you’re trying to find out if a particular
hardware configuration will perform sufficiently, make sure you’re running
on the correct CPU, GPU, and with the right amount of memory on the
system. Bottlenecks can change significantly as you move from a low-end
system to a high-end system.
Confirm system power/performance settings. On some systems,
especially laptops and notebooks, the power settings may artificially lower
CPU performance to preserve battery life. Make sure the CPU is running at
100%.
Favor QueryPerformanceCounter() over RDTSC. When manually
timing sections of code, it is important to have reliable data; RDTSC can
give misleading results on systems with multiple cores or dynamic
frequency scaling.
Benchmark/Time with averages. Never time a single run or a single
function execution. Computers are complex, multi-threaded, pipelined
architectures, and the timings of individual operations vary wildly during
execution. The best method is to execute a test a number of times and
average the results.
Your ultimate goal is higher frames per second. Measuring how a given
change or optimization affects the frame rate of your application, averaged
over a handful of frames (~1 second), is probably the best way to approach
optimization.
Chapter 3.
General GPU Performance Tips
This chapter presents the top performance tips that will help you achieve
optimal performance on GeForce Series GPUs. For your convenience, the tips
are organized by pipeline stage. Within each subsection, the tips are roughly
ordered by importance, so you know where to concentrate your efforts first.
A great place to get an overview of modern GPU pipeline performance is the
Graphics Pipeline Performance chapter of the book GPU Gems: Programming
Techniques, Tips, and Tricks for Real-Time Graphics. The chapter covers bottleneck
identification as well as how to address potential performance problems in all
parts of the graphics pipeline.
“Graphics Pipeline Performance” is freely available at
http://developer.nvidia.com/object/gpu_gems_samples.html.
Investigate ways that you can reduce the number of batches and API
calls in your engine.
See the DirectX 10 Considerations chapter for some information about instancing in DirectX 10.
You can also use our own NVTriStrip utility to create optimized cache-friendly
meshes. NVTriStrip is a standalone program that is available at
http://developer.nvidia.com/object/nvtristrip_library.html.
3.4. Shaders
High-level shading languages provide a powerful and flexible mechanism that
makes writing shaders easy. Unfortunately, this also means that writing slow shaders
is easier than ever. If you’re not careful, you can end up with a proliferation
of slow shaders that drags your application’s performance down.
GeForce 8 Series and later NVIDIA GPUs use a unified shader
architecture: all shader types use the same hardware to execute
instructions. The driver dynamically balances how much hardware is allocated to
vertex, geometry, and pixel work, allowing the GPU to adjust itself to execute
differing shader loads more efficiently. Pre-G80 GPUs still need vertex and
pixel workloads balanced manually.
This doesn’t mean you should not take care to execute the appropriate
operations in the appropriate shader!
The following tips will help you avoid writing inefficient shaders for simple
effects. In addition, you’ll learn how to take full advantage of the GPU’s
computational power. Applied correctly, these techniques can achieve an
instructions-per-clock rate many times higher than a naïve implementation.
10 for all shaders and not worry about trying to fit your code into older limits.
But keep in mind that it is still important to make sure your shader code is as
optimal as it can be.
For instance, the result of any normalize can be half-precision, as can colors.
Positions can be half-precision as well, but they may need to be scaled in the
vertex shader to make the relevant values near zero.
For example, moving values to local tangent space and then scaling positions
down can eliminate the banding artifacts seen when very large positions are
converted to half precision.
Written this way, the reflection vector can be computed independent of the
length of the normal or incident vectors. However, shader authors frequently
want at least the normal vector normalized in order to perform lighting
calculations. If this is the case, then a dot product, a reciprocal, and a scalar
multiply can be removed from reflect(). Optimizations like these can
dramatically improve performance.
note that newer hardware such as GeForce 8 and 9 Series GPUs will allow more
complex pixel shaders before becoming shader-bound.
Please read the next section 3.4.10 before pushing operations to the VS as this
may actually REDUCE performance.
One more thing to check when you find Pixel Shader bottlenecks is how much
of the pixel shading work was wasted effort because some of the shaded pixels
were later occluded by other objects. PerfHUD can help you determine this to
some degree. If you conclude this is a real problem you should consider some
of the following:
1. Perform a depth prepass of the scene, taking advantage of double speed Z
(Section 3.6.1).
2. Ensure you are getting the most out of coarse and fine-grained Z cull
optimization. See section 4.8 for more details.
NOTE: although alpha-tested geometry (or geometry rendered with
shaders that use texkill, discard, or clip) will not get double-speed Z, in
some cases, such as when costly pixel shaders are used, it may still be
worth rendering this geometry in a depth-only prepass of the scene, since
the EarlyZ optimization will then guarantee that no shading is wasted in the
final shading pass.
3. If not performing a depth prepass, render opaque objects front to back.
4. Use a deferred shading implementation.
3.5. Texturing
Using a 2D Texture
One common situation where a texture can be useful is in per-pixel lighting.
You can use a 2D texture that you index with (N dot L) on one axis and (N
dot H) on the other axis. At each (u, v) location, the texture would encode:
max(N dot L, 0) + Ks * pow((N dot L > 0) ? max(N dot H, 0) : 0, n)
This is the standard Blinn lighting model, including clamping for the diffuse and
specular terms.
Using a 3D Texture
You can also add the specular exponentiation to the mix by using a 3D texture.
The first two axes use the 2D texture technique described in the previous
section, and the third axis encodes the specular exponent (shininess).
Remember, however, that cache performance may suffer if the texture is too
large. You may want to encode only the most frequently used exponents and
keep the size very small.
3.6. Rasterization
The following ideas for enhancing performance require some work to adjust
the architecture of your graphics engine. While they require more
up-front work, they can often provide a much larger performance benefit than
simply optimizing a shader.
3.7. Antialiasing
All GeForce Series GPUs have powerful antialiasing engines. They perform
best with antialiasing enabled, so we recommend that you enable your
applications for antialiasing.
If you need to use techniques that don’t work with antialiasing, contact us—
we’re happy to discuss the problem with you and to help you find solutions.
With DirectX 9, the StretchRect() call can copy the back buffer to an off-
screen texture in concert with multisampling. This allows applications to make
use of multi-sampling with post-processing.
With DirectX 10, you can use ResolveSubresource() to copy a MSAA
resource to a non-MSAA resource.
For instance, if 4x multisampling is enabled, on a 100 × 100 back buffer, the
driver actually internally creates a 200 × 200 back buffer and depth buffer in
order to perform the antialiasing. If the application creates a 100 × 100 off-
screen texture, it can ResolveSubresource() the entire back buffer to
the off-screen surface, and the GPU will filter down the antialiased buffer into
the off-screen buffer.
Then glows and other post-processing effects can be performed on the 100 ×
100 texture, and then applied back to the main back buffer.
This resolution mismatch between the real back buffer size (200 × 200) and the
application’s view of it (100 × 100) is the reason why you can’t attach a
multisampled z buffer to a non-multisampled render target.
Chapter 4.
GeForce 8 and 9 Series
Programming Tips
This chapter presents several useful tips that help you fully use the capabilities
of the GeForce 8 and 9 Series and GeForce GTX 200 Series, as well as
G8x-based (and later) Quadro GPUs. These are mostly feature oriented, though
some may affect performance as well. This chapter focuses on Microsoft’s
DirectX API; however, many concepts can be directly translated to OpenGL as well.
Feature        DirectX 9      DirectX 10                            Benefit
Shader length  65535+         Unlimited                             Allows more complex shading, lighting, and procedural materials
Constants      Register file  Constant buffers and texture buffers  Allows more general use of shaders without the input/output register restrictions of previous models
Be aware that using SV_ semantics adds a fixed overhead of 8 scalar
attributes to the shader using them. So if you are attribute bound
and are trying to use a system value semantic to reduce your attribute count,
make sure you can reduce it by more than 8 scalar attributes or it will not make
much of a difference.
become the bottleneck and slow down rendering by starving the rest of the
GPU pipeline.
4.4.4.2. Suggestions
1. First check for and remove unused attributes
This is the easiest fix, and unused attributes are easy to overlook. Sometimes a
programmer will allocate a float4 while developing and never use the data in the
w component. This is an extra attribute and may increase setup time. You can
easily check the assembly for these (unused attributes are removed from the
input signature) and use that space for other valuable attributes.
2. Try to perform logical grouping of your attributes to reduce the total
number of vector attributes used
Logical grouping means combining a number of separate attributes into a single
attribute, up to a float4, all components of which will be used in the same passes.
For example, if you are using a pair of texture coordinates, it is better to pack
them into a single float4 than to use two separate float2 attributes. Vertex
position almost always requires just a single float3 value; if you can logically
combine it with a separate float attribute, do so.
3. Recalculate attributes
You do not need to pass down all the data you will be using. In many cases
there are dependencies in the data such that you can pass down a subset and
recalculate the rest. For example, you do not need to pass a normal, binormal
and tangent to the vertex shader. Passing down the normal and tangent alone
and recalculating the binormal using a cross product will result in more shader
operations but a smaller vertex size. Depending on your bottleneck this could
be a significant speedup.
Because the output of each thread must preserve the order in which it is
generated, the GPU must support buffering the output data in this correct
order for a number of threads running in parallel.
GeForce 8, 9, and GTX 200 Series GPUs have limited output buffer sizes, which
are at least sufficient to support the maximum output size allowed in Direct3D 10
(1024 32-bit scalars). The output buffer size may vary by GPU.
The performance of a GS is inversely proportional to the output size (in scalars)
declared in the geometry shader, which is the product of the vertex size and
the number of vertices (maxvertexcount). This performance degradation,
however, occurs at particular output-size thresholds rather than smoothly.
For example, on a GeForce 8800 GTX a GS that outputs at most 1 to 20
scalars in total would run at peak performance, whereas a GS would run at 50%
of peak performance if the total maximum output was declared to be between
27 and 40 scalars.
More concretely, if your vertex declaration is:
float3 pos: POSITION;
float2 tex: TEXCOORD;
Each vertex is 5 scalar attributes in size.
If the GS defines a maxvertexcount of 4 vertices, then it will run at full speed
(20 scalar attributes).
But if you increase the vertex size by adding a float3 normal, the number of
scalars will increase by 4 * 3, putting your total at 32 scalar attributes. This will
result in the GS running at 50% peak performance.
Thus, it is important to understand that the main use of the geometry shader is
NOT for doing heavy output algorithms such as tessellation.
In addition, because a GS runs on primitives, per-vertex operations will be
duplicated for all primitives that share a vertex. This is potentially a waste of
processing power.
A geometry shader is most useful when performing operations on small vertices
or primitive data that require outputting only small amounts of new data. In
general, though, the potential for wasted work and the performance penalties of
using a GS make it an often unused feature of Shader Model 4.
An animated character is typically rendered at least twice: once for the shadow
map and once for the main render pass. In some deferred rendering engines, or if
there is more than one shadow-casting light, it may be rendered even more
times. By pre-computing the posed character once each frame, the engine can
save the cost of re-evaluating the animation palette and at least one matrix
multiply per vertex.
A sample render flow is below:
1. Render the animated character mesh using a simple vertex shader.
2. Using stream out, bypass rasterization and save the posed mesh to memory.
3. Render shadow map generation, sourcing the posed mesh.
• The vertex shader does not do skinning operations.
4. Render the main pass, sourcing the posed mesh.
• The vertex shader does not do skinning operations.
Coarse Z/Stencil culling (also known as ZCULL) will not be able to cull any
pixels in the following cases:
1. If you don’t use Clears (instead of fullscreen quads that write depth) to
clear the depth-stencil buffer.
2. If the pixel shader writes depth.
3. If you change the direction of the depth test while writing depth.
ZCULL will not cull any pixels until the next depth buffer Clear.
4. If stencil writes are enabled while doing stencil testing (no stencil
culling)
5. On GeForce 8 series, if the DepthStencilView has
Texture2D[MS]Array dimension
Also note that ZCULL will perform less efficiently in the following
circumstances
1. If the depth buffer was written using a different depth test direction
than that used for testing
2. If the depth of the scene contains a lot of high frequency information
(i.e.: the depth varies a lot within a few pixels)
3. If you allocate too many large depth buffers.
4. If using DXGI_FORMAT_D32_FLOAT format
Similarly, fine-grained Z/Stencil culling (also known as EarlyZ) is disabled in
the following cases:
1. If the pixel shader outputs depth
2. If the pixel shader uses the .z component of an input attribute with the
SV_Position semantic (only on GeForce 8 series in D3D10)
3. If Depth or Stencil writes are enabled, or Occlusion Queries are
enabled, and one of the following is true:
• Alpha-test is enabled
• Pixel shader kills pixels (clip(), texkill, discard)
• Alpha To Coverage is enabled
• SampleMask is not 0xFFFFFFFF (SampleMask is set in
D3D10 using OMSetBlendState and in D3D9 setting the
D3DRS_MULTISAMPLEMASK renderstate)
With the introduction of DirectX 10, we have seen a significant change in the
performance characteristics of 3D applications. Simply converting your
DirectX 9 application to DirectX 10 will not necessarily result in a speedup;
in fact, in many cases a slowdown is observed. This is often due to a naïve
porting effort. It is essential to understand the important differences
between DirectX 9 and DirectX 10 in order to transition a codebase properly.
In DirectX 10, Microsoft moved validation of API call parameters from
run time to creation time. This means that, generally, runtime calls to DirectX 10
have less CPU overhead than DirectX 9. This translates into an increase in the
number of draw calls you can issue per frame while still maintaining a high frame
rate.
However, naively porting your DirectX 9 code to DirectX 10 can induce
slowdowns. In particular, the most common source of a performance
slowdown when porting is improper use of shader constants and
constant buffers. Another area to focus on when porting is
resource updates. Refer to the specific subsections in this chapter for
information about these concepts.
In addition, you can gain further performance benefits by using new
DirectX 10 features such as instancing, texture arrays, and stream out.
Refer to Microsoft’s “Direct3D 9 to Direct3D 10 considerations” for more information.
DirectX 10 Considerations
cbuffer PerFrameConstants
{
float4x4 mView;
float fTime;
float3 fWindForce;
// etc.
};
All constants in PerFrameConstants change only once per frame; they are constant
between meshes and draw calls, and thus the buffer needs to be uploaded
only once per frame.
cbuffer SkinningMatricesConstants
{
float4x4 mSkin[64];
};
In contrast, SkinningMatricesConstants is constant per mesh and will only
need to be re-uploaded when the mesh changes. Since many meshes have more
than one draw call, and a matrix palette for animation is very large, separating
this into its own buffer will often save a lot of bandwidth.
In general, the graphics engineer will need to strike a good balance between:
• The amount of constant data to upload
• The number of calls required to do it (i.e., the number of cbuffers)
Don’t go overboard: 3-5 constant buffers are enough.
Another key concept to remember is that it is entirely possible to share constant
buffers between pixel and vertex shaders in the same call.
Example:
• PerFrameGlobal (time, per-light properties)
• PerView (main camera xforms, shadowmap xforms)
• PerObjectStatic (world matrix, static light indices)
• PerObjectDynamic (skinning matrices, dynamic lightIDs)
{
    float4 diffuse = tex2D0.Sample(mipmapSampler, in.Tex0);
    float ndotl = dot(in.Normal, vLightVector.xyz);
    return ndotl * vLightColor * diffuse;
}
This illustrates how you should order constants within a block. Because
vLightVector and vLightColor are accessed in order, placing a large block of
unused data between them inside a constant block will result in a cache miss
when loading vLightColor. Placing them together in the block makes it more
likely that the cache line loaded when accessing vLightVector will also include
vLightColor.
5.1.3.1. D3D10_SHADER_ENABLE_BACKWARDS_COMPATIBILITY
When compiling SM3 shaders to SM4+ and using the
D3D10_SHADER_ENABLE_BACKWARDS_COMPATIBILITY flag, you
can group constants into constant blocks, with a preprocessor conditional to
declare cbuffers. This allows you to share shaders between DirectX 9 and
DirectX 10 without sacrificing constant buffer performance.
For example:
#ifdef DX10
cbuffer MyBuffer {
#endif
tbuffers have larger latency for data loading, but if you can mask that latency
with shader operations then they will be as fast as cbuffers.
5.2.2.2. Buffers
When updating buffers in general you have two options.
1. Map(D3D10_MAP_WRITE_DISCARD,…);
2. UpdateSubResource();
In general, for buffers either one works well. With Map the CPU can avoid
updating data in the buffer that the shader doesn’t care about. However, the
entire buffer will need to be copied to the GPU after being updated.
For very large dynamic Vertex/Index buffers
This is a special case for very large VB/IBs. The recommendation in this case is
to use a large shared ring buffer type construct. This would be essentially
placing the separate buffers back to back in a large buffer and adjusting an
offset. Then,
- Map(D3D10_MAP_WRITE_DISCARD, …) when the buffer is full or when
initializing it for the first time.
This chapter covers general advice about programming GPUs that can be
leveraged across multiple GPU families.
General Advice
Device IDs are also not a substitute for caps (DirectX 9) and extension
strings (OpenGL). Caps have changed, and do change, over time for various reasons.
Mostly, caps are turned on over time, but caps also get turned off, due to
specs being tightened up and clarified, or simply the difficulty or cost of
maintaining certain driver or hardware capabilities.
One benefit of using DirectX 10 is that Microsoft has removed the concept of
caps bits and a GPU that supports DirectX 10 must support all features of
DirectX 10.
Render target and texture formats also have been turned off from time to time,
so be sure to check for support.
For an up-to-date list of Device IDs, refer to
http://developer.nvidia.com/object/device_ids.html
If you are having problems with Device IDs, please contact our Developer
Relations group at devrelfeedback@nvidia.com.
The Depth Bounds Test is available on all GPUs from the GeForce 6 Series and
later.
Depth Bounds Test (DBT) allows the programmer to enable an additional
test that can discard a pixel before blending to the render target.
The extension adds a new per-fragment test that logically takes place after the
scissor test and before the alpha test. The depth bounds test compares the depth
value stored at the location given by the incoming fragment's (xw, yw) coordinates
to a user-defined minimum and maximum depth value. If the stored depth value is
outside the user-defined range, the incoming fragment is discarded.
Unlike the depth test, the depth bounds test has NO dependency on the
fragment's window-space depth value.
To disable it, simply change the fourCC code for the adaptiveness.
dev->SetRenderState(D3DRS_ADAPTIVETESS_X,0);
6.4.1.1. Usage
Below is a code example showing how to check for hardware shadow map
support and create an appropriate NULL render target.
m_zFormat = D3DFMT_D24X8;
m_colorFormat = D3DFMT_A8R8G8B8;
m_nullFormat = (D3DFORMAT)MAKEFOURCC('N','U','L','L');

// Check for hardware shadow map support
if (FAILED(CheckResourceFormatSupport(m_pd3dDevice, m_zFormat,
                                      D3DRTYPE_TEXTURE, D3DUSAGE_DEPTHSTENCIL)))
{
    MessageBox(NULL, _T("Device/driver does not support hardware shadow maps!"),
               _T("ERROR"), MB_OK | MB_SETFOREGROUND | MB_TOPMOST);
    return E_FAIL;
}

// Check for NULL render target support
if (FAILED(CheckResourceFormatSupport(m_pd3dDevice, m_nullFormat,
                                      D3DRTYPE_SURFACE, D3DUSAGE_RENDERTARGET)))
{
    MessageBox(NULL, _T("Device/driver does not support hardware shadow maps with NULL colorbuffers!"),
               _T("ERROR"), MB_OK | MB_SETFOREGROUND | MB_TOPMOST);
    return E_FAIL;
}

if (FAILED(m_pd3dDevice->CreateRenderTarget(TEXDEPTH_WIDTH, TEXDEPTH_HEIGHT,
                                            m_nullFormat, D3DMULTISAMPLE_NONE,
                                            (DWORD)0, FALSE,
                                            &m_pSMColorSurface, NULL)))
    return E_FAIL;

m_pSMColorTexture = NULL;
6.4.2.1. Usage
"No z-compare" z-buffers are exposed as a separate FOURCC format ('RAWZ'
on NV4x and 'INTZ' on G8x).
To create such a z-buffer, just do:
1. On GeForce 6/7 Series use: (D3DFORMAT)MAKEFOURCC('R','A','W','Z')
2. On GeForce 8 Series use: (D3DFORMAT)MAKEFOURCC('I','N','T','Z')

m_pd3dDevice->CreateTexture(TEXDEPTH_WIDTH, TEXDEPTH_HEIGHT, 1,
                            D3DUSAGE_DEPTHSTENCIL,
                            (D3DFORMAT)MAKEFOURCC('I','N','T','Z'),
                            D3DPOOL_DEFAULT, &m_pSMZTexture, NULL);
Chapter 7.
Performance Tools Overview
This section describes several of our tools that will help you identify and remedy
performance bottlenecks, as well as assist you in content creation. Our tools are
constantly being updated, and new tools are being developed. For up-to-date
information, as well as the latest releases and usage information, please visit
http://developer.nvidia.com/page/tools.html
This chapter gives an overview of some of the most used tools.
7.1. PerfKit
http://developer.nvidia.com/PerfKit
NVIDIA PerfKit is a comprehensive suite of performance tools to help debug
and profile OpenGL and Direct3D applications. It gives you access to low-level
performance counters inside the driver and hardware counters inside the GPU
itself. The counters can be used to determine exactly how your application is
using the GPU, identify performance issues, and confirm that performance
problems have been resolved.
NVIDIA PerfKit includes support for 32-bit and 64-bit Windows XP and
Vista platforms.
The performance counters are available directly in your OpenGL and DirectX
applications and in tools such as Intel® VTune™ for Windows and Graphic
Remedy’s gDEBugger via the Windows Management Instrumentation (WMI)
Performance Data Helper (PDH) interface. A plug-in supporting Microsoft PIX
is also available.
7.1.1. PerfHUD
http://developer.nvidia.com/PerfHUD
NVIDIA PerfHUD is a powerful real-time performance analysis tool for
Direct3D applications. PerfHUD is widely used by the world’s best game
developers and was a 2007 Game Developer Magazine Frontline Award Finalist.
Check out screenshots of PerfHUD running in some of today’s most successful
games, testimonials by game developers all over the world, or reviews of
PerfHUD 5 from iXBT and Beyond3D.
Recent PerfHUD features include:
• GeForce GTX 200 Series GPU Support
• A robust input layer for intercepting the mouse and keyboard
• Works with standard drivers on Windows Vista
• SLI Support
• Texture Visualization and Overrides
• API Call List
• Dependency View
• CPU/GPU Timing graph
7.1.2. PerfSDK
http://developer.nvidia.com/object/nvperfsdk_home.html
NVPerfSDK is a component of NVPerfKit that provides a programmatic API
for accessing performance counters in the graphics driver and GPU. It allows
you to query performance counters from your own applications, enabling you to
integrate performance analysis directly into your application.
7.1.3. GLExpert
http://developer.nvidia.com/object/glexpert_home.html
GLExpert is a component of NVPerfKit that helps OpenGL developers by
identifying errors and performance issues.
GLExpert is controlled using the GLExpert tab of the NVIDIA Developer
Control panel. You can choose what level of debugging information to report,
as well as where the output is sent.
GLExpert provides the following features:
• Outputs to console/stdout or debugger
• Displays different groups and levels of information detail
• OpenGL errors: print as they arise
• Software Fallbacks: indicate when the driver is using
software emulation
• GPU Programs: print errors during compilation or linking
• VBOs: show where buffers reside along with mapping details
• FBOs: print reasons why a configuration is unsupported
You can expect even more functionality in the future as GLExpert's
feature list grows with driver advances.
To learn more about GLExpert, we recommend that you take a look at our
"GPU Performance Tuning with NVIDIA Performance Tools" talk from GDC
2006.
7.1.4. ShaderPerf
http://developer.nvidia.com/ShaderPerf
NVIDIA ShaderPerf is a command-line shader profiling utility and C API that
reports detailed shader performance metrics for a wide range of GPUs.
A graphical user interface (GUI) for shader performance analysis is available in
FX Composer 2.5, and was built using the ShaderPerf API.
7.2. Shader Debugger
The NVIDIA Shader Debugger is a full-featured pixel shader debugger that
allows shaders to be debugged just like CPU code. It is a plug-in for
FX Composer 2.5 that supports debugging of pixel shaders in several shading
languages.
7.3. FX Composer
http://developer.nvidia.com/FXComposer
FX Composer empowers developers to create high-performance shaders for
DirectX and OpenGL in an integrated development environment with unique
real-time preview and optimization features. FX Composer was designed with
the goal of making shader development easier.
Questions and Feedback
We would like to receive your feedback on our tools. Please visit our Developer
Forums at http://developer.nvidia.com/forums.