Optimizing DirectX* Graphic Applications using Software Vertex Processing

Ronen Zohar/Kim Pallister Intel Corporation



Meltdown 2001

Copyright 2001 Intel Corporation. *Other names and brands may be claimed as the property of others.

Agenda
Do I need SW vertex processing?
The PSGP
Using SW vertex processing for maximum performance: memory, batching and render states
SW vertex processing and the new features in DirectX* 8.0

Do I need SW vertex processing?


Your publisher wants:
  Eye-candy graphics, using all the latest 3D features
  Lower minimum system requirements
  And much more

Problem: older systems do not support all the eye-candy features
Solution 1: Disable features on low-end systems
Solution 2: Use SW vertex processing (at least for the features where you can) and keep some of the features


Inside DirectX* Graphics


Application
DirectX run-time:
  API front-end
  HW vertex processing path
  SW vertex processing (PSGP)
  Communication to the driver (DDI)
Driver



PSGP: Processor Specific Geometry Pipeline

The part of DirectX Graphics responsible for the SW vertex processing algorithms, optimized for the client's processor
The DirectX 8.0 PSGP is optimized for:
  Intel Pentium III processor
  Intel Pentium 4 processor


The PSGP
[Diagram: PSGP data flow from the VB and IB to the driver]
Fixed function path: Transformation, Lighting, Tex Gen, Format data to output FVF
Vertex shader path: Map stream to registers, Execute vertex shader code
Clipper
Internal temporary VBs

PSGP Principles
Use SIMD to process multiple vertices in each iteration
  Vertical processing
  Data is swizzled on the fly
Prefetch input streams to hide memory latency
Write output to temporary VBs, based on an XYZRHW FVF code
  In system memory if the transformed vertices need to be read back
  In driver memory if no read-back is required
  More on this later
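
As a rough illustration of "vertical processing" with on-the-fly swizzling, the sketch below gathers four interleaved positions into one array per component and then processes all four lanes together. This is not the PSGP's actual code; the type and function names (`Position`, `PositionX4`, `swizzle4`, `scale4`) and the SIMD width of 4 are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved ("horizontal") position, as an application typically stores it.
struct Position { float x, y, z; };

// Assumed SIMD width: four floats, as in one 128-bit register.
constexpr std::size_t kSimdWidth = 4;

// Four vertices in "vertical" form: one lane array per component.
struct PositionX4 {
    std::array<float, kSimdWidth> x, y, z;
};

// Swizzle kSimdWidth consecutive vertices, starting at `first`, into vertical form.
PositionX4 swizzle4(const std::vector<Position>& vb, std::size_t first) {
    assert(first + kSimdWidth <= vb.size());
    PositionX4 out{};
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        out.x[lane] = vb[first + lane].x;
        out.y[lane] = vb[first + lane].y;
        out.z[lane] = vb[first + lane].z;
    }
    return out;
}

// Vertical processing example: uniformly scale four vertices at once.
// Each component's lane loop corresponds to one packed multiply on SIMD hardware.
PositionX4 scale4(const PositionX4& in, float s) {
    PositionX4 out{};
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        out.x[lane] = in.x[lane] * s;
        out.y[lane] = in.y[lane] * s;
        out.z[lane] = in.z[lane] * s;
    }
    return out;
}
```

Once the data is in this layout, every arithmetic step works on the same component of four different vertices, so no lane ever needs data from a neighbouring lane.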

Input Stream Memory Allocation


Create SW-processed primitives in system memory (using the D3DUSAGE_SOFTWAREPROCESSING usage flag at creation)
If the same VB is processed both in SW and in HW:
  Try to avoid it
  If you must, create multiple copies: one in system memory and one in driver memory
If the primitive is never clipped, use the D3DUSAGE_DONOTCLIP usage flag



Primitive Batching
Batch all the SW-processed primitives together
SW processes the entire VB range that you submit; if multiple primitives share the same VB, squeeze the vertex range of each call down to the vertices it actually uses
As with HW, bigger primitives are always better (the PSGP has a long setup)
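
One way to "squeeze" the range is to scan the indices of a draw call and submit only the span between the smallest and largest index. The helper below is a minimal sketch with hypothetical names (`VertexRange`, `tightVertexRange`); in DirectX 8 the two results would plausibly be passed as the `MinIndex` and `NumVertices` arguments of `IDirect3DDevice8::DrawIndexedPrimitive`, but that wiring is left out here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Smallest vertex span that covers every index used by one draw call.
struct VertexRange {
    std::uint32_t minIndex;     // lowest vertex index referenced
    std::uint32_t numVertices;  // number of vertices from minIndex to the highest index
};

// Computes the tight range for a 16-bit index list (hypothetical helper).
VertexRange tightVertexRange(const std::vector<std::uint16_t>& indices) {
    assert(!indices.empty());
    const auto mm = std::minmax_element(indices.begin(), indices.end());
    const std::uint32_t lo = *mm.first;
    const std::uint32_t hi = *mm.second;
    return VertexRange{lo, hi - lo + 1};
}
```

In practice the range would be computed once, when the mesh is built, rather than per frame.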


Primitive Batching (Cont)


The PSGP batches the processed vertices before sending them to the HW (to reduce the HW's VB changes)
Primitives are batched as long as their output FVFs are equal:
  XYZ | NORMAL | TEX1 and XYZ | DIFFUSE | TEX1 have the same output FVF (XYZRHW | DIFFUSE | TEX1)
In SW mode, changing the VB FVF does not mean a slowdown (unlike HW)

Clipping Render-state
When clipping is enabled, the PSGP:
  Stores its output to a system memory buffer
    As it needs to read the vertices in order to clip
    The driver needs to copy it across the AGP
    When clipping is disabled, it writes to a driver-allocated buffer: no copy here!
  Calculates clip flags (out-codes) for each vertex
    More execution cycles per vertex
  Clips

Minimize the amount of clipping
  Use bounding boxes/spheres on your objects
  Don't forget to take the guard band into account

Clipping Render-state (Cont)


Pseudo-code to minimize clipping:
if (BB is outside the screen)
    Don't render the primitive
else if (BB is inside the guard band)
    Render with clipping off
else
    Render with clipping on

A typical game scene should have fewer than 10% of its primitives clipped
The biggest problem is front-plane clipping
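
The three-way test above can be sketched in C++ as follows. The sketch assumes the bounding box has already been projected to a screen-space rectangle and that the object lies entirely in front of the near plane, so front-plane clipping is not handled. `Rect`, `ClipDecision` and `classify` are hypothetical names; in a DirectX 8 application the guard-band rectangle would typically come from the `GuardBandLeft`/`Top`/`Right`/`Bottom` fields of `D3DCAPS8`, and the decision would drive the `D3DRS_CLIPPING` render state.

```cpp
#include <cassert>

// Screen-space rectangle with left <= right and top <= bottom.
struct Rect { float left, top, right, bottom; };

enum class ClipDecision { Skip, DrawClipOff, DrawClipOn };

// True if the two rectangles share any area.
bool overlaps(const Rect& a, const Rect& b) {
    return a.left < b.right && a.right > b.left &&
           a.top < b.bottom && a.bottom > b.top;
}

// True if `inner` lies completely inside `outer`.
bool contains(const Rect& outer, const Rect& inner) {
    return outer.left <= inner.left && outer.top <= inner.top &&
           inner.right <= outer.right && inner.bottom <= outer.bottom;
}

// Decide how to submit one object, given its projected bounding box.
ClipDecision classify(const Rect& bb, const Rect& screen, const Rect& guardBand) {
    assert(contains(guardBand, screen));   // the guard band surrounds the screen
    if (!overlaps(bb, screen)) {
        return ClipDecision::Skip;         // nothing visible: don't render
    }
    if (contains(guardBand, bb)) {
        return ClipDecision::DrawClipOff;  // the rasterizer can handle the overhang
    }
    return ClipDecision::DrawClipOn;       // the geometry really needs clipping
}
```

Only objects that straddle the guard-band boundary pay for out-code calculation and the extra copy through system memory.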

Performance Render-states
Specular: very expensive
LocalViewer: smaller performance impact than in HW, but still costs more
NormalizeNormals: extra work for the PSGP; use only when needed
Fog: written as specular alpha, so it can change the PSGP's output FVF

DirectX* 8.0 Graphics New Features


Point sprites
Tweening
Indexed vertex blending / indexed palette skinning
Vertex shaders

Point Sprites
The PSGP writes point sprites in their native FVF format
If the HW does not support point sprites:
  Each point is expanded to a quad, using the calculated point size
  The quad list is submitted to the driver
This is a very slow solution: if there is no HW support for point sprites, try to avoid them
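
To make the cost of the fallback concrete, here is a minimal sketch of expanding one screen-space point into a textured quad. The vertex layout, corner order and the names `ScreenVertex` and `expandPoint` are assumptions for illustration; the PSGP's actual output format is not described in these slides.

```cpp
#include <array>
#include <cassert>

// Hypothetical post-transform vertex: screen position plus one texture coordinate set.
struct ScreenVertex { float x, y, u, v; };

// Expand a point centred at (cx, cy) into the four corners of a size x size quad.
// Corner order: top-left, top-right, bottom-left, bottom-right.
std::array<ScreenVertex, 4> expandPoint(float cx, float cy, float size) {
    assert(size >= 0.0f);
    const float h = 0.5f * size;  // half the point size
    return {{
        {cx - h, cy - h, 0.0f, 0.0f},
        {cx + h, cy - h, 1.0f, 0.0f},
        {cx - h, cy + h, 0.0f, 1.0f},
        {cx + h, cy + h, 1.0f, 1.0f},
    }};
}
```

Every input point turns into four output vertices, which is one reason the emulated path is so much slower than native point-sprite support.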



Tweening
The position and normal are tweened before transformation (in SIMD)
After tweening, the standard PSGP flow continues
Costs very few cycles
But for tweening and transformation only, a vertex shader would run faster
  Try comparing your exact scenario against a vertex shader
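
For reference, tweening is a per-component linear blend of two key-frame values. The sketch below shows the arithmetic for a single vector; `Vec3` and `tween` are hypothetical names, and the weighting convention (the first key frame weighted by 1 - t, the second by t) is an assumption.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Blend two key-frame values: t = 0 returns `a`, t = 1 returns `b`.
Vec3 tween(const Vec3& a, const Vec3& b, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    const float wa = 1.0f - t;  // weight of the first key frame
    return Vec3{a.x * wa + b.x * t,
                a.y * wa + b.y * t,
                a.z * wa + b.z * t};
}
```

The same blend is applied to the normal, after which the result is transformed and lit like any other vertex.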

Indexed Skinning
Transforms all vertices to matrix0 space
  Using scalar code, with a lookup for the needed matrix
Then the normal PSGP flow continues
DirectX* 7 style skinning is supported by some HW and may run faster, but requires multiple models and DrawPrimitive calls

Vertex shaders
At vertex shader creation:
  The shader code is compiled to equivalent IA-32 code
  Using all the assembly optimizations and instructions available on the client's CPU to achieve the fastest code
At vertex shader execution:
  The generated code is called
SW vertex shaders have excellent performance



SW Vs. HW Vertex Shaders


SW calculates more than one vertex in a single iteration
  Based on the processor's SIMD width
Not every shader instruction takes 1 clock
  But the CPU runs at a much higher frequency than today's 3D graphics chips


SW Vs. HW Vertex Shaders (Cont)

Simple compilation sample:

  mul r0.xyz, v0, c0

compiles to:

  movaps xmm0, [v0.x]
  mulps  xmm0, [c0.x]
  movaps xmm1, [v0.y]
  mulps  xmm1, [c0.y]
  movaps xmm2, [v0.z]
  mulps  xmm2, [c0.z]
  movaps [r0.x], xmm0
  movaps [r0.y], xmm1
  movaps [r0.z], xmm2
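
The compilation sample can be mimicked in portable C++ to show what each SSE instruction operates on: every shader register is held as one four-lane array per component (four vertices at a time), and the constant is assumed to be pre-replicated across the lanes. `Lane4`, `RegX4`, `splat`, `mulps` and `mul_xyz` are illustrative names, not part of DirectX.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Lane4 = std::array<float, 4>;    // stands in for one XMM register

// One shader register for four vertices: a lane array per component.
struct RegX4 { Lane4 x, y, z, w; };

// Replicate one scalar across all four lanes (how a constant would be stored "x4").
Lane4 splat(float v) {
    Lane4 r{};
    r.fill(v);
    return r;
}

// Packed multiply: the portable equivalent of one `mulps`.
Lane4 mulps(const Lane4& a, const Lane4& b) {
    Lane4 r{};
    for (std::size_t lane = 0; lane < r.size(); ++lane) {
        r[lane] = a[lane] * b[lane];
    }
    return r;
}

// mul r0.xyz, v0, c0 for four vertices at once.
// The .xyz write mask means r0.w is neither computed nor stored.
void mul_xyz(RegX4& r0, const RegX4& v0, const RegX4& c0) {
    r0.x = mulps(v0.x, c0.x);
    r0.y = mulps(v0.y, c0.y);
    r0.z = mulps(v0.z, c0.z);
}
```

Three packed multiplies produce the xyz results for four vertices, which is where the write mask pays off: a full `mul r0, v0, c0` would need a fourth load, multiply and store.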

SW Vs. HW Vertex Shaders (Cont)

Data that you write is data that the CPU has to calculate
  Write only the data you need (using the vertex shader write mask)
  Use the swizzle modifiers, and don't duplicate written data
Vertex shader instructions are blended to achieve maximum performance
  But keeping dependency chains tight will help the compiler with physical register assignment

Performance Tips for SW Vertex Shaders


The m?x? macros perform better than the equivalent hand-expanded dp3/dp4 sequences
Try to minimize the use of the address register
  Due to the parallelism of the SW vertex shader
  Sort the VB by the values used in the address register

Performance Tips for SW Vertex Shaders


lit, expp and logp are big cycle consumers
  Use the lower-accuracy form (e.g. expp.x) when possible
  Use either .x or .z (but not both)
  exp and log are worse than expp and logp
Don't explicitly saturate color values
  It is done automatically


Optimized Vertex Shader


Original:

  dp4 oPos.x, v0, c2
  dp4 oPos.y, v0, c3
  dp4 oPos.z, v0, c4
  dp4 oPos.w, v0, c5
  add r1, c6, -v0
  dp3 r2, r1, r1
  rsq r2, r2
  mov oT0, v2
  mul r1, r1, r2
  dp3 r3, v1, r1
  max r3, r3, c8
  add r3, r3, c7
  min oD0, r3, c9

Optimized:

  m4x4 oPos, v0, c[2]
  add r1.xyz, c6, -v0
  dp3 r2.w, r1, r1
  rsq r2.w, r2.w
  mul r1.xyz, r1, r2.w
  dp3 r2.w, v1, r1
  max r2.w, r2.w, c8
  add oD0.xyz, r2.w, c7
  mov oT0, v2


Questions?
Ronen.Zohar@intel.com
Kim.Pallister@intel.com
Intel, Pentium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright 2001 Intel Corp.


Backup


Tweening + transformation vertex shader


  mul  r0.xyz, v0, c0.x        // c0.x = tween weight
  mad  r0.xyz, v1, c0.y, r0    // c0.y = (1 - tween weight)
  m4x4 oPos, r0, c1
  mov  oD[0].xyz, v2
  mov  oT[0].xy, v3


Not Equal Address Value


Address register (x4): 1 2 1 2
Const register file (x4):
  1.0f 1.0f 1.0f 1.0f
  2.0f 2.0f 2.0f 2.0f
  3.0f 3.0f 3.0f 3.0f
A combination register has to be re-arranged for the SIMD instruction to use (costs ~20 cycles)
Instruction argument: 1.0f 2.0f 1.0f 2.0f



Equal Address Value


Address register (x4): 2 2 2 2
Const register file (x4):
  1.0f 1.0f 1.0f 1.0f
  2.0f 2.0f 2.0f 2.0f
  3.0f 3.0f 3.0f 3.0f
Instruction argument: 2.0f 2.0f 2.0f 2.0f
The x4 constant register file is accessed directly; there is no penalty for re-arranging vertices
The address access mode is selected when the address value is stored
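
The two address-register cases can be sketched as one lookup routine with a fast and a slow path. This is only an illustration of the idea: the names (`Lane4`, `Operand`, `fetchIndexed`) are hypothetical, the constant file is modelled as one replicated row per constant, and the check is done at fetch time for simplicity, whereas the slide states that the real access mode is chosen when the address value is stored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Lane4 = std::array<float, 4>;  // one SIMD register: one lane per vertex

// The fetched SIMD operand, plus whether the slow re-arranging path was taken.
struct Operand {
    Lane4 value;
    bool rearranged;
};

// constFileX4[i] holds one component of constant i, replicated across four lanes.
// a0 holds the address-register value of each of the four vertices in flight.
Operand fetchIndexed(const std::vector<Lane4>& constFileX4, const std::array<int, 4>& a0) {
    for (int a : a0) {
        assert(a >= 0 && static_cast<std::size_t>(a) < constFileX4.size());
    }
    const bool allEqual = a0[0] == a0[1] && a0[1] == a0[2] && a0[2] == a0[3];
    if (allEqual) {
        // Fast path: use the replicated row directly.
        return Operand{constFileX4[static_cast<std::size_t>(a0[0])], false};
    }
    // Slow path: build a combination register, one lane from each vertex's row.
    Lane4 combined{};
    for (std::size_t lane = 0; lane < combined.size(); ++lane) {
        combined[lane] = constFileX4[static_cast<std::size_t>(a0[lane])][lane];
    }
    return Operand{combined, true};
}
```

Sorting the VB by the value that ends up in the address register makes the fast path the common case, since neighbouring vertices then tend to share the same index.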

