
GPU Zen 2

Advanced Rendering Techniques

Edited by Wolfgang Engel

Black Cat Publishing Inc.
Encinitas, CA

Editorial, Sales, and Customer Service Office
Black Cat Publishing Inc.
144 West D Street Suite 204
Encinitas, CA 92009
http://www.black-cat.pub/

Copyright © 2019 by Black Cat Publishing Inc.

All rights reserved. No part of the material protected by this copyright notice may be
reproduced or utilized in any form, electronic or mechanical, including photocopying,
recording, or by any information storage and retrieval system, without written permission
from the copyright owner.

ISBN 13: 978-1-79758-314-3

Printed in the United States of America


Contents

Preface

I Rendering
Patrick Cozzi, editor

1 Adaptive GPU Tessellation with Compute Shaders
Jad Khoury, Jonathan Dupuy, and Christophe Riccio
1.1 Introduction
1.2 Implicit Triangle Subdivision
1.3 Adaptive Subdivision on the GPU
1.4 Discussion
1.5 Acknowledgments
Bibliography

2 Applying Vectorized Visibility on All Frequency Direct Illumination
Ho Chun Leung, Tze Yui Ho, Zhenni Wang, Chi Sing Leung, and Eric Wing Ming Wong
2.1 Introduction
2.2 The Precomputed Radiance Transfer
2.3 Rewriting the Radiance Equation
2.4 The Vectorized Visibility
2.5 Lighting Evaluation
2.6 Shader Implementation for the Generalized SAT Lookup
2.7 Dynamic Tessellation
2.8 Results
2.9 Conclusion
2.10 Acknowledgments
Bibliography

3 Nonperiodic Tiling of Noise-based Procedural Textures
Aleksandr Kirillov
3.1 Introduction
3.2 Wang Tiles
3.3 Nonperiodic Tiling of Procedural Noise Functions
3.4 Tiled Noise Filtering
3.5 Tiling Improvements
3.6 Results
3.7 Performance
3.8 Limitations
3.9 Conclusion
3.10 Future Work
Bibliography

4 Rendering Surgery Simulation with Vulkan
Nicholas Milef, Di Qi, and Suvranu De
4.1 Introduction
4.2 Overview
4.3 Render Pass Architecture
4.4 Handling Deformable Meshes
4.5 Memory Management System
4.6 Performance and results
4.7 Case Study: CCT
4.8 Conclusion and Future Work
4.9 Source Code
4.10 Acknowledgments
Bibliography

5 Skinned Decals
Hawar Doghramachi
5.1 Introduction
5.2 Overview
5.3 Implementation
5.4 Pros and Cons
5.5 Results
5.6 Conclusion
Bibliography

II Environmental Effects
Wolfgang Engel, editor

1 Real-Time Fluid Simulation in Shadow of the Tomb Raider
Peter Sikachev, Martin Palko, and Alexandre Chekroun
1.1 Introduction
1.2 Related Work
1.3 Simulation
1.4 Engine Integration
1.5 Optimization
1.6 Future Work
Acknowledgments
Bibliography

2 Real-time Snow Deformation in Horizon Zero Dawn: The Frozen Wilds
Kevin Örtegren
2.1 Introduction
2.2 Related work
2.3 Implementation
2.4 Results
2.5 Conclusion and Discussion
Bibliography

III Shadows
Mauricio Vives, editor

1 Soft Shadow Approximation for Dappled Light Sources
Mariano Merchante
1.1 Introduction
1.2 Detecting Pinholes
1.3 Shadow Rendering
1.4 Temporal Filtering
1.5 Results
1.6 Conclusion and Future Work
Bibliography

2 Parallax-Corrected Cached Shadow Maps
Pavlo Turchyn
2.1 Introduction
2.2 Parallax Correction Algorithm
2.3 Applications of Parallax Correction
2.4 Results
Bibliography

IV 3D Engine Design
Wessam Bahnassi, editor

1 Real-Time Layered Materials Compositing Using Spatial Clustering Encoding
Sergey Makeev
1.1 Introduction
1.2 Overview of Current Techniques
1.3 Introduced Terms
1.4 Algorithm Overview
1.5 Algorithm Implementation
1.6 Results
1.7 Conclusion and Future Work
Acknowledgments
Bibliography

2 Procedural Stochastic Textures by Tiling and Blending
Thomas Deliot and Eric Heitz
2.1 Introduction
2.2 Tiling and Blending
2.3 Precomputing the Histogram Transformations
2.4 Improvement: Using a Decorrelated Color Space
2.5 Improvement: Prefiltering the Look-up Table
2.6 Improvement: Using Compressed Texture Formats
2.7 Results
2.8 Conclusion
Acknowledgments
Bibliography

3 A Ray Casting Technique for Baked Texture Generation
Alain Galvan and Jeff Russell
3.1 Baking in Practice
3.2 GPU Considerations
3.3 Future Work
Bibliography

4 Writing an Efficient Vulkan Renderer
Arseny Kapoulkine
4.1 Memory Management
4.2 Descriptor Sets
4.3 Command Buffer Recording and Submission
4.4 Pipeline Barriers
4.5 Render Passes
4.6 Pipeline Objects
4.7 Conclusion
Acknowledgments

5 glTF—Runtime 3D Asset Delivery
Marco Hutter
5.1 The Goals of glTF
5.2 Design Choices
5.3 Feature Summary
5.4 Ecosystem
5.5 Tools and Workflows
5.6 Extensions
5.7 Application support
5.8 Conclusion

V Real-time Ray Tracing
Anton Kaplanyan, editor

1 Real-Time Ray-Traced One-Bounce Caustics
Holger Gruen
1.1 Introduction
1.2 Previous Work
1.3 Algorithm Overview
1.4 Implementation Details
1.5 Results
1.6 Future work
1.7 Demo
Bibliography

2 Adaptive Anti-Aliasing using Conservative Rasterization and GPU Ray Tracing
Rahul Sathe, Holger Gruen, Adam Marrs, Josef Spjut, Morgan McGuire, and Yury Uralsky
2.1 Introduction
2.2 Overview
2.3 Pixel Classification using Conservative Rasterization
2.4 Improved Coverage and Shading Computation
2.5 Image Quality and Performance
2.6 Future work
2.7 Demo
Bibliography

Preface

This book, like its long line of predecessors, is created with the intent of helping
readers better achieve their goals.
For generations, books were used to preserve valuable information. They are an
important source of knowledge in our modern world. With the rise of social media,
information is obscured and transformed into whatever the agenda of the poster is. It
became acceptable to bother other people with information that is sometimes tasteless,
mindless and/or nonsensical in all areas of life, including graphics programming.
Political parties and companies drive large scale misinformation activities (sometimes
called marketing or information warfare) with noise levels that are hard to bear.
This book is meant to provide an oasis of peace and intellectual reflection. All of
us who worked on it tried to make sure this collection of articles is practically useful,
stimulating for your mind, and a joy to read.
The awesome screenshot on the cover is provided by Jeroen Roding with permission
from Guerrilla Games. Thank you!
I would like to thank Eric Lengyel for editing the articles and creating the beautiful
page layout. I would also like to thank Anton Kaplanyan, Mauricio Vives, Patrick Cozzi,
and Wessam Bahnassi for being the section editors.
At this point, I also want to thank everyone for supporting this book series and its
predecessors since 2001. These books started friendships, careers, companies, and
much more over the years. They certainly changed my life in awesome ways!

Love and Peace,

—Wolfgang Engel

I Rendering

Real-time rendering is an exciting field, in part because of how rapidly it evolves and
advances, but also because of the graphics community’s eagerness and willingness to
share new ideas, opening the door for others to learn and share in the fun!
In this section we introduce five new rendering techniques that will be relevant to
game developers, hobbyists, and anyone else interested in the world of graphics.
The article “Adaptive GPU Tessellation with Compute Shaders” by Jad Khoury,
Jonathan Dupuy, and Christophe Riccio proposes making rasterization more efficient
for moderately distant polygons by procedurally refining coarse meshes as they get
closer to the camera with the help of compute shaders. They achieve this by manipulating
an implicit (triangle-based) subdivision scheme for each polygon of the scene in
a dedicated compute shader that reads from and writes to a compact, double-buffered
array.
The article “Applying Vectorized Visibility on All Frequency Direct Illumination”
by Ho Chun Leung, Tze Yui Ho, Zhenni Wang, Chi Sing Leung, and Eric Wing Ming Wong
describes a new PRT approach with visibility functions represented in vector graphics
form. This results in a different set of strengths and weaknesses compared to other PRT
approaches. This new approach can preserve the fidelity of high frequency shadows
and accurately account for a huge number of light sources, even with coarsely tessellated
3D models. It can also handle the specular component, from mirror to blurry
reflections.
The article “Nonperiodic Tiling of Noise-based Procedural Textures” by Aleksandr
Kirillov shows a method to combine noise-based procedural texture synthesis
with a nonperiodic tiling algorithm. It describes modifications to several popular
procedural noise functions that directly produce texture maps containing the smallest
possible complete Wang tile set. It can be used as a preprocessing step or during
application runtime.
The article “Rendering Surgery Simulation with Vulkan” by Nicholas Milef, Di
Qi, and Suvranu De shows a rendering system designed around surgery simulation,
including how higher-level design decisions propagate to lower-level usage of Vulkan.
The last article in the section, “Skinned Decals” by Hawar Doghramachi, describes
a way to dynamically apply decals to a character, for example to show the impact
position of a projectile. This technique overcomes the drawback of deferred decals
in that scenario: when the target area is influenced by several bones, the decals
“swim” on top of the target mesh.

—Patrick Cozzi

1 Adaptive GPU Tessellation with Compute Shaders
Jad Khoury, Jonathan Dupuy, and Christophe Riccio

1.1 Introduction
GPU rasterizers are most efficient when primitives project into more than a few pixels.
Below this limit, the Z-buffer starts aliasing, and shading rate decreases dramatically
[Riccio 2012]; this makes the rendering of geometrically complex scenes challenging,
as any moderately distant polygon will project to subpixel size. In order to minimize
such subpixel projections, a simple solution consists of procedurally refining coarse
meshes as they get closer to the camera. In this chapter, we are interested in deriving
such a procedural refinement technique for arbitrary polygon meshes.
Traditionally, mesh refinement has been computed on the CPU via recursive algorithms
such as quadtrees [Duchaineau et al. 1997, Strugar 2009] or subdivision surfaces
[Stam 1998, Cashman 2012]. Unfortunately, CPU-based refinement is now
fundamentally bottlenecked by the massive CPU-GPU streaming of geometric data it
requires for high resolution rendering. In order to avoid these data transfers, extensive
work has been dedicated to implementing and/or emulating these recursive algorithms
directly on the GPU by leveraging tessellation shaders (see, e.g., [Niessner et al. 2012,
Cashman 2012, Mistal 2013]). While tessellation shaders provide a flexible,
hardware-accelerated mechanism for mesh refinement, they remain limited in two respects.
First, they only allow up to $\log_2 64 = 6$ levels of subdivision. Second, their performance
drops as subdivision depth increases [AMD 2013].
In the following sections, we introduce a GPU-based refinement scheme that is
free from the limitations incurred by tessellation shaders. Specifically, our scheme
allows arbitrary subdivision levels at constant memory costs. We achieve this by
manipulating an implicit (triangle-based) subdivision scheme for each polygon of the scene
in a dedicated compute shader that reads from and writes to a compact, double-buffered
array. First, we show how we manage our implicit subdivision scheme in Section 1.2.
Then, we provide implementation details for rendering programs we wrote that leverage
our subdivision scheme in Section 1.3.

1.2 Implicit Triangle Subdivision

1.2.1 Subdivision Rule


Polygon refinement algorithms build upon a subdivision rule. The subdivision rule
describes how an input polygon splits into subpolygons. Here, we rely on a binary triangle
subdivision rule, which is illustrated in Figure 1.1(a). The rule splits a triangle into
two similar subtriangles 0 and 1, whose barycentric-space transformation matrices are
respectively

$$M_0 = \begin{pmatrix} -\tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix} \tag{1.1}$$

and

$$M_1 = \begin{pmatrix} \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ -\tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix}. \tag{1.2}$$

Listing 1.1 shows the GLSL code we use to procedurally compute either $M_0$ or $M_1$
based on a binary value. It is clear that at subdivision level $N \geq 0$, the rule produces
$2^N$ triangles; Figure 1.1(b) shows the refinement produced at subdivision level $N = 4$,
which consists of $2^4 = 16$ triangles.

Figure 1.1. The (a) subdivision rule we apply on a triangle (b) uniformly and (c) adaptively.
The subdivision levels for the red, blue, and green nodes are respectively 2, 3, and 4.

mat3 bitToXform(in uint bit)
{
    float s = float(bit) - 0.5;
    vec3 c1 = vec3(   s, -0.5, 0);
    vec3 c2 = vec3(-0.5,   -s, 0);
    vec3 c3 = vec3(+0.5, +0.5, 1);

    return mat3(c1, c2, c3);
}

Listing 1.1. Computing the subdivision matrix M 0 or M 1 from a binary value.
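
As a quick sanity check of the rule, the following worked example applies the two matrices built by Listing 1.1 to the corners of the reference triangle, written as homogeneous barycentric coordinates $(u, v, 1)$. The corner ordering used here is an assumption chosen for illustration.

```latex
% bit = 0 gives s = -1/2, so the columns are
%   c_1 = (-1/2, -1/2, 0),  c_2 = (-1/2, 1/2, 0),  c_3 = (1/2, 1/2, 1).
\begin{align*}
M_0 \,(0,0,1)^{T} &= c_3       = (\tfrac12, \tfrac12, 1)^{T}, \\
M_0 \,(1,0,1)^{T} &= c_1 + c_3 = (0, 0, 1)^{T}, \\
M_0 \,(0,1,1)^{T} &= c_2 + c_3 = (0, 1, 1)^{T}.
\end{align*}
% bit = 1 gives s = +1/2, so the columns are
%   c_1 = (1/2, -1/2, 0),  c_2 = (-1/2, -1/2, 0),  c_3 = (1/2, 1/2, 1).
\begin{align*}
M_1 \,(0,0,1)^{T} &= c_3       = (\tfrac12, \tfrac12, 1)^{T}, \\
M_1 \,(1,0,1)^{T} &= c_1 + c_3 = (1, 0, 1)^{T}, \\
M_1 \,(0,1,1)^{T} &= c_2 + c_3 = (0, 0, 1)^{T}.
\end{align*}
```

Both subtriangles share the corner $(0, 0)$ and the midpoint $(\tfrac12, \tfrac12)$ of the opposite edge, so together they cover the parent triangle, each with half of its area.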

1.2.2 Implicit Representation


By construction, our subdivision rule produces unique subtriangles at each step.
Therefore, any subtriangle can be represented implicitly via a concatenation of binary words,
which we call a key. In this key representation, each binary word corresponds to the
partition (either 0 or 1) chosen at a specific subdivision level; Figure 1.1(b, c) shows
the keys associated to each triangle node in the context of (b) uniform and (c) adaptive
subdivision. We retrieve the subdivision matrix associated to each key through successive
matrix multiplications in a sequence determined by the binary concatenations. For
example, letting $M_{0100}$ denote the transformation matrix associated to the key 0100, we
have

$$M_{0100} = M_0 \cdot M_1 \cdot M_0 \cdot M_0. \tag{1.3}$$

In our implementation, we store each key produced by our subdivision rule as a 32-bit
unsigned integer. Below is the bit representation of a 32-bit word, encoding the key
0100. Bits irrelevant to the code are denoted by the ‘_’ character.

MSB LSB
____ ____ ____ ____ ____ ____ ___1 0100

Note that we always prepend the key’s binary sequence with a binary value of 1 so we
can track the subdivision level associated to the key easily. Listing 1.2 provides the
GLSL code we use to extract the transformation matrix associated to an arbitrary key.
Since we use 32-bit integers, we can store up to $32 - 1 = 31$ levels of subdivision,
which includes the root node. Naturally, more levels require longer words. Because
longer integers are currently unavailable on many GPUs, we emulate them using integer
vectors, where each component represents a 32-bit wide portion of the entire key. For
more details, see our implementation, where we provide a 63-level subdivision algorithm
using the GLSL uvec2 datatype.
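
To illustrate the vector-based emulation, the sketch below shows one way the one-bit shifts could be carried across the two 32-bit halves of a uvec2 key. The function names and the convention that .x holds the upper word are assumptions made for this example; the authors' implementation may be organized differently.

```glsl
// Hypothetical 64-bit key stored as a uvec2: x = upper 32 bits, y = lower 32 bits.

// Shift the 64-bit key left by one bit and append `bit` (0u or 1u).
uvec2 childKey64(in uvec2 key, in uint bit)
{
    uint carry = key.y >> 31u;              // top bit of the lower word
    return uvec2((key.x << 1u) | carry,     // the carry moves into the upper word
                 (key.y << 1u) | bit);
}

// Shift the 64-bit key right by one bit, dropping the last subdivision bit.
uvec2 parentKey64(in uvec2 key)
{
    uint carry = (key.x & 1u) << 31u;       // lowest bit of the upper word
    return uvec2(key.x >> 1u, (key.y >> 1u) | carry);
}

// Subdivision level = position of the leading 1 bit in the 64-bit word.
int keyLevel64(in uvec2 key)
{
    return (key.x != 0u) ? 32 + findMSB(key.x) : findMSB(key.y);
}
```

With this layout the leading 1 can sit as high as bit 63, which is consistent with the 63 levels quoted in the text.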

mat3 keyToXform(in uint key)
{
    mat3 xf = mat3(1);

    while (key > 1u)
    {
        xf = bitToXform(key & 1u) * xf;
        key = key >> 1u;
    }

    return xf;
}

Listing 1.2. Key to transformation matrix decoding routine.
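
The role of the leading 1 bit can be made explicit with a few small helpers. The sketch below is illustrative only: makeKey, keyLevel, and keyPath are hypothetical names that do not appear in the chapter's listings.

```glsl
// Hypothetical helpers, not part of the chapter's listings.

// Build a key from `level` path bits, the first split being the most significant.
// Assumes 0 <= level <= 31 and that `path` has no bit set at or above `level`.
uint makeKey(in uint path, in int level)
{
    return (1u << uint(level)) | path;   // the leading 1 marks where the path starts
}

// Subdivision level of a key: index of its leading 1 bit.
int keyLevel(in uint key)
{
    return findMSB(key);                 // the key 1 0100 (binary) is at level 4
}

// Path bits of a key, without the leading 1.
uint keyPath(in uint key)
{
    int level = findMSB(key);
    return (level > 0) ? bitfieldExtract(key, 0, level) : 0u;
}
```

Because the leading 1 is always present, a key of 0 never occurs and the root is simply the key 1, which keyToXform maps to the identity matrix.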

1.2.3 Iterative Construction


Subdivision is recursive by nature. Since GPU execution units lack stacks, implementing
GPU recursion is difficult. In order to circumvent this difficulty, we store the triangles
produced by our subdivision as keys inside a buffer that we update iteratively in a
ping-pong fashion; we refer to this double-buffer as the subdivision buffer. Because
our keys consist of integers, our subdivision buffer is very compact. At each iteration,
we process the keys independently in a compute shader, which is set to write in the
second buffer. We allow three possible outcomes for each key: it can be subdivided to
the next level, downgraded to the previous subdivision level, or conserved as is. Such
operations are very straightforward to implement thanks to our key representation. The
following bit representations match the parent of the key given in our previous example
along with its two children:

        MSB                                 LSB
parent: ____ ____ ____ ____ ____ ____ ____ 1010
key:    ____ ____ ____ ____ ____ ____ ___1 0100
child1: ____ ____ ____ ____ ____ ____ __10 1000
child2: ____ ____ ____ ____ ____ ____ __10 1001

Note that compared to the key representation, the other keys are either 1-bit expansions
or contractions. The GLSL code to compute these representations is shown in Listing 1.3;
it simply consists of bit shifts and logical operations, and is thus very cheap.
Listing 1.4 provides the pseudocode we typically use for updating the subdivision
buffer in a GLSL compute shader. In practice, if a key needs to be split, it emits two
new words, and the original key is deleted. Conversely, when two sibling keys must
merge, they are replaced by their parent’s key. In order to avoid generating two copies
of the same key in memory, we only emit the key once from the 0-child, identified
using the test provided in Listing 1.5. We also provide in Listing 1.6 some unit tests
we perform on the keys to avoid producing invalid keys. Keys that do not require
any modification are simply re-emitted unchanged.

uint parentKey(in uint key)
{
    return (key >> 1u);
}

void childrenKeys(in uint key, out uint children[2])
{
    children[0] = (key << 1u) | 0u;
    children[1] = (key << 1u) | 1u;
}

Listing 1.3. Implicit subdivision procedures in GLSL.
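
The same bit tricks extend to other tree queries. The helpers below are a minimal sketch under the key layout described above; siblingKey, ancestorKey, and isAncestorOrSelf are hypothetical names introduced for illustration.

```glsl
// Hypothetical helpers built on the same bit operations as Listing 1.3.

// The sibling shares the parent and differs only in the last bit.
// Only meaningful for non-root keys (key > 1u).
uint siblingKey(in uint key)
{
    return key ^ 1u;
}

// Ancestor `n` levels up (n >= 0), clamped so the result never goes above the root.
uint ancestorKey(in uint key, in int n)
{
    int level = findMSB(key);
    return key >> uint(min(n, level));
}

// True if `a` equals `b` or lies above `b` in the subdivision tree.
bool isAncestorOrSelf(in uint a, in uint b)
{
    int la = findMSB(a);
    int lb = findMSB(b);
    return la <= lb && (b >> uint(lb - la)) == a;
}
```

For the example key 1 0100, siblingKey yields 1 0101 and ancestorKey(key, 2) yields 101.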

buffer keyBufferOut { uint u_SubdBufferOut[]; };
uniform atomic_uint u_SubdBufferCounter;

// write a key to the subdivision buffer
void writeKey(uint key)
{
    uint idx = atomicCounterIncrement(u_SubdBufferCounter);
    u_SubdBufferOut[idx] = key;
}

// general routine to update the subdivision buffer
void updateSubdBuffer(uint key, int targetLod)
{
    // extract subdivision level associated to the key
    int keyLod = findMSB(key);

    // update the key accordingly
    if (/* subdivide ? */ keyLod < targetLod && !isLeafKey(key))
    {
        uint children[2]; childrenKeys(key, children);

        writeKey(children[0]);
        writeKey(children[1]);
    }
    else if (/* keep ? */ keyLod <= targetLod)
    {
        // also reached by leaf keys, which cannot be subdivided further
        writeKey(key);
    }
    else /* merge ? */
    {
        if (/* is root ? */ isRootKey(key))
        {
            writeKey(key);
        }
        else if (/* is zero child ? */ isChildZeroKey(key))
        {
            writeKey(parentKey(key));
        }
    }
}

Listing 1.4. Updating the subdivision buffer on the GPU.
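
To show how such a routine might sit inside a complete kernel, here is a minimal, self-contained compute shader sketch. The binding points, the workgroup size, the u_SubdBufferIn, u_KeyCount, and u_TargetLod names, and the use of a single target level for every key are all assumptions made for illustration; the chapter derives a per-key level in Section 1.3.2.

```glsl
#version 450
layout(local_size_x = 64) in;

// Hypothetical bindings and names, chosen for this sketch only.
layout(std430, binding = 0) readonly  buffer KeyBufferIn  { uint u_SubdBufferIn[];  };
layout(std430, binding = 1) writeonly buffer KeyBufferOut { uint u_SubdBufferOut[]; };
layout(binding = 0, offset = 0) uniform atomic_uint u_SubdBufferCounter;

uniform uint u_KeyCount;   // number of valid keys in the input buffer
uniform int  u_TargetLod;  // one level for all keys (>= 0); a real criterion is per key

// Append a key to the output buffer.
void writeKey(uint key)
{
    uint idx = atomicCounterIncrement(u_SubdBufferCounter);
    u_SubdBufferOut[idx] = key;
}

void main()
{
    uint i = gl_GlobalInvocationID.x;
    if (i >= u_KeyCount)
        return;                              // surplus threads of the last workgroup

    uint key    = u_SubdBufferIn[i];
    int  keyLod = findMSB(key);
    bool isLeaf = (keyLod == 31);

    if (keyLod < u_TargetLod && !isLeaf)
    {
        writeKey(key << 1u);                 // 0-child
        writeKey((key << 1u) | 1u);          // 1-child
    }
    else if (keyLod <= u_TargetLod)
    {
        writeKey(key);                       // keep (includes unsplittable leaves)
    }
    else if (key == 1u)
    {
        writeKey(key);                       // the root has no parent to merge into
    }
    else if ((key & 1u) == 0u)
    {
        writeKey(key >> 1u);                 // merge: only the 0-child emits the parent
    }
}
```

In this sketch the host is assumed to reset the counter to zero and swap the two buffer bindings between iterations, which is one way to realize the ping-pong update described in the text.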

bool isChildZeroKey(in uint key) { return ((key & 1u) == 0u); }

Listing 1.5. Determining if the key represents the 0-child of its parent.

bool isRootKey(in uint key) { return (key == 1u); }
bool isLeafKey(in uint key) { return findMSB(key) == 31; }

Listing 1.6. Determining whether a key is a root key or a leaf key.

It should be clear that our approach maps very well to the GPU. This allows us to
compute adaptive subdivisions such as the one shown in Figure 1.1(c). Note that an
iteration only permits a single refinement or coarsening operation per key. Thus, when
more are needed, multiple buffer iterations should be performed. In our rendering
implementations, we perform a single buffer iteration at the beginning of each frame.

1.2.4 Conversion to Explicit Geometry


For the sake of completeness, we provide here some additional details on how we convert
our implicit subdivision keys into actual geometry. We achieve this easily with
GPU instancing. Specifically, we instantiate a triangle for each subdivision key located
in our subdivision buffer. For each instance, we determine the location of the triangle
vertices using the routines of Listing 1.7. Note that these routines focus on computing
the coordinates of the vertices of the subdivided triangles; extending them to handle
other attributes such as normals or texture coordinates is straightforward.

// barycentric interpolation
vec3 berp(in vec3 v[3], in vec2 u)
{
    return v[0] + u.x * (v[1] - v[0]) + u.y * (v[2] - v[0]);
}

// subdivision routine (vertex position only)
void subd(in uint key, in vec3 v_in[3], out vec3 v_out[3])
{
    mat3 xf = keyToXform(key);
    vec2 u1 = (xf * vec3(0, 0, 1)).xy;
    vec2 u2 = (xf * vec3(1, 0, 1)).xy;
    vec2 u3 = (xf * vec3(0, 1, 1)).xy;

    v_out[0] = berp(v_in, u1);
    v_out[1] = berp(v_in, u2);
    v_out[2] = berp(v_in, u3);
}

Listing 1.7. Compute the vertices v_out of the subtriangle associated to a subdivision key
generated from a triangle defined by vertices v_in.
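
As an example of the attribute extension mentioned above, the sketch below interpolates texture coordinates with the same barycentric weights. The subdTexCoords routine is hypothetical; it takes the matrix already decoded from the key so that the decoding is not repeated for each attribute.

```glsl
// Barycentric interpolation of a 2D attribute (a vec2 overload of berp).
vec2 berp(in vec2 v[3], in vec2 u)
{
    return v[0] + u.x * (v[1] - v[0]) + u.y * (v[2] - v[0]);
}

// Hypothetical companion to subd(): texture coordinates of the subtriangle.
// `xf` is the matrix decoded from the subdivision key (see Listing 1.2).
void subdTexCoords(in mat3 xf, in vec2 t_in[3], out vec2 t_out[3])
{
    // barycentric coordinates of the three subtriangle corners
    vec2 u1 = (xf * vec3(0, 0, 1)).xy;
    vec2 u2 = (xf * vec3(1, 0, 1)).xy;
    vec2 u3 = (xf * vec3(0, 1, 1)).xy;

    t_out[0] = berp(t_in, u1);
    t_out[1] = berp(t_in, u2);
    t_out[2] = berp(t_in, u3);
}
```

Normals could be handled the same way with the vec3 version of berp, followed by a normalize() call, since linear interpolation does not preserve unit length.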

1.3 Adaptive Subdivision on the GPU

1.3.1 Overview
In this section, we describe a tessellation technique for polygonal geometry that lever-
ages our implicit subdivision scheme. Our technique computes an adaptive subdivision
for each polygon in the scene, so as to control their extent in screen-space and hence
minimize subpixel projections; we describe how we compute such subdivisions using
a distance-based LOD criterion in Section 1.3.2. Since adaptive subdivisions usually
lead to T-junction polygons, we also describe how we avoid them entirely in
Section 1.3.3.
In practice, our technique requires three GPU kernels with OpenGL 4.5; Fig-
ure 1.2 diagrams the OpenGL pipeline of our implementation. The first kernel (Lod-
Kernel) updates the subdivision buffer in a compute shader using the algorithms
described in the previous section. In addition, we perform view-frustum culling for
each key and write the visible ones to a buffer (CulledSubdBuffer) using an atomic
counter. Next, we launch a second compute kernel (IndirectBatcherKernel) that pre-
pares an indirect compute dispatch call for the next subdivision update (i.e., the next
invocation of LodKernel), as well as an indirect draw call for the third and final kernel.
The final kernel (RenderKernel) executes the indirect drawing commands to render the
final geometry to the framebuffer (FrameBuffer). It instances a grid of triangles
(InstancedGeometryBuffers) for each key located in the frustum-culled subdivision buffer
(CulledSubdBuffer).

Figure 1.2. OpenGL pipeline of our compute-based tessellation shader. The green, red, and
gray boxes respectively denote GPU memory buffers, GPU code execution, and CPU code
execution.

1.3.2 LOD Function


In order to guarantee that the transformed vertices produce rasterizer-friendly poly-
gons, we rely on a distance-based criterion to determine how to update the subdivision
buffer. Indeed, under perspective projection, the image plane size s at distance z from
the camera scales according to the relation

$$s(z) = 2 z \tan\left(\frac{\theta}{2}\right),$$

where $\theta \in (0, \pi)$ is the horizontal field of view. Based on this observation, we derive
the following routine to determine the ideal subdivision level k that each key should
target:

float distanceToLod(float z)
{
    float tmp = s(z) * targetPixelSize / screenResolution;
    return -log2(clamp(tmp, 0.0, 1.0));
}

Here, the parameter z denotes the distance from the camera to the subtriangle associ-
ated to the key being processed. Listing 1.8 provides the GLSL pseudocode we execute
in LodKernel.

buffer VertexBuffer { vec3 u_VertexBuffer[]; };
buffer IndexBuffer { uint u_IndexBuffer[]; };
buffer SubdBufferIn { uvec2 u_SubdBufferIn[]; };

void main()
{
    // get threadID (each key is associated to a thread)
    uint threadID = gl_GlobalInvocationID.x;

    // get coarse triangle associated to the key
    uint primID = u_SubdBufferIn[threadID].y;
    vec3 v_in[3] = vec3[3](
        u_VertexBuffer[u_IndexBuffer[primID * 3    ]],
        u_VertexBuffer[u_IndexBuffer[primID * 3 + 1]],
        u_VertexBuffer[u_IndexBuffer[primID * 3 + 2]]
    );

    // compute distance-based LOD at the midpoint of the hypotenuse
    uint key = u_SubdBufferIn[threadID].x;
    vec3 v[3];
    subd(key, v_in, v);
    float z = distance((v[1] + v[2]) / 2.0, camPos);
    int targetLod = int(distanceToLod(z));

    // write to u_SubdBufferOut
    updateSubdBuffer(key, targetLod);
}

Listing 1.8. Adaptive subdivision using a distance-based criterion.
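
To get a feel for the values produced by the distance-based criterion, the following Python
sketch transcribes s(z) and distanceToLod() and evaluates them on the CPU. The field of
view, target pixel size, and screen resolution used here are assumed values chosen for
illustration only; they are not taken from our renderer.

```python
import math

def image_plane_size(z, fov):
    """Size s(z) of the image plane at distance z for a horizontal field of view (radians)."""
    return 2.0 * z * math.tan(fov / 2.0)

def distance_to_lod(z, fov, target_pixel_size, screen_resolution):
    """Python transcription of distanceToLod(); returns a fractional subdivision level."""
    tmp = image_plane_size(z, fov) * target_pixel_size / screen_resolution
    tmp = min(max(tmp, 0.0), 1.0)      # clamp(tmp, 0.0, 1.0)
    if tmp == 0.0:                     # -log2(0) is +inf in GLSL; math.log2(0) would raise
        return math.inf
    return -math.log2(tmp)

if __name__ == "__main__":
    fov = math.radians(90.0)           # assumed field of view
    for z in (96.0, 48.0, 24.0, 12.0):
        lod = distance_to_lod(z, fov, 10.0, 1920.0)   # assumed 10-pixel target, 1920-pixel width
        print(f"z = {z:5.1f} -> target LOD {lod:.3f}")
```

As long as the clamp is inactive, halving the distance z halves tmp and therefore raises the
target level by exactly one, which is the behavior the integer cast in Listing 1.8 relies on.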

1.3.3 T-Junction Removal


As for any other adaptive polygon-refinement scheme, our technique can produce
T-junction triangles whenever two neighboring keys differ in subdivision level. For in-
stance, Figure 1.1(c) shows a T-junction between the neighboring triangles associated
to the keys 00, 0101 and 0100. T-junctions are problematic for rendering because they
lead to visible cracks whenever the vertices are displaced by a smoothing function or
a displacement map. Fortunately, our subdivision scheme has the property that it does
not produce T-junctions as long as two neighboring keys differ by no more than one
subdivision level; this is noticeable for the green and blue keys of Figure 1.1(c). In
order to guarantee such key configurations, we apply our distance-based criterion to the
midpoint of the hypotenuse of each subtriangle; see Listing 1.8. In our tests, this
approach produced crack-free renderings for any target edge length lower than 16 pixels
(we noticed some T-junctions above this value when the instanced grid is highly
tessellated). Therefore, we chose to rely on such a system as it avoids the need for a
sophisticated T-junction removal system; Listing 1.9 shows the code we use in the vertex
shader of our RenderKernel.

buffer VertexBuffer { vec3 u_VertexBuffer[]; };
buffer IndexBuffer { uint u_IndexBuffer[]; };
in vec2 i_InstancedVertex;
in uvec2 i_PerInstanceKey;

void main()
{
    // get coarse triangle associated to the key
    uint primID = i_PerInstanceKey.y;
    vec3 v_in[3] = vec3[3](
        u_VertexBuffer[u_IndexBuffer[primID * 3    ]],
        u_VertexBuffer[u_IndexBuffer[primID * 3 + 1]],
        u_VertexBuffer[u_IndexBuffer[primID * 3 + 2]]
    );

    // compute vertex location
    uint key = i_PerInstanceKey.x;
    vec3 v[3];
    subd(key, v_in, v);
    vec3 finalVertex = berp(v, i_InstancedVertex);

    // displace, deform, project, etc.
}

Listing 1.9. Vertex shader of RenderKernel: computing the final vertex location from a
subdivision key and an instanced grid vertex.

1.3.4 Results
To demonstrate the effectiveness of our method, we wrote a renderer for
displacement-mapped terrains, and another one for meshes; our source code is available on GitHub at
https://github.com/jadkhoury/TessellationDemo, and a terrain rendering result is
shown in Figure 1.3. In Table 1.1, we give the CPU and GPU timings of a zoom-
in/zoom-out sequence in the terrain at 1080p. The camera's orientation was fixed, look-
ing downwards, so that the terrain would occupy the whole framebuffer, thus maintain-
ing constant rasterization activity. We configured the renderer to target an average
triangle edge length of 10 pixels; Figure 1.3 shows the wireframe of such a target. The
testing platform is an Intel i7-8700k CPU, running at 3.70 GHz, and an Nvidia GTX
1080 GPU with 8 GiB of memory. Note that the CPU activity only consists of OpenGL
uniform variables and driver management. On current implementations, such tasks run
asynchronously to the GPU.

Figure 1.3. Crack-free, multiresolution terrain rendered entirely on the GPU using compute-
based subdivision and displacement mapping. The alternating colors show the different subdivi-
sion levels.

Kernel    CPU (ms)   GPU (ms)   CPU stdev   GPU stdev
LOD       0.038      0.042      0.160       0.031
Batch     0.028      0.003      0.011       0.001
Render    0.035      0.184      0.018       0.013

Table 1.1. CPU and GPU timings and their respective standard deviations over a zoom-in
sequence of 5000 frames.

As demonstrated by the reported numbers, our implementation is both fast and stable.
Naturally, the average GPU rendering time depends on how the terrain is shaded. In our
experiment, we use a constant color so that the reported timings correspond to the
vertex-processing overhead of our subdivision technique.

1.4 Discussion
We introduced a novel compute-based subdivision algorithm that runs entirely on the
GPU thanks to an implicit representation. In future work, we would like to explore the
feasibility of this representation for more complex subdivision schemes such as
Catmull-Clark. In the meantime, we provide next a few additional considerations that we
think can be relevant in the context of our work.

How much memory should be allocated for the buffers containing the subdivision
keys? This depends on the target polygon density in screen space. The buffers should
be able to store at least $3 \times \text{max\_level} + 1$ nodes, and do not need to exceed a
capacity of $4^{\text{max\_level}}$ nodes. The lower bound corresponds to a perfectly restricted
subdivision, where neighboring triangles differ by at most one level of subdivision. The
upper bound gives the number of cells at the finest level in the case of uniform subdivision.
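
As a quick sizing aid, the sketch below evaluates these two bounds for a given maximum
level, reading the lower bound as 3 * max_level + 1 and the upper bound as 4 ** max_level.
The 8 bytes per key follow from the uvec2 entries of the subdivision buffer in Listing 1.8;
the function names are hypothetical helpers, not part of our code.

```python
def subd_buffer_bounds(max_level):
    """Return (lower, upper) node-count bounds for a given maximum subdivision level."""
    lower = 3 * max_level + 1   # lower bound as stated in the text (restricted subdivision)
    upper = 4 ** max_level      # upper bound as stated in the text (uniform subdivision)
    return lower, upper

def subd_buffer_bytes(max_level, bytes_per_key=8):
    """Same bounds expressed in bytes, assuming one uvec2 (2 x 32 bits) per key."""
    lower, upper = subd_buffer_bounds(max_level)
    return lower * bytes_per_key, upper * bytes_per_key

if __name__ == "__main__":
    for level in (4, 8, 12):
        print(level, subd_buffer_bounds(level), subd_buffer_bytes(level))
```

The gap between the two bounds grows quickly with the maximum level, which is why the
capacity is best chosen from the target polygon density rather than from the upper bound.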

Is our subdivision technique prone to floating-point precision issues? There are
no issues regarding the implicit subdivision itself, as each key is represented with bit
sequences only. However, problems may occur when computing the transformation
matrices in Listing 1.1. Our 31-level subdivision implementation does not have this
issue, but higher levels eventually will. A simple way to delay the problem on
OpenGL 4+ hardware is to use double precision, which should provide sufficient headroom
for most applications.

How about combining this technique with tessellation shaders to overcome the
subdivision limits of the hardware? We have actually implemented such an approach.
Our open-source implementation is available on GitHub at
https://github.com/jdupuy/opengl-framework (see the demo-isubd-terrain demo). With both
approaches at hand, we leave it up to developers to decide which one is best given their
software and hardware constraints.

There are two ways to control polygon density. Either use the implicit subdivision,
or refine the instanced triangle grid. Which approach is best? This will naturally
depend on the platform. Our code provides tools to modify the tessellation of the in-
stanced triangle grid, so that its impact can be thoroughly measured; Figure 1.4 plots
the performance evolution that we measured on our platform.

Can our implicit subdivision scheme smooth input meshes? Our implicit subdivision
scheme offers the same functionality as tessellation shaders. Therefore, any
smoothing technique that runs with tessellation shaders also runs with our subdivision
shaders. For instance, the mesh renderer we provide implements PN-triangles [Vlachos
et al. 2001] and Phong Tessellation [Boubekeur and Alexa 2008] to smooth the surface
of the coarse meshes we refine; Figure 1.5 shows our mesh renderer applying either
bilinear interpolation or Phong Tessellation to a coarse triangle mesh.

1.5 Acknowledgments
This chapter is the result of Jad Khoury’s master’s thesis, which was supervised by Jonathan
Dupuy. All authors conducted this work at Unity Technologies.

Figure 1.4. Performance evolution with respect to the level of subdivision of the instanced
triangle grid on an NVidia GTX 1080.

Figure 1.5. Our subdivision technique applied on (a) a triangle mesh using (b) bilinear inter-
polation and (c) Phong tessellation [Boubekeur and Alexa 2008].

Bibliography
AMD 2013. GCN Performance Tweets. List of all GCN performance tweets that were released
during the first few months of 2013. URL:
http://developer.amd.com/wordpress/media/2013/05/GCNPerformanceTweets.pdf.
BOUBEKEUR, T., AND ALEXA, M. 2008. Phong Tessellation. ACM Transactions on Graphics
(Proc. SIGGRAPH Asia 2008) 27:5.
CASHMAN, T. 2012. Beyond Catmull-Clark? A Survey of Advances in Subdivision Surface
Methods. Comput. Graph. Forum 31:1, 42–61. URL:
https://doi.org/10.1111/j.1467-8659.2011.02083.x.
DUCHAINEAU, M., WOLINSKY, M., SIGETI, D., MILLER, M., ALDRICH, C., AND
MINEEV-WEINSTEIN, M. 1997. ROAMing Terrain: Real-time Optimally Adapting Meshes. In
Proceedings of the 8th Conference on Visualization ’97, pp. 81–88. IEEE Computer Society Press.
MISTAL, B. 2013. GPU Terrain Subdivision and Tessellation. In GPU Pro 4, 3–20.
NIESSNER, M., LOOP, C., MEYER, M., AND DEROSE, T. 2012. Feature-adaptive GPU Rendering
of Catmull-Clark Subdivision Surfaces. ACM Trans. Graph. 31:1, 6:1–6:11.
RICCIO, C. 2012. Southern Islands in deep dive. SIGGRAPH Tech Talk. URL:
https://www.g-truc.net/doc/Siggraph2012%20Tech%20talk.pptx.
STAM, J. 1998. Exact Evaluation of Catmull-Clark Subdivision Surfaces at Arbitrary Parameter
Values. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive
Techniques, SIGGRAPH ’98, pp. 395–404. URL: http://doi.acm.org/10.1145/280814.280945.
STRUGAR, F. 2009. Continuous Distance-dependent Level of Detail for Rendering Heightmaps.
Journal of Graphics, GPU, and Game Tools 14:4, 57–74.
VLACHOS, A., PETERS, J., BOYD, C., AND MITCHELL, J. 2001. Curved PN Triangles. In
Proceedings of the 2001 Symposium on Interactive 3D Graphics, I3D ’01, pp. 159–166. URL:
http://doi.acm.org/10.1145/364338.364387.
2 Applying Vectorized Visibility on All Frequency Direct Illumination
Ho Chun Leung, Tze Yui Ho, Zhenni Wang, Chi Sing Leung, and Eric Wing Ming Wong

2.1 Introduction
The Precomputed Radiance Transfer (PRT) [Sloan et al. 2002] is a general framework
for illuminating surfaces using precomputed transfer functions. It plays an important
role in the more elaborate rendering applications in use nowadays (e.g., [Elcott 2016]).
Preparing PRT usually involves a computationally expensive precomputation step that
evaluates the light bounces and transfer in a scene; the output is a set of transfer
functions, usually prepared per vertex. Such transfer functions can be prepared by
computational simulation (e.g., ray tracing) or measured from the real world with
specifically designed equipment [Matusik et al. 2003]. Whether simulated or measured,
the fundamental objective of PRT remains the same, i.e., providing instant access to the
transfer functions for rendering applications.
The formulation of PRT is so general that it can capture practically all visual
effects due to illumination, including shadows, interreflections, lighting functions,
etc. While PRT can render many visual features, the real challenge is to make it
efficient as well, and this challenge has two parts. First, the PRT approach needs an
efficient means to evaluate the radiance equation (i.e., a numerical integral) with
potentially several hundred thousand samples. The evaluation has to be efficient because
we might need to perform it a few million times per frame (e.g., one evaluation for each
onscreen pixel at a screen resolution of 1280 × 962). Second, it needs an algorithm to
compress the transfer functions to a manageable data size. When local effects (e.g.,
view-dependent lighting, shadows, and interreflections) are included, the size of the
precomputed transfer functions can be on the order of gigabytes. Even if the memory
consumption and the bandwidth of the distribution medium are not a concern, it is hardly
worthwhile to spend that much memory on illumination.
Thankfully, the human visual system is very forgiving of illumination that differs
slightly from reality. Even if we simplify the lighting evaluation to the extreme,
the rendered results can still be visually pleasing. For instance, if we restrict the view
position to a single point infinitely far away and ignore the local effects, PRT reduces
to MatCap [Brauer 2010, Moreno 2018]; if we ignore just the local effects, PRT
reduces to environment mapping with mipmapping for glossy reflections
[Scheuermann and Isidoro 2006]; if we restrict the view position to a single point and fix
the 3D model postures (i.e., essentially a static image), PRT reduces to Image Based
Lighting [Russell 2015]. Beyond these simplified approaches, there exist many PRT
approaches [Ng et al. 2003, Ng et al. 2004, Sloan et al. 2003, Tsai and Shih 2006, Kautz
et al. 2002, Liu et al. 2004, Ben-Artzi et al. 2006, Lam et al. 2010, Wang et al. 2009,
Wang et al. 2013], which attack the problem head-on by modeling the transfer
functions as a whole instead of dropping effects whenever they are inconvenient.
This article presents the implementation and the rendering algorithms for the
vectorized visibility [Ho et al. 2018], a variant of the PRT approaches. It differentiates
itself from the other PRT approaches by representing its visibility functions in
vector graphics form. This difference is so fundamental that the corresponding
rendering algorithm evolves along a very different path from other PRT
approaches, which results in a different set of strengths and weaknesses.
Our PRT approach preserves the fidelity of high-frequency shadows and accurately
accounts for a huge number of light sources even with coarsely tessellated 3D models;
it can also handle the specular component from mirror to blurry reflection.¹
Both per-vertex and per-fragment direct illumination are supported. It can also make
use of dynamic tessellation for better scalability, which is faster than per-fragment
evaluation and gives better quality than per-vertex evaluation. The specular component in
the real world does have shadows,² and our algorithm can capture the visual impression of
these specular component shadows.³

¹ The video demonstration of our BRDF editing feature: https://youtu.be/tvNnQfL5UT4.
² The video of the specular component shadows in the real world: https://youtu.be/EezUtywshyI.
³ The video of the specular component shadows from our demo program: https://youtu.be/dlLLZTyY7xs.

2.2 The Precomputed Radiance Transfer

Before we can appreciate the differences, we need a deeper understanding of the PRT
approaches. We begin with the radiance equation, which is given by

$$I(\omega_0, n, L) = \int_\Omega f_r(\omega_0, n, \omega)\, L(\omega)\, d\omega, \tag{2.1}$$

where $\omega_0$ is the viewing direction, $\omega$ is the incident direction, $n$ is the surface normal,
$f_r$ is the Bidirectional Reflectance Distribution Function (BRDF) (or the transfer
function), $L$ is the environment map (or the illumination), and $\Omega$ is the spherical domain.
(Note that we will refer to $f_r$ as the transfer function from this point onward.)
Conceptually, the equation states that we obtain the appearance of a surface point by adding
up the influence of all contributing light sources. In simple terms, it is a compact
way to describe lighting by multiple light sources, with the integral sign simply
meaning as many light sources as possible.
An obvious reason for writing the radiance equation in its integral form is to
apply frequency analysis, that is, to transform the transfer function $f_r$ and the
illumination $L$ into their frequency representations and perform the integration in the
frequency domain. In the frequency domain, the integral of the product of $f_r$ and $L$
becomes a dot product of their frequency coefficient vectors $C_{f_r}$ and $C_L$, i.e.,

$$I(\omega_0, n, L) = C_{f_r} \cdot C_L. \tag{2.2}$$

Figure 2.1 illustrates the radiance equation evaluation using frequency representations.
Using only the lower-frequency components, for instance 16 coefficients per representation,
a numerical integral potentially requiring hundreds of thousands of samples can be
approximated by a simple dot product of two 16-component vectors; this makes the radiance
evaluation efficient enough for real-time applications, as presented by Sloan et al. [2002].
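
The idea can be reproduced in one dimension with a few lines of Python. The sketch below
is only an analogy: it uses an orthonormal discrete cosine basis on 64 samples instead of
spherical harmonics, and the two signals standing in for the transfer function and the
illumination are made up. It shows that the dot product of the full coefficient vectors
equals the brute-force sum, and that keeping only the 16 lowest-frequency coefficients
still gives a close estimate for smooth signals.

```python
import math

def dct_basis(n):
    """Rows of an orthonormal DCT-II basis for signals with n samples."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows

def project(signal, basis):
    """Coefficient vector of a signal in the given orthonormal basis."""
    return [sum(b * s for b, s in zip(row, signal)) for row in basis]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

n = 64
basis = dct_basis(n)
# Smooth, made-up stand-ins for the transfer function and the illumination.
f = [1.0 + math.cos(2.0 * math.pi * i / n) for i in range(n)]
light = [0.5 + 0.5 * math.sin(2.0 * math.pi * i / n + 0.3) for i in range(n)]

c_f, c_l = project(f, basis), project(light, basis)
direct = dot(f, light)                 # brute-force sum over all samples
full = dot(c_f, c_l)                   # all 64 coefficients: same value up to rounding
truncated = dot(c_f[:16], c_l[:16])    # 16 lowest-frequency coefficients only

if __name__ == "__main__":
    print(direct, full, truncated)
```

The truncated estimate is accurate here only because both signals are smooth; a signal
with sharp edges, such as a hard shadow boundary, would spread its energy over many more
coefficients.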
Sloan et al. [2002] use spherical harmonics as their basis for the frequency
representation, which is, loosely speaking, the Fourier series for the spherical domain.
Like the Fourier series, spherical harmonics form a global basis, which is inefficient for
handling high-frequency signals. As a result, Sloan et al. [2002] can only render blurry
effects, e.g., blurry shadows and the diffuse lighting function.
Ng et al. [2003] extend Sloan’s work by replacing the global basis with a local
basis⁴ (i.e., Haar wavelets in the cubemap space). To make things more focused, the
effects involved are simplified from global illumination (i.e., view-independent
lighting, visibility, and interreflections) to direct illumination (i.e., view-independent
lighting and visibility). In this case, the radiance equation becomes

$$I(\omega_0, n, L) = \int_\Omega S(\omega)\, f(\omega_0, n, \omega)\, L(\omega)\, (\omega \cdot n)\, d\omega, \tag{2.3}$$

where $L$ is the environment map, $S$ is the visibility function separated from $f_r$, $f$ is $f_r$
without the visibility information, and $\omega \cdot n$ is the cosine of the incident angle.
Taking advantage of the local basis (i.e., its greater efficiency in handling
high-frequency signals), Ng et al. [2003] attain all-frequency shadow rendering quality.

⁴ A basis is a set of prototype functions. A global basis means each of its functions is
non-zero in general throughout the domain. In contrast, a local basis (also referred to as
a compactly supported basis) means each of its functions is non-zero only within the
neighborhood of a particular point.

Figure 2.1. Illustration of the radiance equation evaluation using frequency
representations. The efficiency of this approach comes from aggressively dropping less
important components until the number of components is manageable.

2.3 Rewriting the Radiance Equation


While the local basis does perform better than the global basis,
both use the same strategy to make the radiance equation evaluation efficient, i.e.,
aggressively dropping less important components until the number of components is
manageable. For instance, suppose the illumination comes from a 6 × 256 × 256
cubemap, which gives us roughly 400k light sources. Transforming it to a spherical
wavelet representation [Ng et al. 2003], we have 400k components; even if an impractically
large number of important components (say 1k components) is used, we will be
discarding 99.75% of the information.
Losing this much information is bound to damage the radiance equation evaluation,
and therefore the rendered results. In this case, the subtle shadow transitions in the
darker regions will be discarded. A discussion of this issue can be found in
[Ng et al. 2004]. These transitions, though subtle, do have a significant impact on the
visual impression, and the image quality loss due to the aggressive strategy is
immediately identifiable when the rendered results are compared to the ground truth.
If efficiency is the only concern, aggressively dropping less important components
in the frequency domain is not the only option. We do have a numerical tool that can
provide accurate numerical integrals regardless of the number of samples, i.e., the
summed area table (SAT) [Crow 1984]. To apply SATs to the radiance equation
evaluation, the radiance equation needs to be adjusted accordingly. We begin the
adjustment from Equation (2.3), i.e., the direct illumination equation with visibility.
Removing the cosine term $\omega \cdot n$, we have

$$I(\omega_0, n, L) = \int_\Omega S(\omega)\, f(\omega_0, n, \omega)\, L(\omega)\, d\omega. \tag{2.4}$$

Replacing the integration domain $\Omega$ with the visible region described by the visibility
function $S(\omega)$, we have

$$I(\omega_0, n, L) = \int_S f(\omega_0, n, \omega)\, L(\omega)\, d\omega. \tag{2.5}$$

Green’s theorem, one of the major theorems in multivariable calculus, relates
the line integral of a two-dimensional vector field over a closed path (i.e.,
a contour integral) to the double integral over the region it encloses. Conveniently,
the visible regions described by the visibility function $S$ are exactly regions enclosed
by contours, which allows us to apply Green’s theorem to Equation (2.5).
Applying Green’s theorem, we can rewrite the radiance equation from a double
integral to a contour integral, i.e.,

$$I(\omega_0, n, L) = \oint_A \mathrm{SAT}[f(\omega_0, n, \omega)\, L(\omega)](p)\, dp, \tag{2.6}$$

where $A$ is the contour of the visibility function $S$, and $\mathrm{SAT}[g]$ denotes the
pre-integrated spherical function of a given spherical function $g$, i.e.,

$$\mathrm{SAT}[g](p) = \int_0^{p_\theta} \int_0^{p_\phi} g(\phi, \theta) \sin\theta \, d\phi \, d\theta, \tag{2.7}$$

where $p$ is a unit vector, $(p_\phi, p_\theta)$ are the spherical coordinates of $p$, and
$(\phi, \theta)$ is the parameterization of $\Omega$ in spherical coordinates. Figure 2.2 illustrates
the radiance equation evaluation using the vectorized visibility representation.

Figure 2.2. Illustration of the radiance equation evaluation using the vectorized
visibility representation. Using the visibility function as the domain and applying
Green’s theorem, we can convert the radiance equation to a contour integral. The vectorized
visibility happens to be the ideal candidate for the domain of the contour integral.

As a contour integral, Equation (2.6) only needs to process one-dimensional data,
i.e., the contour of the visibility function. Having the contour represented in vector
graphics form, we decouple the visibility function from the illumination resolution.
Hence, even with 400k light sources, Equation (2.6) can still provide an accurate
integral without dropping any information, at the same computational cost. This makes
Equation (2.6) both efficient and accurate for the radiance equation evaluation.
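
The following Python sketch illustrates the principle on a flat pixel grid rather than on
the sphere, so it is a discrete analogue of Equation (2.6) and not the spherical
formulation itself. The grid values, the disk-shaped region, and all function names are
made up for the illustration. The values are pre-integrated once along each column, after
which the sum over the region is recovered from its boundary edges only.

```python
def column_prefix(g):
    """F[y][x] = g[0][x] + ... + g[y-1][x]; F has one more row than g."""
    h, w = len(g), len(g[0])
    F = [[0] * w for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            F[y + 1][x] = F[y][x] + g[y][x]
    return F

def boundary_edges(mask):
    """Horizontal boundary edges of the region, stored as (x, y, sign)."""
    h, w = len(mask), len(mask[0])
    edges = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            if y + 1 == h or not mask[y + 1][x]:
                edges.append((x, y + 1, +1))   # upper end of a vertical run of region cells
            if y == 0 or not mask[y - 1][x]:
                edges.append((x, y, -1))       # lower end of a vertical run of region cells
    return edges

def contour_sum(F, edges):
    """Sum of g over the region, using only pre-integrated values on the boundary."""
    return sum(sign * F[y][x] for x, y, sign in edges)

w, h = 16, 12
g = [[(3 * x + 5 * y) % 7 + 1 for x in range(w)] for y in range(h)]               # made-up values
mask = [[(x - 8) ** 2 + (y - 6) ** 2 <= 20 for x in range(w)] for y in range(h)]  # made-up region

F = column_prefix(g)            # computed once for the illumination
edges = boundary_edges(mask)    # computed once for the region
area_sum = sum(g[y][x] for y in range(h) for x in range(w) if mask[y][x])
edge_sum = contour_sum(F, edges)

if __name__ == "__main__":
    print(area_sum, edge_sum, len(edges))
```

The evaluation cost of contour_sum depends on the number of boundary edges, not on the
number of cells inside the region or on the grid resolution, which mirrors the decoupling
of visibility from illumination resolution described above.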
The relative pros and cons of the frequency analysis approach versus Equation (2.6)
might not be immediately obvious. Metaphorically, they are analogous to those of the
Finite Element Method (FEM) versus the Finite Boundary Method (FBM), respectively. The
downside of FBM is that it takes more effort to formulate the simulation (in many cases,
such a formulation may not even be feasible), whereas FEM works in general. However,
when FBM is applicable, it can provide accuracy several orders of magnitude higher at a
tiny fraction of the computational cost. Two applications of Equation (2.6), i.e., diffuse
lighting and specular lighting, are discussed in Section 2.5.

2.4 The Vectorized Visibility


Our representation, i.e., the vectorized visibility [Ho et al. 2015], is designed to model
all visibility functions in a small neighborhood, instead of just for a single point. It
represents them using a sequence of position vectors $\{a_i\}$ (see Figure 2.3(b)). This
representation has an intuitive geometrical meaning, i.e., closed paths on the 3D
model (see Figure 2.3(a)). Given a vectorized visibility $\{a_i\}$, we can synthesize the
visibility contour at any point in 3D space with a simple operation (see
Figure 2.3(d)), i.e., translation followed by normalization. To be precise, the synthesized
visibility contour $\{u_i\}$ at an arbitrary point $p$ is given by
$\{u_i\} = \{\mathrm{normalize}(a_i - p)\}$.
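
A minimal Python sketch of this synthesis step is given below. The function names and the
sample position vectors are hypothetical; the actual implementation runs in GLSL and reads
the position vectors from a texture.

```python
import math

def normalize(v):
    """Scale v to unit length; v must not be the zero vector."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def synthesize_contour(positions, p):
    """Translate each position vector a_i by -p, then normalize: u_i = normalize(a_i - p)."""
    return [normalize(tuple(a - q for a, q in zip(a_i, p))) for a_i in positions]

if __name__ == "__main__":
    # A made-up closed path of position vectors (a square occluder outline at height z = 2).
    a = [(-1.0, -1.0, 2.0), (1.0, -1.0, 2.0), (1.0, 1.0, 2.0), (-1.0, 1.0, 2.0)]
    for p in [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]:
        print(p, synthesize_contour(a, p))
```

The same sequence of position vectors yields a different contour for each evaluation point
p, which is what allows a single vectorized visibility to serve a whole neighborhood.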
To obtain the vectorized visibility at a given sample origin $o$, we render
the 3D model to a visibility cubemap centered at $o$ (see Figure 2.4(a)), sweep
along the boundary of the occluded region, and record the corresponding position vectors
as sequences (see Figure 2.4(b)). For more efficient manipulation, all the individual
sequences are concatenated, using zero-area closed paths as connectors, to form a
single sequence (see Figure 2.4(c)). The resulting sequence is the vectorized visibility
$\{a_i\}$. Two paths to and from the same pair of endpoints cancel each other out
during the integration. Therefore, we can add an arbitrary number of them without
altering the integration.
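
The cancellation can be checked with a planar analogue. In the Python sketch below, the
shoelace formula plays the role of the contour integral (it returns the enclosed signed
area), and two made-up closed loops are merged into a single sequence with a
there-and-back connector. This is an illustration in the plane, not the spherical
integral used by our renderer.

```python
def signed_area(points):
    """Shoelace formula: a contour integral that returns the enclosed signed area."""
    total = 0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]   # wrap around to close the path
        total += x0 * y1 - x1 * y0
    return total / 2

def concatenate(loop_a, loop_b):
    """Join two closed loops into one sequence with a there-and-back connector."""
    # The path runs around loop_a, returns to its start, jumps to loop_b, runs around it,
    # returns to its start, and the implicit closing edge travels back to loop_a's start.
    return loop_a + [loop_a[0]] + loop_b + [loop_b[0]]

loop_a = [(1, 1), (2, 1), (2, 2), (1, 2)]    # unit square
loop_b = [(4, 3), (6, 3), (6, 5), (4, 5)]    # 2 x 2 square
merged = concatenate(loop_a, loop_b)

if __name__ == "__main__":
    print(signed_area(loop_a), signed_area(loop_b), signed_area(merged))
```

The connector is traversed once in each direction, so its two contributions have opposite
signs and the merged sequence integrates to the sum of the two individual loops.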
We prepare one vectorized visibility for each vertex, with the vertex as its sample
origin. The resulting vectorized visibility of all vertices is then stored in a 3D
texture of resolution $W_v \times H_v \times D$, where $W_v \times H_v \geq$ the number of vertices, and $D$
is the number of position vectors (see Figure 2.4(d)). The process is simple and
deterministic and can be implemented easily with a bit of C++ and GLSL; a generation
program along with its source code is available in the demo package.

Figure 2.3. More about the nature of the vectorized visibility. (a) The sample origin: the
vectorized visibility sampled at the yellow dot, plotted on the 3D model. (b) The vectorized
visibility in (a). (c) The synthesis locations for the visibility contours in (d). (d) The
visibility contours synthesized at the locations identified in (c) using the vectorized
visibility shown in (b). A single vectorized visibility representation can already
synthesize a family of visibility contours.

Figure 2.4. The data structure for the vectorized visibility. (a) The visibility cubemap.
(b) The sequences of position vectors obtained from the boundaries in (a). (c) Multiple
sequences can be connected to form a single sequence. (d) The position vector sequences of
all vertices are stored in a 3D texture.

2.5 Lighting Evaluation


For the diffuse component, the radiance equation for direct illumination is given by

$$I_d = \int_\Omega S(\omega)\, \max(n \cdot \omega, 0)\, L(\omega)\, d\omega. \tag{2.8}$$

Applying Green’s theorem as described in Equation (2.6), we have

$$I_d = n \cdot \oint_A \mathrm{SAT}[\omega\, L(\omega)](p)\, dp. \tag{2.9}$$

We use the synthesized visibility contour $\{u_i\}$, the unit vector sequence presented
in Section 2.4, as the representation of $A$. Each consecutive pair of unit vectors
defines a line segment, and the contour integral is evaluated by summing up the partial
results of all line segments. The radiance then becomes

$$I_d = n \cdot \sum_i \mathrm{vsat}\left(\mathrm{SAT}[\omega\, L(\omega)], u_i, u_{i+1}\right), \tag{2.10}$$

where vsat is the generalized SAT lookup operator for line segments, presented
in Section 2.6.
For the specular component, we use the Phong lighting function, and its radiance
equation is given by

$$I_s = \int_\Omega S(\omega)\, \mathrm{Phong}(r, \omega, s)\, L(\omega)\, d\omega, \tag{2.11}$$

where $\mathrm{Phong}(r, \omega, s) = \max(r \cdot \omega, 0)^s$, $r$ is the reflected view direction, and
$s$ is the glossiness. However, because $r \cdot \omega$ is raised to the power $s$, it is not
practical to factorize Equation (2.11) directly as we did for the diffuse component in
Equation (2.9).
Therefore, we approximate $I_s$ with

$$I_s = \frac{I_s}{\bar{I}_s}\, \bar{I}_s \approx \frac{I_{Cap}}{\bar{I}_{Cap}}\, \bar{I}_s, \tag{2.12}$$

where $\bar{I}_s = \int_\Omega \mathrm{Phong}(r, \omega, s)\, L(\omega)\, d\omega$ is the radiance value without
visibility, and $I_{Cap}$ and $\bar{I}_{Cap}$ are the radiance values corresponding to $I_s$ and
$\bar{I}_s$, obtained by substituting an algebraically friendly function for the Phong lighting
function. As shown in Figure 2.5, Equation (2.12) expresses $I_s$ as a fraction of the
radiance value without visibility, $\bar{I}_s$. The intention is to isolate the influence of
visibility from the lighting function as a ratio, so that we can evaluate the ratio with
an algebraically friendly function while preserving the features of the lighting function
with $\bar{I}_s$. Our algebraically friendly function is

$$\mathrm{Cap}(r, \omega, \xi) = \frac{\max(r \cdot \omega - \cos\xi, 0)}{1 - \cos\xi}, \tag{2.13}$$

where $\xi$ is the angular radius of a circular window $W(r, \xi)$ centered at $r$. Better
still, both the ratio $I_{Cap} / \bar{I}_{Cap}$ and $\bar{I}_s$ are computationally friendly. $\bar{I}_s$ can
be obtained easily from the cubemap mipmap of $L$. However, calculating the ratio is a bit
tricky. Expanding the ratio, we have

$$\frac{I_{Cap}}{\bar{I}_{Cap}} = \frac{\int_A \mathrm{Cap}(r, \omega, \xi)\, L(\omega)\, d\omega}{\int_\Omega \mathrm{Cap}(r, \omega, \xi)\, L(\omega)\, d\omega} = \frac{\int_A \max(r \cdot \omega - \cos\xi, 0)\, L(\omega)\, d\omega}{\int_\Omega \max(r \cdot \omega - \cos\xi, 0)\, L(\omega)\, d\omega}. \tag{2.14}$$

Figure 2.5. Visualizing each of the components separately. The specular component is
estimated by the product of a ratio estimator and the filtered $L$ without visibility.

Examining Equation (2.14) carefully, we see that $r$ cannot be factored out of the
integral due to the presence of the max function. The max function can be considered
a form of visibility function, which is a circular window in this case. Using this fact,
we can eliminate the max function by changing the domain to the intersection of the
original domain and the circular window; we then have

$$\frac{I_{Cap}}{\bar{I}_{Cap}} = \frac{\int_{A \cap W(r, \xi)} (r \cdot \omega - \cos\xi)\, L(\omega)\, d\omega}{\int_{\Omega \cap W(r, \xi)} (r \cdot \omega - \cos\xi)\, L(\omega)\, d\omega}. \tag{2.15}$$

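Without the explicit presence of the max function, $r$ can be factored out, and we have

$$\frac{I_{Cap}}{\bar{I}_{Cap}} = \frac{r \cdot \int_{A \cap W(r, \xi)} \omega\, L(\omega)\, d\omega - (\cos\xi) \int_{A \cap W(r, \xi)} L(\omega)\, d\omega}{r \cdot \int_{W(r, \xi)} \omega\, L(\omega)\, d\omega - (\cos\xi) \int_{W(r, \xi)} L(\omega)\, d\omega}. \tag{2.16}$$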

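The integrals in Equation (2.16) can all be evaluated efficiently using a formulation
similar to Equation (2.10), given the contour of $A \cap W(\mathbf{r}, \xi)$. The contour of $A \cap W(\mathbf{r}, \xi)$
can be obtained by clamping all the unit vectors in $A$ to the circular window; each
clamped unit vector is given by

$$\mathbf{v}_{\mathrm{clamp}}(\mathbf{a}_i) =
\begin{cases}
\mathbf{r} \cos\xi + \dfrac{(\mathbf{a}_i \cdot \mathbf{r}_b)\, \mathbf{r}_b + (\mathbf{a}_i \cdot \mathbf{r}_t)\, \mathbf{r}_t}{\sqrt{(\mathbf{a}_i \cdot \mathbf{r}_b)^2 + (\mathbf{a}_i \cdot \mathbf{r}_t)^2}}\, \sin\xi, & \text{if } \mathbf{a}_i \cdot \mathbf{r} < \cos\xi; \\[2ex]
\mathbf{a}_i, & \text{otherwise,}
\end{cases} \tag{2.17}$$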


Figure 2.6. Intersecting a visibility contour with a circular window. (a) shows a visibility con-
tour (black and red) and a circular window (blue). (b) shows the visibility contour of their inter-
section. As indicated by the red arrows, the clamped unit vectors will form either the missing
paths of intersected regions or some zero-area round trips.

where $(\mathbf{r}_b, \mathbf{r}, \mathbf{r}_t)$ are orthonormal vectors. The clamped unit vectors will form either the
missing paths of intersected regions or some zero-area round trips as shown in
Figure 2.6(b).
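The step from Equation (2.14) to Equation (2.16) can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the illumination function, the visible region, and the latitude-longitude sampling are all hypothetical stand-ins, not the chapter's data or its SAT-based evaluation. It integrates the max-based form directly and compares it with the factorized form restricted to the circular window.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sphere_samples(n_theta=48, n_phi=96):
    """Yield (direction, solid angle) pairs on a latitude-longitude grid."""
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            w = (math.sin(theta) * math.cos(phi),
                 math.sin(theta) * math.sin(phi),
                 math.cos(theta))
            yield w, math.sin(theta) * d_theta * d_phi

def illumination(w):
    # Hypothetical smooth, strictly positive environment (assumption).
    return 1.0 + 0.5 * w[0] + 0.25 * w[2]

def visible(w):
    # Hypothetical visible region A: the half-space x > 0 (assumption).
    return w[0] > 0.0

def ratio_with_max(r, xi):
    """Equation (2.14): integrate the max-based function directly.
    The common factor 1 / (1 - cos xi) cancels in the ratio."""
    cos_xi = math.cos(xi)
    num = den = 0.0
    for w, d_omega in sphere_samples():
        term = max(dot(r, w) - cos_xi, 0.0) * illumination(w) * d_omega
        den += term
        if visible(w):
            num += term
    return num / den

def ratio_factorized(r, xi):
    """Equation (2.16): restrict the domain to W(r, xi), then factor r out."""
    cos_xi = math.cos(xi)
    vec_a, vec_w = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    sc_a = sc_w = 0.0
    for w, d_omega in sphere_samples():
        if dot(r, w) <= cos_xi:
            continue  # outside the circular window
        weight = illumination(w) * d_omega
        sc_w += weight
        for k in range(3):
            vec_w[k] += w[k] * weight
        if visible(w):
            sc_a += weight
            for k in range(3):
                vec_a[k] += w[k] * weight
    num = dot(r, vec_a) - cos_xi * sc_a
    den = dot(r, vec_w) - cos_xi * sc_w
    return num / den

r_len = math.sqrt(0.3 ** 2 + 0.2 ** 2 + 0.9 ** 2)
r = (0.3 / r_len, 0.2 / r_len, 0.9 / r_len)
ratio_a = ratio_with_max(r, 0.6)
ratio_b = ratio_factorized(r, 0.6)
```

Both functions sum the same samples, so the two ratios agree up to floating-point rounding. The factorized form needs only a vector-valued and a scalar-valued integral over the window, which is what makes a pre-integrated lookup applicable.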
The GLSL code for the evaluation process of the diffuse and specular components
is provided below.

    void vsatlookup(
        sampler2D ysat,     // the pre-integrated environment map
        sampler3D vcptmap,  // the vectorized visibility 3D texture
        vec2 ci,            // the texture coordinates for vmap
                            // corresponding to a given vertex
        vec3 cv,            // the vertex position in the global
                            // coordinate system
        mat3 mtx,           // local to global coordinate system
                            // rotation matrix
        out vec4 ID,        // the diffuse component radiance value
        float phongbound,   // the glossiness
        vec3 ax,            // these three vectors form a coordinate
        vec3 ay,            // system with ay pointing in the reflected
        vec3 az,            // view direction
        out vec4 II)        // the specular component radiance ratio
    {
        int DD = textureSize(vcptmap, 0).z; // the number of position
                                            // vectors in the visibility
        float dd = 1.0 / float(DD);
        vec3 td = vec3(ci, 0.5 * dd);

        vec3 ll;     // a temp variable for unit vectors
        vec2 d0, d1; // the spherical coordinates of a line segment
                     // for the diffuse component
        vec2 a0, a1; // the spherical coordinates of a line segment
                     // for the specular component

        vec2 bound = vec2(cos(phongbound), sin(phongbound));

        ID = vec4(0.0);
        II = vec4(0.0);

        // synthesize visibility
        ll = normalize((mtx * texture(vcptmap, td).xyz) - cv);
        d0 = Float32Angle(ll); // unit vector to spherical coords

        // clamping to the circular window
        a0 = Float32Angle(specboundc(bound, ax, ay, az, ll));

        for (int i = 1; i < DD; i++) // doing integration line segment
        {                            // by line segment
            td.z += dd;

            // synthesize visibility
            ll = normalize((mtx * texture(vcptmap, td).xyz) - cv);
            d1 = Float32Angle(ll); // unit vector to spherical coords

            // clamping to the circular window
            a1 = Float32Angle(specboundc(bound, ax, ay, az, ll));

            // generalized SAT lookup for the diffuse component
            ID += satlookup(ysat, tppair(vec4(d0, d1)));

            // generalized SAT lookup for the specular component
            II += satlookup(ysat, tppair(vec4(a0, a1)));

            d0 = d1;
            a0 = a1;
        }

        // south pole correction for the diffuse component
        if (ID.w < -0.01)
            ID += texture(ysat, vec2(DEND, 1.0));

        // south pole correction for the specular component
        if (II.w < -1.0)
            II += texture(ysat, vec2(DEND, 1.0));
    }
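The clamping step of Equation (2.17) can be written out in a few lines. The following Python sketch is an assumption-laden illustration: the function name `clamp_to_window` and the fallback for a vector exactly opposite to r are hypothetical, and the body of the `specboundc` helper called in the listing is not shown in the text, so this is not a transcription of it.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def clamp_to_window(a, r, r_b, r_t, xi):
    """Clamp the unit vector a to the circular window of radius xi around r.

    (r_b, r, r_t) are assumed to be orthonormal, as in Equation (2.17).
    """
    if dot(a, r) >= math.cos(xi):
        return a  # already inside the window
    u = dot(a, r_b)
    v = dot(a, r_t)
    norm = math.hypot(u, v)
    if norm == 0.0:
        # a is exactly opposite to r, so the tangential direction is
        # undefined; falling back to r_b is an arbitrary choice (assumption).
        u, v, norm = 1.0, 0.0, 1.0
    c = math.cos(xi)
    s = math.sin(xi) / norm
    return tuple(c * r[k] + s * (u * r_b[k] + v * r_t[k]) for k in range(3))

# Example: a vector perpendicular to r is pulled onto the window boundary.
r, r_b, r_t = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
clamped = clamp_to_window((1.0, 0.0, 0.0), r, r_b, r_t, 0.5)
inside = clamp_to_window((0.0, 0.0, 1.0), r, r_b, r_t, 0.5)
```

A clamped vector stays a unit vector and lies exactly on the window boundary, i.e., its dot product with r equals cos ξ.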

2.6 Shader Implementation for the Generalized SAT Lookup
The ordinary SAT lookup has a well-known limitation: the domain of the integral must
be an axis-aligned rectangle. The straightforward way to extend it to support regions
defined with line segments is to approximate the line segments with horizontal lines.
However, the horizontal line SAT lookup is not good enough for our representation
because our representation is defined with non-axis-aligned line segments. Therefore,
we further extend the SAT to the generalized SAT lookup.
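For reference, the ordinary SAT lookup and its rectangle restriction can be sketched as follows. This is a generic CPU-side summed-area table in Python, not the chapter's shader; the names `build_sat` and `rect_sum` are hypothetical.

```python
def build_sat(img):
    """Return a summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat

def rect_sum(sat, x0, y0, x1, y1):
    """Sum of img over x0 <= x < x1 and y0 <= y < y1 using four lookups."""
    return sat[y1][x1] - sat[y1][x0] - sat[y0][x1] + sat[y0][x0]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
sat = build_sat(img)
```

The four lookups can only describe a region bounded by horizontal and vertical lines, which is why a slanted segment needs the interpolation introduced in this section.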
Suppose we want to find the integral swept by a non-axis-aligned line segment as
shown in Figure 2.7(a), and the functions concerned are the three component functions
of the diffuse component in Equation (2.9). Then, the true integral $G$ will be a value
between $E$ and $F$, where $E = \mathrm{SAT}_{\omega L(\omega)}(\phi_2, \theta_1) - \mathrm{SAT}_{\omega L(\omega)}(\phi_1, \theta_1)$ and
$F = \mathrm{SAT}_{\omega L(\omega)}(\phi_2, \theta_2) - \mathrm{SAT}_{\omega L(\omega)}(\phi_1, \theta_2)$. Thus, we can express the true
integral $G$ as an interpolation of $E$ and $F$, i.e.,

$$G \approx \mathrm{vsat}\big(\mathrm{SAT}_{\omega L(\omega)}, \mathbf{a}_1, \mathbf{a}_2\big) = (1 - \beta)\, E + \beta F, \tag{2.18}$$

where $\mathbf{a}_1 = (\phi_1, \theta_1)$, $\mathbf{a}_2 = (\phi_2, \theta_2)$, $\beta$ is an interpolation factor in $[0, 1]$, and
vsat is the generalized SAT lookup operator for the line segment $(\mathbf{a}_1, \mathbf{a}_2)$.
To compute $\beta$, we exploit the three component functions of $\omega L(\omega)$. Although the
three component functions are designed for the radiance evaluation in the first place, we
can also consider them as weighted directions, which are weighted by the illumination
$L$. Considering it this way, we can calculate a centroid $\mathbf{c} = \mathrm{normalize}(E - F)$,
which is the dominant light source direction of the yellow region in Figure 2.7(a).
If we assume that the dominant light source has all the energy and attains its
maximum at $x_c$ with a bell-shaped distribution $\mathrm{sech}^2(x - x_c)$, the radiance evaluation

Figure 2.7. (a) Non-axis-aligned line segment. (b) The bell-shaped function. (c) The rendering
results comparison between the ordinary SAT and the generalized SAT. The rendered image using
the generalized SAT lookup eliminates the jagged shadow boundaries that are caused by the
well-known limitation of the ordinary SAT.

will degenerate to the 1D scenario as shown in Figure 2.7(b); $x_c$ is the perpendicular
distance from $\mathbf{c}$ to the diagonal. Then, the interpolation factor $\beta$ can be approximated
by

$$\beta \approx \frac{\int_{0}^{\infty} \mathrm{sech}^2\big(\alpha (x - x_c)\big)\, dx}{\int_{-\infty}^{\infty} \mathrm{sech}^2\big(\alpha (x - x_c)\big)\, dx} = \frac{1}{2} + \frac{1}{2} \tanh(\alpha x_c), \tag{2.19}$$

where $\alpha$ is an empirical constant depending on the environment map resolution, and
the suggested value for a 6 × 256 × 256 cubemap is 5.8. Equation (2.19) is formulated
using a principle similar to the one used to derive the specular component radiance
equation, i.e., the radiance ratio with and without visibility. The only difference is that
it is done in a 1D scenario. The GLSL code for the generalized SAT lookup is provided below.

    vec4 satlookup(sampler2D satmap, vec4 cc)
    {
        vec2 A = cc.xy - cc.zw;
        // E and F of Equation (2.18)
        vec4 E = texture(satmap, cc.xy) - texture(satmap, cc.zy);
        vec4 F = texture(satmap, cc.xw) - texture(satmap, cc.zw);

        // dl is the interpolation factor beta of Equation (2.19)
        float dl = 0.5;
        if (abs(A.x) > 0.0001 && abs(A.y) > 0.0001)
        {
            vec2 B = Float32Angle(normalize((E.xyz - F.xyz)
                                            * sign(E.w - F.w)));
            B.x = (B.x + 0.5 / DSIDE) / 1.5;
            B.x += step(B.x, min(cc.x, cc.z)) / 1.5;
            B -= cc.zw;
            dl += tanh(5.87 * clamp((B.y / A.y - B.x / A.x), -1.0, 1.0))
                  * 0.5;
        }

        return mix(E, F, dl);
    }
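The closed form on the right-hand side of Equation (2.19) follows from the antiderivative of sech², which is tanh. As a sanity check, the Python sketch below compares the closed form with a plain midpoint-rule integration of the two integrals; the integration bounds, step count, and function names are arbitrary choices for this illustration, and α = 5.8 is the value suggested in the text.

```python
import math

def beta_closed_form(x_c, alpha=5.8):
    """Right-hand side of Equation (2.19)."""
    return 0.5 + 0.5 * math.tanh(alpha * x_c)

def beta_numeric(x_c, alpha=5.8, half_width=10.0, steps=40000):
    """Ratio of the half-line integral to the full integral of
    sech^2(alpha * (x - x_c)), using the midpoint rule on
    [-half_width, half_width] as a stand-in for the infinite range."""
    dx = 2.0 * half_width / steps
    num = den = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * dx
        value = dx / math.cosh(alpha * (x - x_c)) ** 2
        den += value
        if x > 0.0:
            num += value
    return num / den
```

When the centroid sits on the diagonal (x_c = 0), both forms give β = 0.5, so E and F are weighted equally; as the centroid moves away, β saturates towards 0 or 1.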

To demonstrate that the generalized SAT lookup can appropriately account for the
integral of non-axis-aligned line segments, we compare it to the horizontal line SAT
lookup. If the SAT lookup can only provide the integrals of axis-aligned line segments,
the rendered image will have jagged shadow boundaries as shown in Figure 2.7(c).
As shown in the figure, the rendered image using the generalized SAT lookup does not
have the jagged shadow boundaries.

2.7 Dynamic Tessellation


As mentioned in Section 2.3, applying Green’s theorem converts the radiance equation
to the contour integral Equation (2.6). Representing the visibility functions in vector
graphics form decouples them from the illumination resolution. Together, these make
Equation (2.6) an accurate and computationally friendly variant of the radiance equation.
Evaluating the radiance equation with arbitrarily high accuracy using Equation (2.6) is
intriguing; unfortunately, as promising as it might sound, it does not necessarily imply
better image quality. As a matter of fact, if the number of sampled transfer functions does
not keep up with the screen resolution, it will just exaggerate the vertex-to-pixel
interpolation artifact (see Figure 2.8(b)). In digital signal processing terms, this is
aliasing caused by too few sampled transfer functions.
The visibility function for every point on the surface has different content in
general. To achieve faithful all-frequency shadow quality (see Figure 2.8(a)), we
have to sample the 3D model densely for the transfer functions, where the sampling
needs to be dense enough that every onscreen pixel roughly corresponds to a distinct
sampled transfer function. We can roughly work out the required data size given the screen
resolution. For instance, given the screen resolution 1280 × 962 and the visibility
function resolution 6 × 256 × 256 (1 bit storage), we will have 56 gigabytes for the
visibility functions.
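The 56 gigabyte figure can be reproduced with a few lines of arithmetic. The Python sketch below assumes one visibility function per screen pixel, one bit per cubemap texel, and binary gigabytes (2^30 bytes); the function name is hypothetical.

```python
def visibility_storage_bytes(width, height, faces=6, face_res=256, bits_per_texel=1):
    """Storage needed for one visibility cubemap per screen pixel, in bytes."""
    pixels = width * height
    bits_per_function = faces * face_res * face_res * bits_per_texel
    return pixels * bits_per_function // 8

total_bytes = visibility_storage_bytes(1280, 962)
total_gib = total_bytes / 2 ** 30
```

Each visibility function takes 48 KB at this resolution, and there are about 1.23 million pixels, which gives roughly 56.4 binary gigabytes.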
Most PRT approaches will have the visibility function premultiplied to the lighting
functions, and compress the resulting data by removing redundancy across multiple
dimensions (e.g., across the vertex dimension). There exist quite a number of compres-
sion algorithms for this task, e.g., cluster PCA [Sloan et al. 2003, Wang et al. 2009,
Wang et al. 2013] and cluster tensors [Tsai and Shih 2006]. However, as demonstrated
in [Ho et al. 2015], if the transfer functions are compressed down to a few megabytes,
there can be significant rendering artifacts.
General purpose compression algorithms come with a wide variety of emphases.
More elaborate PRT approaches also integrate antialiasing into their compression
algorithms (e.g., [Ng et al. 2004] and [Wang et al. 2009]). These algorithms bound
the signal frequency so that it does not exceed the representation capability of the
given number of samples. In less formal words, that is blurring everything with
higher order filters. However, antialiasing is not the only option to handle aliasing.
Aliasing, as the negative effect caused by too few samples, can also be solved
by increasing the number of samples. This is precisely what we do for the ground truth
image, i.e., increasing the number of sampled transfer functions such that each pixel
has its own transfer function.
To realize this, we utilize the vectorized visibility. As mentioned in Section 2.4,
the vectorized visibility is designed to model all visibility functions in a small neigh-
borhood. Using this property, we can synthesize an individual visibility contour for
each pixel. To render the direct illumination of a surface position, we synthesize the
visibility contour using the vectorized visibility of its neighboring vertices, and then
feed the synthesized visibility contours to Equation (2.6) for evaluations (i.e., the

Figure 2.8. The rendering results for the teapot model: (a) ground truth, (b) per-vertex,
(c) per-fragment, and (d) dynamic tessellation. Figure (a) is the reference image generated
using 6 × 256 × 256 directional light sources, where the visibility is handled by the
shadow volume algorithm. Figures (b)–(d) are the rendering results of our algorithm using
different rendering modes. The per-vertex result was intended to be the ground truth;
however, it failed because of its severe aliasing.

radiance equation in contour integral form). If we intend to do per-fragment direct
illumination rendering, the surface positions will all be the rasterized vertex positions
of the onscreen pixels. As shown in Figure 2.8(c), the per-fragment rendering result
accurately resembles the ground truth in Figure 2.8(a) (and is therefore naturally free
of aliasing). Likewise, if we intend to do per-vertex direct illumination rendering (see
Figure 2.8(b)), the surface positions will all be the vertices. In other words, the whole
rendering setup boils down to defining the neighboring vertices for the surface positions
concerned. This flexibility gives us more freedom in the design of rendering modes.
Rendering 1280 × 962 full screen per-fragment with the radiance equation evaluated
at 400k illumination resolution, our algorithm achieves a few frames per
second with a GTX 660 display card. Given the image quality requirement and the
rendering capabilities of our approach (see Section 2.1), this rendering speed is
practically impossible for most PRT approaches to achieve. However, the uncompromising
image quality is simultaneously the major drawback of our algorithm, as image
quality is not always the primary concern. In a lot of cases, users might prefer
sacrificing some image quality for real-time performance. Most PRT approaches handle
this gracefully through their antialiasing feature, while neither the per-vertex nor the
per-fragment mode of our algorithm offers such a flexible tradeoff. To cover this
loose end, we improvise a little using the flexible rendering setup of our algorithm.
As mentioned in the introduction, we can use dynamic tessellation to provide a
flexible tradeoff between speed and quality: it is faster than the per-fragment mode,
has better quality than the per-vertex mode, and scales better. If we examine the
per-fragment rendered result in Figure 2.8(c), we can see that a good portion of the
regions are almost identical to the per-vertex result in Figure 2.8(b). This is equivalent
to saying that a lot of pixels are using extremely expensive computation just to render
some blurry signals. In other words, the utilization is very low.
To improve the utilization, an intuitive idea would be adjusting the sampling rate
of individual triangles on demand according to the signal frequency. In terms of dy-
namic tessellation, it will decide the tessellation level per triangle. For the frequency
estimation, we estimate the frequency per vertex with

 SAT ω   ω   ω  dω ,


 
freq  A, n ,      abs  n  (2.20)
 A

where $\bar{L}$ is a blurry version of $L$. Then, for a triangle, we have three frequency values
$(\mathrm{freq}_0, \mathrm{freq}_1, \mathrm{freq}_2)$ corresponding to its vertices. The frequency value per triangle
$\mathrm{freq}_\Delta$ is taken to be the maximum of the three frequency values. To convert $\mathrm{freq}_\Delta$ to
the tessellation level per triangle $Lv_\Delta$, we use the following equation

$$Lv_\Delta = \mathrm{floor}(\mathit{tessscale} \times \mathrm{freq}_\Delta), \tag{2.21}$$

where tessscale is an arbitrary constant for the tradeoff. Adjusting tessscale to a higher
value will result in a finer grained tessellation. The parameter tessscale is coupled with
the frame rate and adjusted automatically using a PID controller. In this case, if
the computational power could be quantified, each unit of computational power
would bear more value in terms of rendering quality.
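A minimal Python sketch of Equation (2.21) together with a frame-rate-driven adjustment of tessscale is given below. The text only states that a PID controller is used; the positional PID form, the gains, the bias, and the class and function names here are all illustrative assumptions rather than the chapter's implementation.

```python
import math

def triangle_level(tessscale, vertex_freqs):
    """Equation (2.21), with the per-triangle frequency taken as the
    maximum of the three per-vertex frequencies."""
    return math.floor(tessscale * max(vertex_freqs))

class TessScaleController:
    """Textbook positional PID on the frame-rate error.
    Gains and bias are hypothetical values chosen for illustration."""

    def __init__(self, target_fps, kp=0.2, ki=0.05, kd=0.0, bias=1.0):
        self.target_fps = target_fps
        self.kp, self.ki, self.kd, self.bias = kp, ki, kd, bias
        self.integral = 0.0
        self.prev_error = 0.0
        self.tessscale = bias

    def update(self, measured_fps, dt):
        # Positive error means spare frame time, so tessscale may grow.
        error = measured_fps - self.target_fps
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = (self.bias + self.kp * error
                  + self.ki * self.integral + self.kd * derivative)
        self.tessscale = max(0.0, output)  # a negative scale is meaningless
        return self.tessscale
```

With this sign convention, running faster than the target frame rate raises tessscale and therefore the tessellation levels, while running slower lowers them.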
We allocate the GPU memory for the vertices, the inner tessellated vertices of
edges, and the inner tessellated vertices of triangles as three textures, namely vmap,
emap, and fmap (see Figure 2.9), where the sizes of emap and fmap are chosen to be
large enough for the maximum tessellation level. They are managed this way for two
practical reasons. First, any common edge of two neighboring triangles must have a
single tessellation level in order to render a smooth image across neighboring triangles
with different tessellation levels. Second, as our radiance evaluation is computationally
expensive, we want to ensure that the vertices and the common edges are evaluated
only once.

Figure 2.9. An illustrative figure to demonstrate the flow of our dynamic tessellation. Three
textures, namely vmap, emap, and fmap store the radiance values of the vertices, the inner tes-
sellated vertices of edges, and the inner tessellated vertices of triangles. Managing it this way
allows us to render the image seamlessly across the triangles with different tessellation levels.

Each row in emap and fmap contains the radiance of the tessellated vertices of a
triangle. Given a tessellation level $Lv$, we can calculate the number of tessellated
vertices of the inner edges and the inner triangles in closed form, i.e.,

$$N_{\mathrm{edge}}(Lv) = 2^{Lv} - 1, \tag{2.22}$$

$$N_{\mathrm{face}}(Lv) = \frac{(2^{Lv} - 1)(2^{Lv} - 2)}{2}. \tag{2.23}$$

Each edge and each face is associated with its own tessellation level. Given an edge
with tessellation level $Lv_{\mathrm{edge}}$, we will only evaluate the radiance for $N_{\mathrm{edge}}(Lv_{\mathrm{edge}})$
pixels in the emap. Similarly, given a triangle with tessellation level $Lv_{\mathrm{face}}$,
$N_{\mathrm{face}}(Lv_{\mathrm{face}})$ pixels will be evaluated in the fmap. The following geometry
shaders update the pixels of the inner edges and the inner triangles
depending on their tessellation levels.

    void main() // emap
    {
        float level_t0, level_t1, level;

        vec2 wh = vec2(textureSize(ttmap, 0));

        // level_t0 and level_t1 are the two tessellation
        // levels of the current edge.
        level_t0 = texture(ttmap, vec2(mod(btriangle_index[0].x + 0.5,
            512.0), floor((btriangle_index[0].x + 0.5) / 512.0) + 0.5) / wh).x;
        level_t1 = texture(ttmap, vec2(mod(btriangle_index[0].y + 0.5,
            512.0), floor((btriangle_index[0].y + 0.5) / 512.0) + 0.5) / wh).x;

        // the edge tessellation level is taken to be the minimum
        // of level_t0 and level_t1
        level = min(level_t0, level_t1);

        float N = pow(2.0, level); // Equation (2.22)

        // emit a line primitive covering N-1 pixels.

        EndPrimitive();
    }

    void main() // fmap
    {
        // the face tessellation level
        float level = texture(ttmap,
            vec2(mod(btriangle_index[0] + 0.5, 512.0),
                 floor((btriangle_index[0] + 0.5) / 512.0) + 0.5)
            / vec2(textureSize(ttmap, 0))).x;

        float N = pow(2.0, level);
        N = (N - 1.0) * (N - 2.0) / 2.0; // Equation (2.23)

        // emit a line primitive covering N pixels.

        EndPrimitive();
    }
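Equations (2.22) and (2.23) can be cross-checked by counting. The Python sketch below assumes that tessellation level Lv splits each triangle edge into 2^Lv segments and enumerates the interior points of the resulting barycentric grid; the function names are hypothetical.

```python
def n_edge(lv):
    """Equation (2.22): inner tessellated vertices on one edge."""
    return 2 ** lv - 1

def n_face(lv):
    """Equation (2.23): inner tessellated vertices inside one triangle."""
    n = 2 ** lv
    return (n - 1) * (n - 2) // 2

def count_by_enumeration(lv):
    """Count inner vertices on a barycentric grid with 2**lv segments per edge."""
    n = 2 ** lv
    # Points strictly between the two end points of an edge.
    edge = sum(1 for i in range(1, n))
    # Grid points (i, j, k) with i + j + k = n and i, j, k >= 1.
    face = sum(1 for i in range(1, n) for j in range(1, n - i))
    return edge, face
```

For example, level 3 gives 7 inner vertices per edge and 21 per face, which is the number of pixels a row of emap or fmap has to hold at that level.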

At render time, we prepare the per-vertex frequency map and then the per-face
tessellation level map. Given the tessellation levels, we evaluate the radiance values
for vmap, emap, and fmap. Then, we render vmap, emap, and fmap to a screen
buffer through straightforward rasterization (note that special attention has to be paid
when rendering the common edges). The screen buffer will then contain the radiance values
for the diffuse component and specular component individually. Finally, the
post-processing effects, e.g., texture mapping, material color, and bump mapping, are
applied. The most beautiful part of this process is that it is all completed on the
GPU.

2.8 Results
Now, we examine the performance of our rendering algorithm. We begin with
the rendering information and the model information. The rendering was carried out
with a GTX 660 display card. The screen resolution was 1280 × 962. The illumination
consisted of 400k light sources (i.e., St. Peter HDR cubemap; resolution 6 × 256 × 256).
As mentioned in Section 2.1, our algorithm does not require 3D models to be
densely tessellated. As shown in Table 2.1, the girl model and the teapot model are
coarsely tessellated, with only 6655 and 2038 vertices and 10333 and 3706 faces,
respectively. Their precomputed data sizes are 11.5 MB and 3.14 MB, and the memory
allocated to buffer the required tessellated data is 10.9 MB and 4.1 MB. Given
the GPU graphics memory capacity nowadays (usually in GBs), both the precomputed
data and the buffer for the tessellated data can reside in GPU memory
comfortably. The preparation times for their vectorized visibility are 3.27 minutes and
0.93 minutes, where the preparation time is measured from loading the model file to
having all the required data saved to the hard disk.
The dynamic tessellation here serves the role to distribute the computational re-
sources adaptively to the region in need. tessscale controls the overall spending of the
computational resources. Figure 2.10 shows the distribution of samples with the aware-
ness to the content frequency given more and more computational resources. As shown
in Figure 2.10, the density of samples starts increasing from some selective regions
first, then the less selective regions.
In particular, by evaluating just a fraction of the sample points of the per-fragment
rendering (see Figure 2.10(d)), the dynamic tessellation (see Figure 2.11(d)) attains
an almost identical result to the per-fragment rendering (see Figure 2.11(c)), or even
the brute force rendering (see Figure 2.11(a)).
The PID controller allows the users to specify their expected frame rate. The PID
controller will adjust tessscale and try to fulfill the user expectation. The realized frame
rate is limited to somewhere between the per-vertex and the per-fragment frame rates. In
particular, for the girl model, the expected frame rate we specified is 20 fps, the
per-fragment frame rate is 2.1 fps, and the per-vertex frame rate is 83 fps (see Table 2.1).
The dynamic tessellation improves the rendering speed to almost ten times the
per-fragment frame rate, while achieving almost identical rendering results.
In addition, our algorithm can also capture the visual impression of the specular
component shadows as shown in Figure 2.8 and Figure 2.11. However, the visual
impression of the specular component shadows is not very appreciable in static images.
Please see the video for the behavior of the specular component shadows in
action.⁵

Figure 2.10. The dynamic tessellation with different tessscale values for the girl model:
(a) tessscale 1, (b) tessscale 3, (c) tessscale 7, (d) tessscale 10.1. Increasing
the tessscale will increase the density of samples for the more demanding regions.

⁵ The video of the specular component shadows from our demo program: https://youtu.be/dlLLZTyY7xs.

Figure 2.11. The rendering results for the girl model: (a) ground truth; (b) per-vertex; (c)
per-fragment; (d) dynamic tessellation with tessscale 10.1. By evaluating just a fraction of
the sample points of the per-fragment rendering (see Figure 2.10(d)), the dynamic tessellation
attains an almost identical result to the per-fragment rendering.

Our algorithm also supports the BRDF editing feature [Ben-Artzi et al. 2006],
which allows us to adjust the specular component glossiness continuously. The video
shows the specular component glossiness transition from a mirror to a very rough surface.⁶
The intention of our specular component approximation, Equation (2.12), is to
isolate the influence of visibility from the lighting function as a ratio. For the lighting
function, filtering the illumination without visibility can be handled with relatively
high accuracy even with the primitive form of cubemap mipmap. For the ratio estimator,
because of its geometrical formulation, the shadows will topologically make sense on
their own. Therefore, although the rendered results do not agree with the ground truth

⁶ The video demonstration of our BRDF editing feature: https://youtu.be/tvNnQfL5UT4.

                                                      The girl model                            The teapot model
Number of vertices                                    6655                                      2038
Number of faces                                       10333                                     3706
Data size                                             11.5 MB                                   3.14 MB
Per-vertex fps                                        83 fps                                    192 fps
Per-fragment fps                                      2.1 fps                                   2.1 fps
Dynamic tessellation fps                              20 fps (explicitly maintained with PID)   20 fps (explicitly maintained with PID)
Total memory size to buffer the tessellated data      10.9 MB                                   4.1 MB
Preparation time for the visibility                   3.27 minutes                              0.93 minutes
Table 2.1. The rendering information. The rendering is carried out with a GTX 660 display
card. The screen resolution is 1280 × 962. The illumination consists of 400k light sources (i.e.,
St. Peter HDR cubemap; resolution 6 × 256 × 256). The information provided in the table shows
that our algorithm does not require 3D models to be densely tessellated, the precomputed data
size is very small, and the dynamic tessellation achieves almost ten times the per-fragment
frame rate while maintaining the image quality.

precisely, there will be no identifiable visual clues within a rendered image, e.g., light
bleeding, to tell whether the ground truth images or our rendered results are closer to
our world (see Figure 2.8 and Figure 2.11). However, the approximation error is indeed
larger than the static image comparisons suggest. The video⁷ shows the visual loss
due to our approximation, and we can see that our specular shadows are less responsive
to the illumination than the ground truth.

2.9 Conclusion
In this article, we presented the implementation and the rendering algorithm relying on
vectorized visibility. By exploiting its vector graphics properties and the GPU parallel
architecture, our rendering algorithm supports a number of functionalities that appear
impractical to support simultaneously, e.g., per-fragment direct illumination with
all-frequency shadow quality, coarsely tessellated 3D models, and BRDF editing.
By integrating the dynamic tessellation feature, the scalability of our algorithm is
improved drastically: it is faster than the per-fragment mode with better quality than the
per-vertex mode. Having the frame rates coupled with the tessellation level, we practically
⁷ The video to demonstrate the visual loss due to our specular approximation: https://youtu.be/cgiKwniyktA.

make the frame rate an intrinsic property among the devices with different computa-
tional capability.
While interreflection is not yet supported by the presented algorithm, it is feasible
to extend our algorithm to support it by approximating the transfer functions in the
higher dimension space, which is one of our ongoing research projects.

2.10 Acknowledgments
The work is supported by a research grant (CityU 11259516) from the Hong Kong Special Ad-
ministrative Region.

Bibliography
BEN-ARTZI, A., OVERBECK, R., AND RAMAMOORTHI, R. 2006. Real-time BRDF editing in com-
plex lighting. In ACM Trans. Graph., 25:3, pp. 945–954.
BRAUER, D. 2010. MatCap. URL: http://wiki.unity3d.com/index.php/MatCap.
CROW, F. 1984. Summed-area tables for texture mapping. In ACM SIGGRAPH computer
graphics, 18:3.
ELCOTT, S., ET AL. 2016. Rendering techniques of final fantasy XV. ACM SIGGRAPH 2016
Talks.
HO, T., XIAO, Y., FENG, R., LEUNG, C., AND WONG, T. 2015. All-Frequency Direct Illumination
with Vectorized Visibility. In IEEE Trans. Vis. Comput. Graph., 21:8, pp. 945–958.
KAUTZ, J., SLOAN, P., AND SNYDER, J. 2002. Fast, arbitrary BRDF shading for low-frequency
lighting using spherical harmonics. In Proc. 13th Eurograph. Workshop Rendering, pp. 291–
297.
LAM, P., HO, T., LEUNG, C., AND WONG, T. 2010. All-frequency lighting with multiscale spher-
ical radial basis functions. In IEEE Trans. Vis. Comput. Graph., 16, pp. 43–56.
LIU, X., SLOAN, P., SHUM, H., AND SNYDER, J. 2004. All-frequency precomputed radiance trans-
fer for glossy objects. In Proc. Eurograph. Symp. Rendering, pp. 337–344.
MATUSIK, W., PFISTER, H., BRAND, M., AND MCMILLAN, L. 2003. A data-driven reflectance
model. In ACM SIGGRAPH 2003 Papers (SIGGRAPH ‘03). ACM, pp. 759–769. DOI:
https://doi.org/10.1145/1201775.882343.
MORENO, J. 2018. MatCap Shaders. http://jeanmoreno.com/unity/matcap.
NG, R., RAMAMOORTHI, R., AND HANRAHAN, P. 2003. All-frequency shadows using non-linear
wavelet lighting approximation. In ACM Transactions on Graphics, 22:3, pp. 376–381.
NG, R., RAMAMOORTHI, R., AND HANRAHAN, P. 2004. Triple product wavelet integrals for all-
frequency relighting. In ACM Transactions on Graphics, 23:3.
RUSSELL, J. 2015. HDR Image-Based Lighting on the Web. In WebGL Insights. CRC Press,
pp. 253–260.

SCHEUERMANN, T. AND ISIDORO, J. 2006. Cubemap filtering with cubemapgen. Game Devel-
opers Conference 2006.
SLOAN, P., KAUTZ, J., AND SNYDER, J. 2002. Precomputed radiance transfer for real-time ren-
dering in dynamic, low-frequency lighting environments. ACM Trans. Graph. 21, pp. 527–
536.
SLOAN, P., HALL, J., HART, J., AND SNYDER, J. 2003. Clustered principal components for pre-
computed radiance transfer. ACM Trans. Graph., 22, pp. 382–391.
TSAI, Y. AND SHIH, Z. 2006. All-frequency precomputed radiance transfer using spherical radial
basis functions and clustered tensor approximation. In ACM Transactions on Graphics, 25:3.
WANG, J., REN, P., GONG, M., SNYDER, J., AND GUO, B. 2009. All-frequency rendering of dy-
namic, spatially-varying reflectance. In ACM Trans. Graph., 28, pp. 1–10.
WANG, R., PAN, M., CHEN, W., REN, Z., ZHOU, K., HUA, W., AND BAO, H. 2013. Analytic double
product integrals for all-frequency relighting. In IEEE Trans. Vis. Comput. Graph., 19:7, pp.
1133–1142.
3 Nonperiodic Tiling of Noise-based Procedural Textures
Aleksandr Kirillov

3.1 Introduction
Procedural noise functions have been one of the key tools for adding visual fidelity in
computer graphics for decades. They serve as a foundation for landscape geometry
synthesis, creation of textures containing surface properties such as color and normals,
simulation of atmospheric effects and many other tasks.
With the ever-growing game environment scale and the amount of detail expected
from modern games, developers are more and more frequently faced with content pro-
duction challenges. Reduction of the time required for authoring and iterating on con-
tent is always among the hottest topics in the industry.
Procedural content creation is an increasingly popular solution to this problem.
Procedural methods allow developers to automate and simplify tasks, from object
placement to texture creation. Many large studios as well as independent developers
that cannot afford to produce assets manually already employ them. It is highly likely
that these methods are going to become an industry standard, and an integral part of
all modern content pipelines.
Most of the time games cannot afford to evaluate procedural noise functions at
runtime and instead store the precomputed results in textures. The majority of them
only use basic tiling options provided by the hardware. Noise functions that are de-
signed with efficient evaluation on the GPU in mind are periodic with a relatively small
period. Both cases lead to either repetition or loss of detail. A common approach to
overcome this is to increase the texture resolution, to add decal textures or employ
multitexturing. This results in an increase in memory consumption and memory band-
width requirements.
In this chapter, we present a method to combine noise-based procedural texture
synthesis with a nonperiodic tiling algorithm. We describe modifications to several
popular procedural noise functions that directly produce texture maps containing the


smallest possible complete Wang tile set. Our approach can be used as a preprocessing
step or during application runtime.
Additionally, we present several improvements of the algorithm [Wei 2004] that
implements Wang tiling on the GPU. We show how our modifications enable nonperi-
odic tiling for a large range of noise-based procedurally generated textures (see an ex-
ample in Figure 3.1). Finally, we analyze the effect these modifications have on the
performance, and discuss their limitations.

Figure 3.1. An image generated with our algorithm.

3.2 Wang Tiles


Wang tiling uses a set of rectangular tiles of the same size with color-coded edges. A
valid surface tiling is any arrangement of tiles, placed without rotation or reflection,
in which every pair of adjacent tiles has the same color on the shared edge (see
Figure 3.2). The initial assumption was that if a finite set of tiles could tile a plane
in a valid way, a periodic tiling would exist as well. Later research has shown that
finite sets exist that tile a plane only nonperiodically. A series of successor works
reduced the size of such sets; the latest [Jeandel and Rao 2015] proves that the minimal
set required for nonperiodicity consists of 11 tiles with 4 edge colors.
Wang tiles were first used for synthesizing large non-repetitive textures by Stam
[1997]. He described construction of a very limited set of patterns, mostly focusing on
water rendering. Neyret and Cani [1999] proposed using triangular patches with color-
coded edges instead of rectangles, described an algorithm to map those patches to sur-
faces, and provided techniques to generate them procedurally based on Perlin and Wor-
ley noises. Wei and Levoy [2000], as well as Efros and Freeman [2001], proposed
methods to avoid repetition by using a small texture tile to create a larger non-repetitive
texture that looks similar to this tile. While the result is guaranteed to be seamlessly
tileable, both methods can introduce seams inside the texture and require additional
computation and storage. Cohen et al. [2003] adapted these methods to generate sets
of Wang tiles and introduced a stochastic process of laying down individual tiles to
produce nonperiodic tilings. Wei [2004] enabled hardware filtering for Wang tiles by
laying them out in a single texture that itself forms a valid tiling, and proposed a hash-
based algorithm to map the tiles on a plane.
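To make the validity rule and the stochastic layout concrete, the following Python sketch (an illustration only, not part of the accompanying source code) lays tiles down in scanline order in the spirit of Cohen et al. [2003]. It assumes a complete tile set, so any combination of four edge colors is available; the names `stochastic_wang_tiling` and `is_valid_tiling` are hypothetical.

```python
import random

def stochastic_wang_tiling(width, height, num_colors=2, seed=0):
    """Return grid[y][x] = (left, right, bottom, top) edge colors.

    Each tile copies the colors of the edges it shares with its left and
    lower neighbors and picks the two remaining edges at random. Because the
    tile set is assumed to be complete, the chosen tile always exists.
    """
    rng = random.Random(seed)
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            left = row[x - 1][1] if x > 0 else rng.randrange(num_colors)
            bottom = grid[y - 1][x][3] if y > 0 else rng.randrange(num_colors)
            right = rng.randrange(num_colors)
            top = rng.randrange(num_colors)
            row.append((left, right, bottom, top))
        grid.append(row)
    return grid

def is_valid_tiling(grid):
    """Check that every pair of adjacent tiles agrees on the shared edge."""
    for y, row in enumerate(grid):
        for x, (left, right, bottom, top) in enumerate(row):
            if x + 1 < len(row) and right != row[x + 1][0]:
                return False
            if y + 1 < len(grid) and top != grid[y + 1][x][2]:
                return False
    return True

tiling = stochastic_wang_tiling(8, 6)
print(is_valid_tiling(tiling))
```

With two edge colors the grid never contains more than 16 distinct tiles, which is the size of the complete set used in the rest of the chapter.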

Figure 3.2. A schematic representation of Wang tiling. (a) A complete Wang tile set with two
edge colors. (b) A valid Wang tiling of a surface using the tile set from (a).

Cohen et al. [2003] observed that Wang tiles do not take into account tile corners, which
may result in a discontinuity in the resulting image. Corner tiles [Lagae and Dutré
2006] address this problem by restricting the diagonal tile neighbors in addition to the
horizontal and vertical ones.
We observe, however, an insufficiency in modern Wang tile set synthesis methods.
Many procedural noise functions are limited to producing only periodic images. Noise
functions that allow the construction of nonperiodic noise at runtime are usually com-
putationally expensive, which makes them unusable by many real-time applications.
We lift these limitations by combining Wang tiles and procedural noise functions.
In the following section we present modifications to several noise synthesis algo-
rithms (see Figure 3.3) which directly output a minimal complete set of Wang tiles (16
square tiles with 2 edge colors). The tiles are arranged in a single texture map as pro-
posed by Wei [2004]. Our methods guarantee seamless tiling, can be extended to sup-
port more edge colors or higher dimensions, and can be applied to the noise synthesis
either during precomputation or at runtime. Full implementations of the original and
the modified noise functions are available in the accompanying source code.

3.3 Nonperiodic Tiling of Procedural Noise Functions

3.3.1 Perlin Noise


Perlin noise [Perlin 1985, Perlin 2002] (Figure 3.3(a)) is probably the best-known
procedural noise function. It is fast and simple, and is widely used to simulate natural
phenomena, such as clouds, fire and smoke.
Perlin noise assigns a pseudo-random gradient vector to each point of the integer
lattice. To evaluate the function at a point on the 2D plane, the gradient vector of each
of the four closest lattice points is projected onto the vector from that lattice point
to the current point, and the four resulting values are smoothly interpolated (see
Listing 3.1). The pseudo-randomness is the result of successive hashing of the lattice
point coordinates with a permutation table.

Figure 3.3. Procedural noise function examples. (a) Perlin noise. (b) Better gradient noise. (c)
Anisotropic and (d) isotropic Gabor noises. (e) Worley noise.

// The code in this chapter is tuned for better readability,
// not for optimal performance. The code is provided in Cg.
float PerlinNoise2D(float2 xy) {
    // Corners of the lattice cell that contains xy
    uint2 in00 = floor(xy);
    uint2 in11 = in00 + uint2(1u, 1u);
    uint4 bounds = uint4(in00, in11);

    // Interpolate is either a cubic (Perlin noise) or
    // a quintic (Improved Perlin noise) polynomial
    float2 weights = Interpolate(xy - in00);
    uint2 out00 = in00;
    uint2 out01 = uint2(in00.x, in11.y);
    uint2 out10 = uint2(in11.x, in00.y);
    uint2 out11 = in11;

    // See the accompanying sample code for definitions
    // of gradients[] and hash()
    // Project each corner gradient onto the offset from that corner to xy
    float g00 = dot(gradients[hash(out00)], xy - bounds.xy);
    float g01 = dot(gradients[hash(out01)], xy - bounds.xw);
    float g10 = dot(gradients[hash(out10)], xy - bounds.zy);
    float g11 = dot(gradients[hash(out11)], xy - bounds.zw);

    // Interpolate along y first, then along x
    float v0 = lerp(g00, g01, weights.y);
    float v1 = lerp(g10, g11, weights.y);
    return lerp(v0, v1, weights.x);
}

Listing 3.1. Evaluation of 2D Perlin noise.

Many noise functions are derived from Perlin noise. Improved Perlin noise [Perlin
2002] reduced visual artifacts in the derivatives and improved overall noise appearance.
Modified noise [Olano 2005] replaced the permutation table with a hash function
based on a pseudo-random number generator (PRNG), making the function evaluation
very efficient on the GPU. Better gradient noise [Kensler et al. 2008] (Figure 3.3(b))
used a permutation table per dimension, and different hashing to better decorrelate the
values. It also widened the filter kernel to improve band-limitation.
All the functions derived from Perlin noise have a common property: they are pe-
riodic, with a relatively small period. The period of these noise functions is determined
either by the period of the hash function or by the size of the permutation table.
Toroidal boundary handling is a common way of making the lattice-based noise
functions seamlessly tileable. To make the noise wrap around with a period of N, the
lattice coordinates are taken modulo N:

uint2 out00 = in00 % uint2(N, N);
uint2 out01 = uint2(in00.x, in11.y) % uint2(N, N);
uint2 out10 = uint2(in11.x, in00.y) % uint2(N, N);
uint2 out11 = in11 % uint2(N, N);
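The effect of the wrapping can be checked with a small stand-alone Python sketch. It is a simplified illustration, not the chapter's implementation: the shuffled permutation table, the 16 unit gradients and the names `lattice_hash` and `perlin_2d` are assumptions made for this example, and the quintic interpolant of improved Perlin noise is used.

```python
import math
import random

def fade(t):
    # Quintic interpolant of improved Perlin noise: 6t^5 - 15t^4 + 10t^3.
    return t * t * t * (t * (t * 6.0 - 15.0) + 10.0)

def lerp(a, b, t):
    return a + (b - a) * t

# Hypothetical stand-ins for the permutation table and gradient set.
_rng = random.Random(42)
PERM = list(range(256))
_rng.shuffle(PERM)
GRADIENTS = [(math.cos(2.0 * math.pi * i / 16), math.sin(2.0 * math.pi * i / 16))
             for i in range(16)]

def lattice_hash(ix, iy):
    # Successive hashing of the lattice coordinates with the permutation table.
    return PERM[(PERM[ix & 255] + iy) & 255] & 15

def perlin_2d(x, y, period=None):
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = fade(x - x0), fade(y - y0)

    def corner(ix, iy):
        # Toroidal boundary handling: only the hashed coordinate is wrapped,
        # the offset to the sample point uses the unwrapped lattice point.
        hx, hy = (ix % period, iy % period) if period else (ix, iy)
        gx, gy = GRADIENTS[lattice_hash(hx, hy)]
        return gx * (x - ix) + gy * (y - iy)

    v0 = lerp(corner(x0, y0), corner(x0, y0 + 1), wy)
    v1 = lerp(corner(x0 + 1, y0), corner(x0 + 1, y0 + 1), wy)
    return lerp(v0, v1, wx)

# With period=4 the function repeats every 4 lattice units in x and y.
print(perlin_2d(0.3, 0.7, period=4), perlin_2d(4.3, 0.7, period=4))
```

The two printed values agree up to floating-point rounding, which is the property that makes a texture of 4 by 4 lattice cells seamlessly tileable.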

We use a similar approach to construct a Wang tile set. We subdivide all the points
of the lattice into two non-intersecting groups: boundary points, which form the
boundaries of the Wang tiles within the texture, and inner points. We further subdivide
the boundary points into corner points, $C$; vertical border points (one subgroup
per Wang tile edge color), $V_0$ and $V_1$; and horizontal border points, $H_0$ and $H_1$
(see Figure 3.4(a)). The modified function needs to guarantee that the values at the tile
borders that have the same edge color are the same, or, in other words, that the lattice
points that are at the same position local to the tile and are in the same group share a
gradient vector.

Figure 3.4. A schematic representation of our method for producing a Wang tile set from Perlin
noise. (a) Lattice point subdivision. Each group is assigned a different color: grey ($C$), light blue
($V_0$), dark orange ($V_1$), light orange ($H_0$), dark blue ($H_1$), white (inner points). (b) The
resulting positions of points in each group after coordinate mapping. The inner points are not
affected. (c) The resulting noise texture.

uint2 TransformCoord(uint2 coord) {
    // (kPointsPerRow + 1)^2 is the size of the lattice covering
    // the resulting texture (65x65 in this case).
    // Each tile gets (kPointsPerRow / 4 + 1) lattice points
    const uint2 kPointsPerRow = 64;
    const uint2 kPointsPerTile = kPointsPerRow / 4;

    uint2 localCoord = coord % kPointsPerRow;
    uint2 inTileCoord = localCoord % kPointsPerTile;
    uint2 tileCoord = localCoord / kPointsPerTile;

    bool isXBorder = inTileCoord.x == 0;
    bool isYBorder = inTileCoord.y == 0;

    // Tile edge colors in the final layout:
    // tile coord (binary): 00 01 10 11
    // left:                0  0  1  1
    // bottom:              0  0  1  1
    uint left = (tileCoord.x / 2) & 1;
    uint bottom = (tileCoord.y / 2) & 1;

    uint2 offset = uint2(bottom, left) * kPointsPerTile;
    uint2 borderCoord = inTileCoord + offset;
    uint borderX = isYBorder ? borderCoord.x : localCoord.x;
    uint borderY = isXBorder ? borderCoord.y : localCoord.y;
    uint x = isXBorder ? 0 : borderX;
    uint y = isYBorder ? 0 : borderY;

    return uint2(x, y);
}

Listing 3.2. A coordinate mapping function used with Perlin noise to produce Wang tiles.

The easiest way to achieve this is to modify the lattice point coordinates. We first
check which group the lattice point belongs to, and then map the points that fall into
the same group to the same region of the lattice (see Listing 3.2).
For simplicity, we chose the following mapping: all corner points end up at $(0, 0)$;
horizontal borders either at $(x_t, 0)$ or at $(x_t + K, 0)$, depending on the edge color;
vertical borders at $(0, y_t)$ or $(0, y_t + K)$, where $K$ equals kPointsPerTile, and
$(x_t, y_t)$ is the coordinate of the point local to the tile (see Figure 3.4(b)). As long
as the regions that these groups are mapped to do not overlap, this choice does not
noticeably affect the resulting noise.
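The mapping can be tried out on a few lattice points with the following Python port of Listing 3.2. The port is an illustration only; the function name `transform_coord` and the tuple-based interface are choices made for this sketch.

```python
K_POINTS_PER_ROW = 64
K_POINTS_PER_TILE = K_POINTS_PER_ROW // 4  # K = 16

def transform_coord(x, y):
    """Python port of TransformCoord from Listing 3.2."""
    lx, ly = x % K_POINTS_PER_ROW, y % K_POINTS_PER_ROW
    in_x, in_y = lx % K_POINTS_PER_TILE, ly % K_POINTS_PER_TILE
    tile_x, tile_y = lx // K_POINTS_PER_TILE, ly // K_POINTS_PER_TILE

    is_x_border = in_x == 0  # point lies on a vertical tile border
    is_y_border = in_y == 0  # point lies on a horizontal tile border

    left = (tile_x // 2) & 1    # color of the left edge of the tile
    bottom = (tile_y // 2) & 1  # color of the bottom edge of the tile

    border_x = in_x + bottom * K_POINTS_PER_TILE
    border_y = in_y + left * K_POINTS_PER_TILE
    out_x = 0 if is_x_border else (border_x if is_y_border else lx)
    out_y = 0 if is_y_border else (border_y if is_x_border else ly)
    return (out_x, out_y)

print(transform_coord(32, 48))  # corner point
print(transform_coord(16, 5))   # vertical border, edge color 0
print(transform_coord(32, 5))   # vertical border, edge color 1
print(transform_coord(5, 32))   # horizontal border, edge color 1
print(transform_coord(21, 39))  # inner point, left unchanged
```

The corner point maps to (0, 0), the two vertical border points to (0, 5) and (0, 21), the horizontal border point to (21, 0), and the inner point stays at (21, 39), which matches the mapping described above with K = 16.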
The last step is to call the TransformCoord function on the lattice point
coordinates, in place of the plain assignments used in Listing 3.1:

uint2 out00 = TransformCoord(in00);
uint2 out01 = TransformCoord(uint2(in00.x, in11.y));
uint2 out10 = TransformCoord(uint2(in11.x, in00.y));
uint2 out11 = TransformCoord(in11);

The resulting noise is shown in Figure 3.4(c). The TransformCoord function can
be used together with many other lattice-based procedural noise functions. We suc-
cessfully applied it to modified noise and Perlin noise with the xor hashing function
introduced by Kensler et al. [2008].
In order to avoid the corner problem mentioned in Section 3.2, our modifications
are defined in such a way that all the tile corners are the same, and thus are guaranteed
not to produce any discontinuities.
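The matching of same-colored edges can be verified numerically. The Python sketch below is an illustration under assumptions: it combines a port of Listing 3.2 with a simplified Perlin noise whose permutation table and gradient set are hypothetical stand-ins for hash() and gradients[] from the accompanying code.

```python
import math
import random

K_POINTS_PER_ROW = 64
K_POINTS_PER_TILE = K_POINTS_PER_ROW // 4

def transform_coord(x, y):
    """Python port of TransformCoord from Listing 3.2."""
    lx, ly = x % K_POINTS_PER_ROW, y % K_POINTS_PER_ROW
    in_x, in_y = lx % K_POINTS_PER_TILE, ly % K_POINTS_PER_TILE
    left = ((lx // K_POINTS_PER_TILE) // 2) & 1
    bottom = ((ly // K_POINTS_PER_TILE) // 2) & 1
    is_x_border, is_y_border = in_x == 0, in_y == 0
    border_x = in_x + bottom * K_POINTS_PER_TILE
    border_y = in_y + left * K_POINTS_PER_TILE
    out_x = 0 if is_x_border else (border_x if is_y_border else lx)
    out_y = 0 if is_y_border else (border_y if is_x_border else ly)
    return out_x, out_y

# Hypothetical stand-ins for the permutation table and gradient set.
_rng = random.Random(7)
PERM = list(range(256))
_rng.shuffle(PERM)
GRADIENTS = [(math.cos(2.0 * math.pi * i / 16), math.sin(2.0 * math.pi * i / 16))
             for i in range(16)]

def fade(t):
    return t * t * t * (t * (t * 6.0 - 15.0) + 10.0)

def wang_perlin(x, y):
    """Perlin noise whose lattice coordinates are remapped before hashing."""
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = fade(x - x0), fade(y - y0)

    def corner(ix, iy):
        mx, my = transform_coord(ix, iy)
        gx, gy = GRADIENTS[PERM[(PERM[mx & 255] + my) & 255] & 15]
        return gx * (x - ix) + gy * (y - iy)

    c00, c01 = corner(x0, y0), corner(x0, y0 + 1)
    c10, c11 = corner(x0 + 1, y0), corner(x0 + 1, y0 + 1)
    v0 = c00 + (c01 - c00) * wy
    v1 = c10 + (c11 - c10) * wy
    return v0 + (v1 - v0) * wx

K = K_POINTS_PER_TILE
# Left edges of tile columns 0 and 1 both have color 0, so the noise along
# them agrees regardless of the tile row.
print(wang_perlin(0, 3.5), wang_perlin(K, 3.5 + 2 * K))
```

The same comparison holds for the left edges of tile columns 2 and 3 (color 1) and for horizontal edges of equal color, up to floating-point rounding.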

3.3.2 Better Gradient Noise


Better gradient noise [Kensler et al. 2008] uses a 4 × 4 filter instead of a 2 × 2 filter
employed by Perlin noise. In order to ensure that the colors at the tile borders are the
same, we increase the border thickness to cover two additional rows of lattice points
(see Figures 3.5(a) and 3.5(b) and Listing 3.3).
With Perlin noise we had to care only about the left and the bottom borders of the
tile, because the other two borders were handled by the adjacent tiles. Here, we also
take into account lattice points that lie within the tile, and have to check which group
these points belong to as well.
In order to simplify the mapping, we define the boundary point subdivision into
subgroups in a way different from the one we used for Perlin noise. The corner points
and the points that belong to the borders of one color form one group, $C_0$, while the
points that belong to the borders of the second color are in another group, $C_1$:
$C_0 = C \cup V_0 \cup H_0$, $C_1 = V_1 \cup H_1$. Points from the group $C_0$ are mapped to
the coordinate local to the tile, $(x_t, y_t)$. Points from the other group are mapped to
positions with a fixed offset from the coordinate local to the tile, $(x_t + K, y_t + K)$,
where $K$ is equal to kPointsPerTile. The resulting noise is shown in Figure 3.5(c).

Figure 3.5. A schematic representation of our method for producing a Wang tile set from better
gradient noise. (a) Lattice point subdivision. The groups have the same colors as in Figure 3.4(a).
(b) The resulting positions of points in each group after coordinate mapping. (c) The resulting
noise.

uint2 TransformCoordWideBorder(uint2 coord)
{
    // (kPointsPerRow + 1)^2 is the size of the lattice covering
    // the resulting texture (65x65 in this case).
    // Each tile gets (kPointsPerRow / 4 + 1) lattice points
    const uint2 kPointsPerRow = 64;
    const uint2 kPointsPerTile = kPointsPerRow / 4;
    uint2 localCoord = coord % kPointsPerRow;
    uint2 inTileCoord = localCoord % kPointsPerTile;
    uint2 tileCoord = localCoord / kPointsPerTile;
    const uint2 kBorder00 = uint2(2, 2);
    const uint2 kBorder11 = kPointsPerTile - kBorder00;

    bool isX0Border = inTileCoord.x <= kBorder00.x;
    bool isX1Border = inTileCoord.x >= kBorder11.x;
    bool isXBorder = isX0Border || isX1Border;
    bool isY0Border = inTileCoord.y <= kBorder00.y;
    bool isY1Border = inTileCoord.y >= kBorder11.y;
    bool isYBorder = isY0Border || isY1Border;
    bool isBorder = isXBorder || isYBorder;
    bool isCorner = isYBorder && isXBorder;

    // Tile edge colors in the final layout:
    // tile coord (binary): 00 01 10 11
    // left:                0  0  1  1
    // bottom:              0  0  1  1
    // right:               0  1  1  0
    // top:                 1  0  0  1
    uint left = (tileCoord.x / 2) & 1;
    uint bottom = (tileCoord.y / 2) & 1;
    uint right = left != (tileCoord.x & 1);
    uint top = bottom == (tileCoord.y & 1);

    bool zeroX = isCorner || !isXBorder;
    bool zeroY = isCorner || !isYBorder;
    uint tileOffsetX = zeroX ? 0 : (isX0Border ? left : right);
    uint tileOffsetY = zeroY ? 0 : (isY1Border ? top : bottom);
    uint2 tileOffset = uint2(tileOffsetX, tileOffsetY);

    uint2 offset = tileOffset * kPointsPerTile + inTileCoord;
    return isBorder ? offset : localCoord;
}

Listing 3.3. A coordinate mapping function with a wider border.
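As a sanity check of the wide-border mapping, the following Python sketch ports Listing 3.3 and compares, for every tile edge of the layout, the mapped coordinates of all lattice points that a 4 × 4 filter reads when the noise is evaluated exactly on that edge. The port and the helper `edge_signature` are illustrative assumptions, not part of the accompanying code.

```python
K_POINTS_PER_ROW = 64
K_POINTS_PER_TILE = K_POINTS_PER_ROW // 4

def transform_coord_wide_border(x, y, border=2):
    """Python port of TransformCoordWideBorder from Listing 3.3."""
    lx, ly = x % K_POINTS_PER_ROW, y % K_POINTS_PER_ROW
    in_x, in_y = lx % K_POINTS_PER_TILE, ly % K_POINTS_PER_TILE
    tile_x, tile_y = lx // K_POINTS_PER_TILE, ly // K_POINTS_PER_TILE
    high = K_POINTS_PER_TILE - border

    is_x0, is_x1 = in_x <= border, in_x >= high
    is_y0, is_y1 = in_y <= border, in_y >= high
    is_x_border, is_y_border = is_x0 or is_x1, is_y0 or is_y1
    if not (is_x_border or is_y_border):
        return (lx, ly)  # inner points are not remapped
    is_corner = is_x_border and is_y_border

    left = (tile_x // 2) & 1
    bottom = (tile_y // 2) & 1
    right = int(left != (tile_x & 1))
    top = int(bottom == (tile_y & 1))

    zero_x = is_corner or not is_x_border
    zero_y = is_corner or not is_y_border
    off_x = 0 if zero_x else (left if is_x0 else right)
    off_y = 0 if zero_y else (top if is_y1 else bottom)
    return (off_x * K_POINTS_PER_TILE + in_x, off_y * K_POINTS_PER_TILE + in_y)

def edge_signature(index, vertical):
    """Mapped coordinates of the lattice points read along one tile edge.

    `index` selects the column boundary (vertical=True) or the row boundary
    (vertical=False) at lattice coordinate index * K_POINTS_PER_TILE.
    """
    k = K_POINTS_PER_TILE
    signature = []
    for along in range(-1, k + 3):
        for across in range(index * k - 1, index * k + 3):
            x, y = (across, along) if vertical else (along, across)
            signature.append(transform_coord_wide_border(x, y))
    return signature

for vertical in (True, False):
    same_01 = edge_signature(0, vertical) == edge_signature(1, vertical)
    same_23 = edge_signature(2, vertical) == edge_signature(3, vertical)
    differ = edge_signature(0, vertical) != edge_signature(2, vertical)
    print(vertical, same_01, same_23, differ)
```

In this sketch the four column boundaries and the four row boundaries each fall into two classes (boundaries 0 and 1, and boundaries 2 and 3), which play the role of the two edge colors. Each tile row and column therefore combines a different pair of edge classes, as required for a complete two-color set.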

3.3.3 Gabor Noise and Worley Noise


Gabor noise [Lagae et al. 2009] (Figures 3.3(c) and 3.3(d)) is a procedural noise func-
tion that offers intuitive and direct control over the power spectrum, and at the same
time supports anisotropy. It is based on the sparse convolution noise [Lewis 1989].
Sparse convolution noise is constructed by convolving an arbitrary kernel $k$ with a
Poisson noise process $\gamma$,

$$N(x, y) = \iint \gamma(u, v)\, k(x - u, y - v)\, du\, dv. \tag{3.1}$$

The Poisson process consists of impulses of uncorrelated intensity distributed at
random uncorrelated locations $(x_k, y_k)$,

$$\gamma(x, y) = \sum_k a_k\, \delta(x - x_k, y - y_k). \tag{3.2}$$

The Gabor kernel $g$ is a multiplication of a circular Gaussian and a directional cosine
wave,

$$g(x, y) = K e^{-\pi a^2 (x^2 + y^2)} \cos\left(2 \pi F_0 (x \cos \omega_0 + y \sin \omega_0)\right). \tag{3.3}$$

The parameters $K$ and $a$ control the magnitude and the radius of the Gaussian. $F_0$ and
$\omega_0$ control the frequency and the orientation of the cosine wave.
Gabor noise is constructed as a sparse convolution noise that uses the Gabor kernel
as its kernel. Because the power spectrum of a sparse convolution noise is the power
spectrum of the kernel, scaled by a constant as shown by Lewis [1989], the parameters
of the Gabor kernel provide direct control over the power spectrum of the resulting
noise.
In order to increase the computational efficiency, Gabor noise is evaluated on a
grid. The properties of the Gabor kernels are generated per cell on the fly using a
PRNG. The size of a grid cell equals the radius of the Gabor kernels. This allows us to
limit the evaluation of the noise function to the cell containing the point being evalu-
ated, and its immediate neighbors.
Lagae et al. [2009] provide ways to produce both periodic and nonperiodic noise.
In order to obtain noise with a period of N, the grid cells are enumerated in row-major
order with cell coordinates taken modulo N. The seed for the PRNG is then
calculated as a sum of the cell index and a global offset parameter. Nonperiodic noise
uses Morton order for cell enumeration.
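The grid-based evaluation and the periodic seeding can be sketched in Python as follows. This is a simplified illustration, not the implementation of Lagae et al. [2009]: coordinates are measured in cell units, every cell holds a fixed number of impulses instead of a Poisson-distributed count, kernels are not truncated at their radius, and all parameter values and function names are assumptions.

```python
import math
import random

def gabor_kernel(K, a, F0, omega0, x, y):
    """Equation (3.3): a circular Gaussian multiplied by a directional cosine."""
    envelope = K * math.exp(-math.pi * a * a * (x * x + y * y))
    phase = 2.0 * math.pi * F0 * (x * math.cos(omega0) + y * math.sin(omega0))
    return envelope * math.cos(phase)

def cell_seed(cx, cy, period, offset):
    """Row-major cell index with coordinates taken modulo the period, plus an offset."""
    return (cy % period) * period + (cx % period) + offset

def gabor_noise(x, y, period=8, offset=1234, impulses_per_cell=4,
                K=1.0, a=1.5, F0=2.0, omega0=math.pi / 4.0):
    cx, cy = math.floor(x), math.floor(y)
    total = 0.0
    # Only the cell containing the point and its immediate neighbors are visited.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ncx, ncy = cx + dx, cy + dy
            # The kernel properties of a cell are regenerated on the fly
            # from a PRNG seeded with the cell index.
            rng = random.Random(cell_seed(ncx, ncy, period, offset))
            for _ in range(impulses_per_cell):
                px, py = ncx + rng.random(), ncy + rng.random()
                weight = rng.uniform(-1.0, 1.0)
                total += weight * gabor_kernel(K, a, F0, omega0, x - px, y - py)
    return total

# The noise repeats with a period of 8 cells in both directions.
print(gabor_noise(0.3, 0.7), gabor_noise(8.3, 0.7))
```

Replacing `cell_seed` is the only change needed to switch between periodic and other cell enumerations, which is the hook the Wang tile modification relies on.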

Worley [1996] introduced a new texture basis function to complement procedural
noise functions (Figure 3.3(e)). This function can be used to produce various textures
like cobblestones, water and cellular structures. The algorithm randomly distributes
feature points in space. The function $F_n$ is defined as the distance between the input
point and the n-th closest feature point. The resulting value is determined by a
function of $F_1, \ldots, F_n$.
Although the cellular texture basis function is not a procedural noise function in
the same sense as the other functions discussed in this chapter, it is very relevant to
texture synthesis. Lagae et al. [2010] observed that its implementation is similar to that
of sparse convolution noise and, by extension, of Gabor noise. Both subdivide the space
into a uniform grid, and a random number of positions are generated for each cell. The
positions become the locations of feature points for Worley noise and the centers of
Gabor kernels for Gabor noise.
A uniform grid is very tightly linked to the integer lattice: each grid cell can be
identified by a lattice point in the corner of the cell. This means that for our purposes,
the terms lattice point and grid cell are interchangeable. Because of that, we can
directly use the TransformCoordWideBorder function from Listing 3.3 for both Gabor
noise and Worley noise. kPointsPerRow and kPointsPerTile represent the number of
cells per row and the number of cells per tile, respectively. The only modification we
need to make is to set the border width to one cell to cover a 3 × 3 area:

const uint2 kBorder00 = uint2(1, 1);

The resulting noises are shown in Figures 3.6(a) and 3.6(b).
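A compact Python sketch of this idea for Worley noise is given below. It is an illustration under assumptions: the cell remapping is a port of Listing 3.3 with a one-cell border, each cell holds a fixed number of feature points, only $F_1$ is computed from the 3 × 3 neighborhood, and the seeding scheme and function names are hypothetical.

```python
import math
import random

CELLS_PER_ROW = 64
CELLS_PER_TILE = CELLS_PER_ROW // 4
BORDER = 1  # kBorder00 = uint2(1, 1)

def transform_cell(cx, cy):
    """Port of TransformCoordWideBorder applied to grid cell coordinates."""
    lx, ly = cx % CELLS_PER_ROW, cy % CELLS_PER_ROW
    in_x, in_y = lx % CELLS_PER_TILE, ly % CELLS_PER_TILE
    tile_x, tile_y = lx // CELLS_PER_TILE, ly // CELLS_PER_TILE
    high = CELLS_PER_TILE - BORDER
    is_x0, is_x1 = in_x <= BORDER, in_x >= high
    is_y0, is_y1 = in_y <= BORDER, in_y >= high
    is_x_border, is_y_border = is_x0 or is_x1, is_y0 or is_y1
    if not (is_x_border or is_y_border):
        return lx, ly
    is_corner = is_x_border and is_y_border
    left, bottom = (tile_x // 2) & 1, (tile_y // 2) & 1
    right = int(left != (tile_x & 1))
    top = int(bottom == (tile_y & 1))
    off_x = 0 if (is_corner or not is_x_border) else (left if is_x0 else right)
    off_y = 0 if (is_corner or not is_y_border) else (top if is_y1 else bottom)
    return off_x * CELLS_PER_TILE + in_x, off_y * CELLS_PER_TILE + in_y

def feature_points(cx, cy, count=2):
    """Feature points of a cell; the PRNG is seeded with the remapped cell."""
    mx, my = transform_cell(cx, cy)
    rng = random.Random(my * CELLS_PER_ROW + mx)
    return [(cx + rng.random(), cy + rng.random()) for _ in range(count)]

def worley_f1(x, y):
    """Distance to the closest feature point in the 3 x 3 cell neighborhood."""
    cx, cy = math.floor(x), math.floor(y)
    best = float("inf")
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            for px, py in feature_points(cx + dx, cy + dy):
                best = min(best, math.hypot(x - px, y - py))
    return best

K = CELLS_PER_TILE
# The values along the first two column boundaries of the layout coincide.
print(worley_f1(0, 7.5), worley_f1(K, 7.5 + K))
```

Only the seed depends on the remapped cell; the feature point positions are still expressed relative to the actual cell, so the function stays continuous across tile borders.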

3.4 Tiled Noise Filtering


There are three methods that are commonly used to map a noise to a surface. Explicit
parameterization is most commonly used with regular textures. It allows fine-tuning,
but requires additional memory and can introduce distortions and seams. Solid noise
is a 3D noise function that is sampled at each surface point. The objects textured using
solid noise appear carved out of a block of matter. The memory cost remains low if the
noise is not precomputed. Surface noise [Lagae et al. 2010] is defined directly on the
surface, giving the appearance of the noise features following the curvature. This is
difficult to achieve in general, but sparse convolution noises (such as Gabor noise) en-
able this approach.
Our methods work with explicit surface parameterization and with solid noise,
regardless of whether the noise is precomputed or evaluated at application runtime.
Surface noise obtained from sparse convolution noises is limited to runtime
evaluation only. It is constructed from Gabor noise by projecting a three-dimensional
Poisson distribution on the plane tangent to the surface, and evaluating two-dimensional
Gabor noise in that plane (see Lagae et al. [2009]). There is no grid defined in the
tangent plane for the final noise evaluation, which renders our methods not applicable
to surface noise.

Figure 3.6. Noise function Wang tile texture maps. (a) Anisotropic Gabor noise. (b) Worley
noise. (c) Worley noise from (b) with separated tiles. Note that the corresponding edges of the
tiles marked with the same color have matching borders.

When the noise is mapped to surfaces, special care must be taken to avoid aliasing
on distant objects and on surfaces seen at an angle. While the precomputed noise can
benefit from the existing texture filtering methods, noise functions that are evaluated
at application runtime usually rely on the properties of the noise to reduce aliasing.
Anisotropic filtering is a commonly used technique that increases the image qual-
ity on surfaces viewed at an oblique angle. It takes several samples in an elliptic area,
producing a more accurate approximation of the pixel projection to the screen. When
the filter area crosses the tile borders, it can produce incorrect values. All the discussed
noises except the ones that use the TransformCoord function (Listing 3.2) exhibit
such artifacts only when the anisotropic filter is wider than the width of the tile border.
Perlin noise and other noise functions that require just one row of lattice points for the
tile border always have this error present. The error is nearly invisible due to interpo-
lation between the lattice points, so some games can completely neglect this. Others
may choose to increase the width of the border region to remove it completely.

Using mipmaps is one of the most common ways of filtering textures on distant
surfaces. When dealing with textures containing tiles, one would need to take into ac-
count the way the hardware performs the sampling while downscaling the images in
order to minimize the discontinuities at the tile boundaries between two adjacent levels
of detail.
Our experiments show that anisotropic filtering hides the discontinuities intro-
duced by general-purpose downscaling algorithms used to construct the mipmaps, even
in high-contrast images. In our opinion, using a combination of anisotropic filtering
and mipmaps produces the best visual results (see Figure 3.7).
The proposed methods do not modify the underlying noise function. It follows that
the procedural noise functions that use them can be filtered with the same techniques
that are applicable to the original noise functions. See Lagae et al. [2010] for details
on these techniques.

Figure 3.7. Effect of texture filtering on the tile boundaries. (a) Point filtering, no mipmaps. (b)
Trilinear filtering only. Note the visible tile borders. (c) Anisotropic filtering and trilinear
filtering. Tile borders can still be distinguished, but are much less pronounced.

3.5 Tiling Improvements

3.5.1 Tile Packing


Wei [2004] introduced a scheme to pack the Wang tiles into a single texture. The
solution is generic for K edge colors, minimizes the storage requirements by ensuring
that each tile is used only once, avoids filtering artifacts by arranging the tiles in the
final texture using a valid Wang tiling and is at the same time efficient to compute. A
piecewise function $\mathrm{TileIndex1D}$ is applied separately to the vertical and horizontal
edges of the tile, and gives the vertical and horizontal position of the tile respectively:

$$
\mathrm{TileIndex1D}(e_1, e_2) =
\begin{cases}
0, & e_1 = e_2 = 0; \\
e_1^2 + 2 e_2 - 1, & e_1 > e_2 > 0; \\
2 e_1 + e_2^2, & e_2 > e_1 \ge 0; \\
(e_1 + 1)^2 - 2, & e_1 = e_2 > 0; \\
(e_1 + 1)^2 - 1, & e_1 > e_2 = 0.
\end{cases}
\tag{3.4}
$$

Most applications, however, limit the usage of Wang tiles to just two edge colors,
thus reducing the texture size. We propose a different packing function that can be used
specifically for two edge colors, ensures the same tile layout and is more efficient to
compute than $\mathrm{TileIndex1D}(e_1, e_2)$:

$$
\begin{aligned}
\mathrm{TileIndex1D}(e_1, e_2) &= 2 e_1 + (e_1 \oplus e_2) \\
&= 2 e_1 + |e_2 - e_1| \\
&= 2 \max(e_1, e_2) + e_1 - e_2.
\end{aligned}
\tag{3.5}
$$

We provide three equivalent forms of the same function to account for possible
performance differences in the target hardware.
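The packing functions are easy to check exhaustively. The Python sketch below, an illustration with hypothetical function names, implements the piecewise function (3.4) and the three two-color forms of (3.5), and prints the edge color pair stored at each position of a row of the packed texture.

```python
def tile_index_1d(e1, e2):
    """Piecewise packing function (3.4) for an arbitrary number of edge colors."""
    if e1 == e2 == 0:
        return 0
    if e1 > e2 > 0:
        return e1 * e1 + 2 * e2 - 1
    if e2 > e1 >= 0:
        return 2 * e1 + e2 * e2
    if e1 == e2 > 0:
        return (e1 + 1) ** 2 - 2
    return (e1 + 1) ** 2 - 1  # remaining case: e1 > e2 == 0

# The three equivalent two-color forms of (3.5).
def tile_index_xor(e1, e2):
    return 2 * e1 + (e1 ^ e2)

def tile_index_abs(e1, e2):
    return 2 * e1 + abs(e2 - e1)

def tile_index_max(e1, e2):
    return 2 * max(e1, e2) + e1 - e2

def packed_row(num_colors):
    """Edge color pairs (e1, e2) ordered by their position in the packed texture."""
    pairs = [(e1, e2) for e1 in range(num_colors) for e2 in range(num_colors)]
    return sorted(pairs, key=lambda p: tile_index_1d(*p))

print(packed_row(2))  # [(0, 0), (0, 1), (1, 1), (1, 0)]
print(packed_row(3))
```

In each printed row the second color of an entry equals the first color of the next entry, with wrap-around at the end, which is what makes the packed texture itself a valid tiling.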

3.5.2 Edge Color Hashing and Repetition


Wei [2004] proposed the use of toroidal boundary handling to compute the tile index
from the input texture coordinate. The tile coordinates are then used as inputs to a hash
function based on a small permutation table, which determines the edge colors of the
resulting tile.
The permutation table size, however, either grows linearly together with the size
of the output texture, or causes repetition of the tile pattern. While the amount of
memory required for the permutation table is very modest by modern standards, it has
to be kept in very fast memory to provide efficient access. Our experiments show that
usage of the same hash function as Olano [2005] proposed for lattice point coordinate
hashing produces acceptable results and is very fast to evaluate. The hash value is cal-
culated by offsetting the input by a fixed number K, and applying the PRNG to the
result twice [Blum et al. 1986]. Because modern GPUs have full support for integer
operations, it is no longer necessary to constrain the PRNG parameters to avoid preci-
sion loss in floating-point numbers.
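The lookup described above can be sketched in Python as follows. Everything specific in this sketch is an assumption made for illustration: the modulus and the offset of the squaring PRNG, the way the two tile coordinates are combined into one hash, and the way the horizontal edges are decorrelated from the vertical ones. Only the overall structure (offset the input, apply the PRNG twice, derive the edge colors, pack them with the two-color function (3.5)) follows the text.

```python
import math

BBS_MODULUS = 1019 * 1031  # illustrative; both primes are congruent to 3 (mod 4)
HASH_OFFSET = 12345        # the fixed offset; the value is chosen arbitrarily

def prng(v):
    # One step of a squaring PRNG in the style of Blum et al. [1986].
    return (v * v) % BBS_MODULUS

def hash_1d(v):
    # Offset the input, then apply the PRNG twice.
    return prng(prng(v + HASH_OFFSET))

def hash_2d(x, y):
    # Assumed combination of two coordinates by successive hashing.
    return hash_1d(hash_1d(x) + y)

def vertical_edge_color(x, y):
    """Color of the edge shared by tiles (x - 1, y) and (x, y)."""
    return hash_2d(x, y) & 1

def horizontal_edge_color(x, y):
    """Color of the edge shared by tiles (x, y - 1) and (x, y)."""
    return hash_2d(y + 4096, x) & 1  # arbitrary shift to decorrelate the two sets

def tile_index(e1, e2):
    return 2 * e1 + (e1 ^ e2)  # two-color packing function (3.5)

def tile_edges(tx, ty):
    left = vertical_edge_color(tx, ty)
    right = vertical_edge_color(tx + 1, ty)
    bottom = horizontal_edge_color(tx, ty)
    top = horizontal_edge_color(tx, ty + 1)
    return left, right, bottom, top

def atlas_uv(u, v):
    """Map a coordinate given in tile units to the 4 x 4 packed texture."""
    tx, ty = math.floor(u), math.floor(v)
    left, right, bottom, top = tile_edges(tx, ty)
    column, row = tile_index(left, right), tile_index(bottom, top)
    return ((column + (u - tx)) / 4.0, (row + (v - ty)) / 4.0)

print(atlas_uv(10.5, 3.25))
```

Because each shared edge is colored by a single hash evaluation, adjacent tiles agree on it by construction, and no permutation table has to be stored.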
Applications that require higher periods of tile repetition and can afford a more
expensive hash function would benefit from using cryptographic hashes. They offer
statistical randomness of the result even if evaluated in parallel (as opposed to many
classical PRNGs), and do not depend on the evaluation order. An example of such a
hash function is the implementation of an MD5 hash¹ for GPUs by Tzeng and Wei
[2008].
Figure 3.8 shows the tile patterns produced by these tiling methods and the sto-
chastic tiling algorithm proposed by Cohen et al. [2003].

Figure 3.8. Tile patterns produced by different tiling algorithms. Each image contains 128 × 64
color-coded Wang tiles. (a) Stochastic. (b) Permutation table, 32 entries. (c) Permutation table,
64 entries. (d) Blum Blum Shub hash. (e) MD5 hash.

3.6 Results
Perlin [1985] showed that his noise function can be used as a building block for many
realistic-looking textures. Similar results can be achieved with the noise functions
modified as proposed in this chapter (see Figure 3.9).
Additionally, we observe that applying a function F to a Wang tile texture yields a
Wang tile texture with the same layout, provided that F depends only on the pixel color,
or that its dependence on the pixel position itself forms a Wang tile texture with the
same layout. This class of functions includes, for example, linear combinations of Wang
tile textures and periodic functions whose period divides the Wang tile size, so that
they repeat identically on every tile. We provide a Unity project with our example
implementations of the shaders producing the textures in Figure 3.9 in the accompanying
code samples.

3.7 Performance
We conducted a series of CPU and GPU tests to analyze the impact of the proposed
modifications on the performance. All tests were performed on a computer with an
Intel Core i7-6820HK CPU (2.7 GHz) and a GeForce GTX 980M (8GB) GPU. The
screen resolution was set to 1920 × 1080. The GPU driver settings were left in the
default state, allowing the application to control most parameters. VSync was turned off.
¹ There are alternatives to MD5 which provide a better balance between hash complexity and
tile pattern repetition.

Figure 3.9. Some interesting Wang tile texture maps produced using the presented techniques
(left) and the corresponding tilings (right). Top to bottom: marble, detailed fabric, brick wall,
cobblestones, stained glass.

The test code was written in C++ using Visual Studio 2015. The compiler and
linker settings were left in the default state for a console application in release
configuration. The code was written without usage of SIMD CPU extensions or
multithreading. The application used OpenGL 4.5 as a graphics API.
We measured the average and the median execution time for all the variants of the
noise functions being tested. The time values provided are in milliseconds. The last
two columns of the following tables present the percentage difference in sampling per-
formance between the original noise functions and the noise functions using the pro-
posed modifications.

3.7.1 CPU Performance


We conducted a CPU performance comparison between the original noise functions
and the noise functions using the proposed modifications by computing a 2048 × 2048
texture. The texture contents were updated 65 times. The first iteration served as a
warm-up and was excluded from the calculation of average and median duration values.
Lattice-based procedural noise functions used a lattice with 257 × 257 points. All
tables required by the functions contained 256 entries. The Perlin noise implementation
used quintic interpolation.
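The measurement procedure can be summarized by the following Python sketch. It only illustrates the protocol described above (65 runs, the first one discarded as a warm-up, average and median over the rest); the actual tests were written in C++, and the `benchmark` helper and its workload are hypothetical.

```python
import statistics
import time

def benchmark(fn, iterations=65, warmup=1):
    """Run fn repeatedly and return (average ms, median ms, sample count)."""
    durations_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    measured = durations_ms[warmup:]  # the warm-up iterations are excluded
    return statistics.mean(measured), statistics.median(measured), len(measured)

# Hypothetical workload standing in for filling a noise texture.
average_ms, median_ms, samples = benchmark(lambda: sum(i * i for i in range(10000)))
print(samples)
```

Reporting the median next to the average makes isolated slow iterations visible, since they shift the average but barely affect the median.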

| Noise function | Non-tileable avg, ms | Non-tileable med, ms | Wang tiles avg, ms | Wang tiles med, ms | Difference avg, % | Difference med, % |
|---|---|---|---|---|---|---|
| Perlin, def. | 31.77 | 31.71 | 60.97 | 61.01 | 91.9 | 92.4 |
| Perlin, opt. | 24.74 | 24.64 | 24.89 | 24.85 | 0.61 | 0.85 |
| Better grad., def. | 194.20 | 194.25 | 964.24 | 965.63 | 397 | 397 |
| Better grad., opt. | 207.52 | 207.55 | 207.73 | 208.27 | 0.10 | 0.35 |
| Gabor, 0 kernels | 2,613 | 2,627 | 3,672 | 3,664 | 40.5 | 39.5 |
| Gabor, 8 kernels | 18,955 | 18,850 | 20,306 | 20,308 | 7.13 | 7.73 |
| Gabor, 16 kernels | 34,863 | 34,857 | 36,222 | 36,243 | 3.9 | 3.98 |
| Gabor, 24 kernels | 51,233 | 51,233 | 52,783 | 52,395 | 3.03 | 2.27 |
| Gabor, 32 kernels | 63,740 | 63,515 | 64,018 | 64,027 | 0.44 | 0.81 |
| Gabor, 40 kernels | 67,184 | 67,399 | 67,262 | 67,384 | 0.12 | −0.02 |

Table 3.1. Precomputed Noise Function Performance on the CPU.
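The difference columns are consistent with the relative increase of the Wang tile timing over the non-tileable timing. The short Python sketch below reproduces two entries of Table 3.1 under that assumption; the helper name `percent_difference` is hypothetical.

```python
def percent_difference(original_ms, modified_ms):
    """Relative increase of the modified timing over the original, in percent."""
    return (modified_ms - original_ms) / original_ms * 100.0

# Perlin noise, default implementation, average timings from Table 3.1.
print(round(percent_difference(31.77, 60.97), 1))  # 91.9
# Gabor noise with 0 kernels per cell, average timings from Table 3.1.
print(round(percent_difference(2613, 3672), 1))    # 40.5
```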

We chose to use an anisotropic version of Gabor noise for the performance com-
parison. All the parameters of the Gabor kernel were fixed. We measured the impact
of the proposed modifications on the Gabor noise by introducing a value to cap the
number of kernels per cell (we set the noise parameters to have 32 kernels as the mean
value) and comparing the performance with the cap set to 0, 8, 16, 24, 32 and 40
kernels.
For the CPU performance test we provided two algorithms that compute the values
of the Perlin noise and of the better gradient noise. The default implementation samples
the function as usual, both the non-tileable and the modified versions. An optimized
implementation first precomputes the lattice gradients into a large lookup table, and
then uses them to efficiently calculate the values of the final function.
The default implementations of both lattice-based noise functions that produce
Wang tiles are considerably slower than the original functions: roughly two times
slower for Perlin noise and five times slower for better gradient noise (see Table 3.1).
Still, the amount of time it takes to sample the functions once per pixel in a
2048 × 2048 texture is relatively small, indicating that they are very cheap to evaluate.
The optimized implementations require additional memory, but bring the performance
of the modified functions to within 1% of the original ones. The optimized Perlin noise
implementation is also faster than the default one by more than 20%.
The evaluation of the anisotropic Gabor noise with zero Gabor kernels per cell
closely estimates the performance cost of using the modified noise function. It is
around 40% higher than that of the original function. The evaluation of additional Ga-
bor kernels quickly reduces the weight of the boundary cell remapping to values close
to zero in the overall function sampling duration.

3.7.2 GPU Performance, Precomputed Textures


We conducted a comparison between sampling a regular texture and sampling a Wang
tile texture map on the GPU by rendering meshes that consist of two triangles covering
the whole screen. The texture being sampled contained a full mip chain. We tested two
texture resolutions, 1024 × 1024 and 2048 × 2048. We rendered either a single mesh
covering the whole screen or 16 such meshes located at the same depth in order to
simulate overdraw.
Sampling a Wang tile texture once per pixel carries very little overhead: the
performance changes by less than 2%. Sampling a texture 16 times per pixel makes the
difference more noticeable, between about 19% and 25% (see Table 3.2).
Our tests showed that enabling trilinear and anisotropic filtering did not affect the
performance.

| Texture w × h @ samples | No tiling avg, ms | No tiling med, ms | Wang tiling avg, ms | Wang tiling med, ms | Difference avg, % | Difference med, % |
|---|---|---|---|---|---|---|
| 1K × 1K @ 1 | 0.992 | 0.526 | 0.984 | 0.525 | −0.81 | −0.19 |
| 1K × 1K @ 16 | 5.567 | 4.159 | 6.655 | 4.970 | 19.54 | 19.50 |
| 2K × 2K @ 1 | 1.434 | 0.784 | 1.461 | 0.799 | 1.88 | 1.91 |
| 2K × 2K @ 16 | 5.346 | 4.227 | 6.666 | 5.020 | 24.69 | 18.76 |

Table 3.2. Precomputed Noise Function Performance on the GPU.

3.7.3 GPU Performance, Direct Noise Evaluation


Finally, we conducted a performance comparison between the original and the
modified versions of the noise functions, when the evaluation happens directly on the GPU
at application runtime. The cost of sampling a modified noise function can be up to
about 12% higher than the cost of sampling the original function (see Table 3.3).

| Noise function @ samples | No tiling avg, ms | No tiling med, ms | Wang tiling avg, ms | Wang tiling med, ms | Difference avg, % | Difference med, % |
|---|---|---|---|---|---|---|
| Perlin @ 1 | 1.25 | 0.67 | 1.29 | 0.69 | 2.63 | 2.37 |
| Perlin @ 16 | 6.87 | 5.65 | 6.83 | 6.29 | −0.54 | 11.35 |
| Better grad. @ 1 | 2.58 | 1.5 | 2.87 | 1.69 | 11.49 | 12.48 |
| Better grad. @ 16 | 20.97 | 20.97 | 22.35 | 22.35 | 6.60 | 6.58 |
| Gabor, 16 kern. @ 1 | 11.05 | 10.90 | 12.23 | 12.14 | 10.68 | 11.35 |
| Gabor, 16 kern. @ 16 | 175.3 | 173.2 | 191.7 | 191.6 | 9.36 | 10.62 |
| Gabor, 32 kern. @ 1 | 21.21 | 21.12 | 23.26 | 23.29 | 9.67 | 10.29 |
| Gabor, 32 kern. @ 16 | 340.0 | 340.0 | 370.4 | 370.3 | 8.92 | 8.92 |
| Gabor, 64 kern. @ 1 | 42.01 | 42.05 | 45.00 | 44.88 | 7.12 | 6.75 |
| Gabor, 64 kern. @ 16 | 671.6 | 671.5 | 719.0 | 719.1 | 7.06 | 7.09 |

Table 3.3. Noise Function Evaluation Performance on the GPU.

Overall, the results of the performance comparison tests indicate that the modifi-
cations presented in this chapter can be adopted by many applications without having
a major effect on the frame rate. Both precomputation and direct runtime evaluation of
the modified noise functions are only slightly slower than the original. Sampling a pre-
computed Wang tile texture is roughly equivalent in performance to evaluating Perlin
noise with simple tiling in the shader, and is at least several times faster than evaluating
noises that offer higher quality.

3.8 Limitations
As mentioned in Section 3.4, Wang tile textures require special downscaling algo-
rithms in order to minimize errors on tile borders between adjacent mip levels. The
same applies to block texture compression methods. Anisotropic filtering, however,
visually hides the discontinuities arising from texture compression as well.
Using a Wang tile texture within a texture atlas presents challenges very similar to
those of using texture atlases with wrapping modes other than clamping. A detailed
discussion is available in [Nvidia 2004]. If the Wang tile textures are of the same size,
texture arrays can be utilized on the hardware that supports them.
Sampling a Wang tile texture requires using a version of the texture sampling func-
tion with explicitly provided gradients. Older or low-power hardware used in some
mobile devices may lack support for gradient computation. Additionally, some old
GPUs in mobile devices undergo a performance penalty when the texture coordinates
used to sample a texture are modified in the fragment shader. This can be mitigated in
some cases by adjusting the mesh UV coordinates in a way that would enforce all tile
corners to correspond to mesh vertices, and by moving the texture coordinate calcula-
tion to the vertex shader.
The presented algorithms do not solve certain inherent problems of Wang tiles.
For example, when the resulting tiles have large-scale distinct features, the tiling pat-
tern becomes quite obvious.
Hash-based evaluation of tile edge colors requires a complete set of Wang tiles to
operate. The number of tiles in the set grows very quickly when adding an edge color
or increasing the number of dimensions. A full 2D set contains N^4 tiles, where N is
the number of colors. This significantly increases the memory requirements for the
precomputed textures. A tile index texture map can be employed to reduce the memory
occupied by the tile set by including only those tiles that are actually used. This, how-
ever, introduces additional complexity to the tile packing step, and adds a texture read
to the shader.
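
To make the growth of the tile set concrete, the following is a minimal sketch that counts the tiles in a complete set. It assumes that a tile in d dimensions has 2d colored faces, which reproduces the N^4 count stated above for 2D; the function name `completeTileSetSize` is hypothetical and not part of the presented implementation.

```cpp
#include <cstdint>

// Number of tiles in a complete Wang tile set with `colors` edge colors.
// Assumption: a tile in `dimensions` dimensions has 2 * dimensions colored
// faces (4 edges in 2D), so the count is colors^(2 * dimensions).
std::uint64_t completeTileSetSize(std::uint64_t colors, unsigned dimensions) {
    std::uint64_t count = 1;
    for (unsigned face = 0; face < 2 * dimensions; ++face) {
        count *= colors;
    }
    return count;
}
```

Under this assumption, two colors in 2D give 16 tiles, three colors in 2D give 81 tiles, and two colors in 3D give 64 tiles, which illustrates why the memory footprint of precomputed sets rises so quickly.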

3.9 Conclusion
We presented a set of modifications to several popular procedural noise functions that
directly produce texture maps containing the smallest complete set of Wang tiles. The
proposed modifications can be used both at application runtime and during the
preprocessing steps and can be generalized to higher dimensions and Wang tile sets with
more edge colors.

Figure 3.10. A comparison between simple tiling of a 1024 × 1024 texture (top) and Wang tiling
of a 512 × 512 Wang tile texture map (bottom). The latter is nonperiodic and requires 4 times
less memory.

The modified noise functions retain most of the key characteristics of the original
functions [Kirillov 2018]. We discussed the effect of using the proposed modifications
on the noise function filtering and the mapping of noise functions to surfaces. Addi-
tionally, we presented several improvements of the tiling algorithm on the GPU for
Wang tiles.
Our modifications enable nonperiodic tiling for a large range of noise-based pro-
cedurally generated textures and effects. The presented techniques can be used to pro-
duce large, non-repetitive and detailed terrain geometry, atmospheric effects and
realistic-looking, natural and artificial textures. The option to combine the Wang tile
texture maps while maintaining tile layout further increases the diversity of the possi-
ble results.
The performance tests indicate that the proposed techniques can be adopted even
by high-performance interactive real-time applications like games. The presented
methods offer a potential to decrease the memory consumption of the precomputed
noise-based textures, and enable precomputation for large noise-based textures that
would not fit into the memory budget of an application otherwise, without sacrificing
the final image variance (see Figure 3.10). The Wang tile texture map sampling is also
straightforward to implement. This guarantees smooth integration into both existing
and new applications.

3.10 Future Work


There are several interesting directions for future work. One is to investigate the possi-
bility of defining similar modifications for other noise functions. Adding the function-
ality to handle corner tiles would give an opportunity to reduce the repetition in the
final image even further. Introduction of an algorithm that minimizes the discontinui-
ties on the tile borders between the adjacent levels of detail when downscaling the
Wang tile textures would be beneficial for high-quality rendering.

Bibliography
BLUM, L., BLUM, M., AND SHUB, M. 1986. A Simple Unpredictable Pseudo-Random Number
Generator. In SIAM Journal on Computing, 15:2, pp. 364–383. URL: https://doi.org/10.1137/
0215025.
COHEN, M., SHADE, J., HILLER, S., AND DEUSSEN, O. 2003. Wang Tiles for Image and Texture
Generation. In ACM Trans. Graph. 22:3, pp. 287–294. URL: http://doi.acm.org/10.1145/882262.882265.
EFROS, A. AND FREEMAN, W. 2001. Image Quilting for Texture Synthesis and Transfer. In Pro-
ceedings of ACM SIGGRAPH ‘01, pp. 341–346. URL: http://doi.acm.org/10.1145/383259.
383296.
JEANDEL, E. AND RAO, M. 2015. An aperiodic set of 11 Wang tiles. URL: https://arxiv.org/pdf/
1506.06492.pdf.
KENSLER, A., KNOLL, A., AND SHIRLEY, P. 2008. Better Gradient Noise. SCI Institute Technical
Report No. UUSCI-2008-001. URL: https://www.cs.utah.edu/aek/research/noise.pdf.
KIRILLOV, A. 2018. Non-periodic Tiling of Procedural Noise Functions. In Proc. ACM Comput.
Graph. Interact. Tech. 1:2, Article 32. URL: https://doi.org/10.1145/3233306.
LAGAE, A. AND DUTRÉ, P. 2006. An Alternative for Wang Tiles: Colored Edges Versus Colored
Corners. In ACM Trans. Graph. 25:4, pp. 1442–1459. URL: http://doi.acm.org/10.1145/
1183287.1183296.
LAGAE, A., LEFEBVRE, S., DRETTAKIS, G., AND DUTRÉ, P. 2009. Procedural Noise Using Sparse
Gabor Convolution. In Proceedings of ACM SIGGRAPH ‘09, pp. 54:1–54:10. URL:
http://doi.acm.org/10.1145/1531326.1531360.
LAGAE, A., LEFEBVRE, S., COOK, R., DEROSE, T., DRETTAKIS, G., EBERT, D., LEWIS, J., PERLIN,
K., AND ZWICKER, M. 2010. State of the Art in Procedural Noise Functions. In EG 2010 – State
of the Art Reports. Eurographics Association.
LEWIS, J. 1989. Algorithms for Solid Noise Synthesis. In Proceedings of ACM SIGGRAPH
‘89, pp. 263–270. URL: http://doi.acm.org/10.1145/74334.74360.
NEYRET, F. AND CANI, M. 1999. Pattern-based Texturing Revisited. In Proceedings of ACM
SIGGRAPH ‘99, pp. 235–242. URL: http://dx.doi.org/10.1145/311535.311561.
NVIDIA. 2004. Improve Batching Using Texture Atlases. NVSDK 7.0 Whitepaper. URL:
http://download.nvidia.com/developer/NVTextureSuite/Atlas_Tools/
Texture_Atlas_Whitepaper.pdf.
OLANO, M. 2005. Modified Noise for Evaluation on Graphics Hardware. In Proceedings of the
ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ‘05, pp.
105–110. URL: http://doi.acm.org/10.1145/1071866.1071883.
PERLIN, K. 1985. An Image Synthesizer. In Proceedings of ACM SIGGRAPH ‘85, pp. 287–
296. URL: http://doi.acm.org/10.1145/325165.325247.
PERLIN, K. 2002. Improving Noise. In Proceedings of ACM SIGGRAPH ‘02, pp. 681–682.
URL: http://doi.acm.org/10.1145/566654.566636.
STAM, J. 1997. Aperiodic texture mapping. Tech. rep., R046. European Research Consortium
for Informatics and Mathematics (ERCIM). URL: http://www.dgp.toronto.edu/people/stam/
reality/Research/pdf/R046.pdf.
TZENG, S. AND WEI, L. 2008. Parallel White Noise Generation on a GPU via Cryptographic
Hash. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, I3D
‘08, pp. 79–87. URL: http://doi.acm.org/10.1145/1342250.1342263.
WEI, L. AND LEVOY, M. 2000. Fast Texture Synthesis Using Tree-structured Vector Quantiza-
tion. In Proceedings of ACM SIGGRAPH ‘00, pp. 479–488. URL: http://dx.doi.org/10.1145/
344779.345009.
WEI, L. 2004. Tile-based Texture Mapping on Graphics Hardware. In Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ‘04, pp. 55–63.
URL: http://doi.acm.org/10.1145/1058129.1058138.
WORLEY, S. 1996. A Cellular Texture Basis Function. In Proceedings of ACM SIGGRAPH
‘96, pp. 291–294. URL: http://doi.acm.org/10.1145/237170.237267.
4 Rendering Surgery Simulation with Vulkan
Nicholas Milef, Di Qi, and Suvranu De

4.1 Introduction
While surgical simulation requires much of the same rendering functionality as games,
critical differences necessitate simulation-specific optimizations and engine design de-
cisions that aren’t commonly needed or provided in rendering engines for games. Given
our unique use cases, we take advantage of the explicitness of the Vulkan API (as
compared to OpenGL) to develop a rendering engine for surgical simulation. In this chapter,
we explain how we tailored our rendering engine design around surgery simulation,
including how higher-level design decisions propagate to lower-level usage of Vulkan.
To achieve this goal, our rendering architecture is designed to be flexible, maintainable
and efficient. In surgical use cases, soft tissues are modeled by deformable meshes
which are specially handled by our efficient memory system. We show how perfor-
mance scales with our memory system. Later, we present a case study using our ren-
derer in a virtual cricothyroidotomy (CCT) 3D simulator.

4.2 Overview
Virtual surgery simulators present some unique computational and development chal-
lenges that are less common in other applications such as games. Rendering is partic-
ularly important because the appearance of the simulator must be convincingly realistic
to properly train surgeons for real-life surgery scenarios.
General-purpose game engines often have limited soft-body physics and haptics
support. Platforms such as the Software Framework for Multimodal Interactive Simu-
lation (SoFMIS) [Halic et al. 2011], the Simulation Open Framework Architecture
(SOFA) [SOFA 2018], OpenSurgSim [OpenSurgSim 2017], and the Interactive Medi-
cal Simulation Toolkit (iMSTK) [iMSTK 2018] seek to fill this gap. Our rendering
engine in particular is part of the larger framework of iMSTK. In addition, newer APIs
such as Vulkan [The Khronos Group 2018] provide more capabilities to make better
use of computing resources and allow for more predictable performance compared to
older graphics APIs such as OpenGL.

4.2.1 Current Rendering Challenges for Surgery Simulation


One of the main challenges with surgery simulation is the rendering of difficult mate-
rials such as skin, tissue, and organs. Unlike in games, difficult rendering scenarios
cannot be avoided by artwork or level design. It’s necessary to render intricate details
such as marks on organs and skin while making the rendering look photorealistic so
that users can become familiar with the real surgical procedure.
Another challenge we face is the handling of dynamic meshes to prepare for ren-
dering. Since a major component of useful surgery simulation is the simulation of the
physical deformation of tissue and organs and their interactions with the virtual envi-
ronment, the renderer needs to be able to efficiently and correctly handle the updating
of these geometries. Although soft bodies exist in video games, they often have limited
interactions with other objects (e.g., cloth simulation), so the simulation and rendering
can often be done exclusively on the GPU. However, surgery simulation also relies
on haptic devices, so forces must be sent to the haptic device drivers. GPU computation
isn’t as desirable an option because the computation results must be sent to the haptic
devices as soon as possible and many of the physics computations use serialized
constraint solving, making these algorithms difficult to parallelize for the GPU. In order to
render the deformable meshes, we needed to develop an architecture to quickly update
data.
Finally, although the Vulkan API is more explicit than OpenGL, many API usage
choices depend on the target hardware. In other words, it’s not always clear what ac-
ceptable parameters are to guarantee satisfactory performance. Memory management
is one of the areas where the optimal use cases are particularly ambiguous.
In this chapter, we first present our approach towards achieving realistic rendering
with our rendering architecture. Next, we present methods for handling the computa-
tion and transfer of dynamic meshes. Finally, we share our approach to memory man-
agement for our rendering system.

4.3 Render Pass Architecture


Rendering architectures construct their render passes in a way to minimize rendering
artifacts, reach performance goals, and allow for advanced rendering algorithms. For a
specific application, it is critical to balance these tradeoffs. Additionally, the increased
complexity of these rendering engines necessitates more complex shader code, which
can be difficult for users and engine developers to maintain.

Figure 4.1. The render pass architecture.

4.3.1 Render Pass Stages


Our rendering architecture is divided into individual render passes (Figure 4.1).
Surgical simulation requires the rendering of diverse materials, which some lighting
techniques, such as deferred rendering, have difficulty expressing efficiently since lighting
is decoupled from the geometry rendering. In deferred rendering, the geometry is
rendered to a geometry buffer (G-buffer) that separate lighting passes read to perform
lighting calculations for each fragment on the screen. Classical deferred renderers can
be more efficient for rendering many lights, but surgery simulation generally only
requires a small number of lights (usually fewer than a dozen). While branching can be
used to allow for some material variety [Garawany 2016], it ultimately still limits the
number and types of materials.
In contrast, in forward rendering, each material is evaluated independently, allow-
ing for completely arbitrary material evaluations in each fragment shader for a material.
While we chose forward rendering for this reason, we still incorporated a thinner G-
buffer so that we could later use its data for post-processing such as screen-space sub-
surface scattering.
Each render pass is explained in detail in the following sections.

Shadow Mapping
In some surgery simulation scenarios, many shadow-casting lights are necessary for
helping users judge the depth of their instruments in the virtual environment.
The first pass of each frame includes rendering shadows for each directional light into
shadow maps, which we place into a texture array. Rendering shadow maps into texture
arrays is a common approach [Pettineo 2015] as it allows for the binding of many
shadow maps during a single draw, which is necessary for forward rendering architectures
that evaluate multiple lights in one draw call. We save on material permutations by
reusing the same shadow pipelines for each shadow pass (since the render passes are
compatible). We pass the index of the current shadow map to the shadow
shader using a push constant that accesses an array of light inverse matrices.

Depth Prepass and Ambient Occlusion


Forward rendering is inefficient for scenes with significant overdraw. In order to mini-
mize this, we implemented a depth-prepass as the next pass to minimize the lighting
computation. We follow the depth-prepass with a horizon-based ambient occlusion
(HBAO) pass [Bavoil et al. 2008]. By calculating ambient occlusion (AO) before the
lighting passes, the AO becomes available for use in the lighting pass. We calculate the
HBAO at quarter resolution by first downscaling using a min-depth operator, and then
we upscale the AO using a Gaussian-based bilateral depth-aware filter, as opposed to
just a 2-pass Gaussian filter, to prevent AO bleeding.
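
The two building blocks of this pass can be illustrated on the CPU with a minimal sketch. It is not the engine's shader code: the function names are hypothetical, "quarter resolution" is assumed here to mean halving each dimension, and the Gaussian falloff on depth difference is one common way to make the filter depth-aware.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Downscale a depth image by 2x in each dimension, keeping the minimum
// depth of every 2x2 block. Width and height are assumed to be even.
std::vector<float> downscaleMinDepth(const std::vector<float>& depth,
                                     std::size_t width, std::size_t height) {
    const std::size_t halfWidth = width / 2;
    const std::size_t halfHeight = height / 2;
    std::vector<float> out(halfWidth * halfHeight);
    for (std::size_t y = 0; y < halfHeight; ++y) {
        for (std::size_t x = 0; x < halfWidth; ++x) {
            const std::size_t i = (2 * y) * width + 2 * x;
            out[y * halfWidth + x] =
                std::min(std::min(depth[i], depth[i + 1]),
                         std::min(depth[i + width], depth[i + width + 1]));
        }
    }
    return out;
}

// Weight of one low-resolution AO tap during upscaling: a Gaussian in screen
// distance multiplied by a Gaussian in depth difference. Taps across a depth
// discontinuity receive a small weight, which limits AO bleeding.
float bilateralWeight(float pixelDistance, float depthDifference,
                      float sigmaSpatial, float sigmaDepth) {
    const float s = pixelDistance / sigmaSpatial;
    const float d = depthDifference / sigmaDepth;
    return std::exp(-0.5f * (s * s + d * d));
}
```

With a plain Gaussian filter only the spatial term would be present, so a tap on the far side of a silhouette would contribute as much as one on the same surface.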

Physically-Based Rendering (PBR) Lighting


The lighting passes for both opaque geometry and decals follow. During this pass,
lighting is calculated for both specular and diffuse and written into separate accumula-
tion HDR 16-bit buffers (Table 4.1), similar to what was done in CryEngine [Sousa et
al. 2012]. In the case of the opaque geometry pass, we populate another buffer with four
8-bit color channels for world-space normals and a subsurface scattering constant. The
light types include both classical computer graphics light sources such as directional
and point lights, and global illumination approximation, which in our case, is image-
based lighting (IBL) with split-sum approximation [Karis 2013]. Because we already
calculated the AO, the IBL can be used with both decals and the underlying opaque
geometry. Baked AO textures for rigid objects are read to supplement the screen-space
AO, and the maximum of these two AO values is taken to determine the total AO. For
the lighting pass, we support a roughness/metalness workflow for PBR content.

Buffer Type              R        G        B        A
Diffuse Accumulation     16-bit per channel color
Specular Accumulation    16-bit per channel color
Normal/SSS               x        y        z        SSS strength
Depth                    32-bit depth
Table 4.1. G-buffer layout.
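
As an illustration of the Normal/SSS row, the sketch below packs a unit normal and an SSS strength into four 8-bit channels. The encoding is an assumption, since the text does not specify one: normal components are remapped from [-1, 1] to [0, 1] and the SSS strength is taken to lie in [0, 1]. The names `toUnorm8` and `packNormalSss` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

// Map a value in [lo, hi] to an 8-bit unsigned normalized integer.
std::uint8_t toUnorm8(float value, float lo, float hi) {
    const float t = std::min(std::max((value - lo) / (hi - lo), 0.0f), 1.0f);
    return static_cast<std::uint8_t>(std::lround(t * 255.0f));
}

// Pack a world-space normal (assumed unit length) and an SSS strength
// (assumed to be in [0, 1]) into R, G, B, A as laid out in Table 4.1.
std::array<std::uint8_t, 4> packNormalSss(float nx, float ny, float nz,
                                          float sssStrength) {
    return {toUnorm8(nx, -1.0f, 1.0f), toUnorm8(ny, -1.0f, 1.0f),
            toUnorm8(nz, -1.0f, 1.0f), toUnorm8(sssStrength, 0.0f, 1.0f)};
}
```

Eight bits per component gives coarse normals, which is acceptable here only under the assumption that the buffer feeds decals and post-processing rather than the main lighting.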

Deferred Decals
Deferred decals are a low-cost yet flexible method for adding details to underlying
geometry. The decal pass differs from the opaque rendering in that it doesn’t write to
the normal or subsurface scattering (SSS) buffer but rather reads from it. One of the
benefits of using the underlying SSS buffer is that it allows the decal to blend in with
the surface underneath during the SSS render pass, so any marks on the skin, for
instance, will look more naturally integrated. Since the decals are deferred, they need a
normal, and we chose to use the underlying normal and the underlying SSS constant
(Figure 4.2). Many use cases of the decals, such as bleeding or marking, only make small
changes to the surface normal, so the underlying surface normal can provide an
acceptable approximation.

Figure 4.2. (left) A mesh without a decal, (right) a mesh with a decal affected by subsurface
scattering.

Post Processing
Surgery simulations include organic geometry such as skin and organs, which requires
subsurface scattering (SSS) to display accurately (Figure 4.3). We chose to implement
a screen-space SSS as opposed to a texture-space implementation for several reasons.
First, screen-space avoids overdraw which becomes a problem when inside the body.
Although a depth prepass mitigates this, there’s still the need to perform extra texture
lookups as compared to screen-space methods. Second, the diffusion profile is relatively
similar since most of the materials have blood and/or fat under the surface. Third,
it samples across different draw calls. This becomes particularly important when a
mesh is split into smaller meshes in order to avoid unnecessary physics computations.
For example, if a section of an organ undergoes operation, then it must be deformable,
but the surrounding organ can be rigid. With screen-space SSS, the SSS can sample
from across both meshes, creating a seamless rendering.
After the lighting pass, separable screen-space SSS [Jimenez et al. 2015] is applied
to only the diffuse buffer. We keep a pool of 3 HDR buffers (one for the specular render
target, one for the diffuse, and one free) in order to ping-pong the diffuse buffer during
the two passes. After the SSS, the specular buffer and diffuse buffer are composited
into the free buffer. These buffers are reused for later passes.

Figure 4.3. A rendering of a polyp with subsurface scattering.

Bloom is then calculated in two passes at quarter resolution and then composited
with the previous result. A filmic tone mapping pass [Hable 2010] follows the bloom
pass to map down to a 32-bit sRGB buffer. In the early stages of implementing SSS,
we found that using a non-filmic tone mapper such as Reinhard [Reinhard et al. 2002]
hid the effect of the SSS by desaturating the effect of the diffusion profile.

4.3.2 Uber-Shader Approach and Material System


One of the challenges with forward rendering is the possible combinatorial explosion
of different shader options.
In Vulkan, all draw state is compiled into a single VkPipeline object. We build
materials for each object; each material contains a VkPipeline object and associated
descriptor sets. A single object may have multiple materials (and thus VkPipeline
objects) to account for multiple render passes. For example, a single object could have
a material for shadow map passes and another one for the lighting pass.
Software developers would otherwise need to maintain a large number of possible shader
permutations. This becomes especially
problematic in our framework because we allow arbitrary combinations of textures to
be attached to a material as well as different draw modes such as for debug rendering.
We chose to use an Uber-shader approach to reduce maintenance; we use a large shader
that contains all possible shader code combinations and block out irrelevant parts dur-
ing the shader compilation process. This generally works well for our applications since
our applications are made to be photorealistic, so the shaders generally try to model
lighting physics rather closely. Separate shader code paths can be made available for
different lighting conditions such as debug rendering without introducing branching.
Shaders are compiled as a build step during compilation of the C++ code, so only
SPIR-V binary files are read into the engine at runtime. This creates some problems
with texture resource management, however, since each material can provide a different
number of texture resources. We solve this problem by creating small placeholder
textures, but the lookup of these textures is restricted by specialization constants.
Specialization constants are set during pipeline creation, which allows the driver’s
shader compiler to optimize away shader code that is not needed. Other expensive
operations, such as PBR lighting code, can be optimized away for situations that don’t
require it. One drawback of using specialization is that the pipeline objects can become
incompatible for similar materials. In this case, they cannot be shared across draw calls.
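
The data side of this mechanism can be sketched without the Vulkan headers. In the sketch below, `MapEntry` is a hypothetical stand-in that mirrors the fields of `VkSpecializationMapEntry`, and `MaterialTextures` is an assumed description of which textures a material provides; neither is the engine's actual type. Each flag becomes one 32-bit boolean constant that a shader could use to skip a placeholder texture lookup.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the fields of VkSpecializationMapEntry.
struct MapEntry {
    std::uint32_t constantID;  // matches constant_id in the shader
    std::uint32_t offset;      // byte offset into the data blob
    std::size_t size;          // byte size of the constant
};

// Hypothetical description of the textures a material provides.
struct MaterialTextures {
    bool diffuse;
    bool normal;
    bool roughness;
};

struct SpecializationData {
    std::vector<std::uint32_t> values;  // one 32-bit boolean per constant
    std::vector<MapEntry> entries;
};

// Build the blob and map entries that would be referenced by a
// VkSpecializationInfo when the pipeline for this material is created.
SpecializationData buildSpecialization(const MaterialTextures& textures) {
    const bool flags[] = {textures.diffuse, textures.normal, textures.roughness};
    SpecializationData data;
    std::uint32_t id = 0;
    for (bool flag : flags) {
        const std::uint32_t offset =
            static_cast<std::uint32_t>(id * sizeof(std::uint32_t));
        data.entries.push_back({id, offset, sizeof(std::uint32_t)});
        data.values.push_back(flag ? 1u : 0u);
        ++id;
    }
    return data;
}
```

Two materials that differ in any flag produce different blobs and therefore different pipelines, which is the sharing drawback noted above.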

4.4 Handling Deformable Meshes


Deformable meshes, which we define as meshes that include per-frame vertex and index
updates, require data transfer to the GPU, mesh recomputation, and efficient
memory usage.

4.4.1 Efficient Data Transfer to the GPU


Deformable meshes require frequent and large-scale updates to the whole mesh, and
transferring this to the GPU can be complex due to needed synchronization and slow
if the data is large enough. Unlike OpenGL, Vulkan gives explicit control over the
location of data, but the correct locations (e.g., which meshes should be on the GPU)
are often difficult to assess for a given use case. Furthermore, different hardware ven-
dors even have different recommended memory types that aren’t universally available.
We specifically chose a solution that could work on a wider range of hardware and,
through experimentation, gave acceptable results. We tested on Nvidia’s GTX 1080
GPU. We chose to use a single queue, the same queue used for rendering, to do the
transfer operations required to update the meshes to avoid the need for inter-queue
synchronization. Although some graphics drivers expose transfer-only queues that
could potentially allow for higher bandwidth, this is not universally available across all
major vendors [Willems 2018].
When using a single queue, there are two effective ways to update data for rendering:
1) using a CPU-accessible staging buffer with a mirrored GPU buffer and running
transfer operations or 2) using a CPU-accessible buffer. In comparing these two ap-
proaches, we found that for large meshes (i.e., larger than 100k triangles), the differ-
ences in performance were negligible between the two methods. Even with multiple
render passes for the same mesh data that could cause redundant PCI-E bus usage, the
performance was similar. On the other hand, using system memory (CPU-accessible
buffer) allows us to avoid managing memory barriers to ensure the data is uploaded to
the GPU in time.

Figure 4.4. An example of dynamic mesh updating using double-buffering.

Multi-Buffering
Our renderer includes a pool of command buffers for geometry passes, and each frame,
a command buffer is recycled and rewritten. When we finish writing to a command
buffer, we submit it to the driver and start recording another command buffer. The
submitted command buffer runs asynchronously to our render loop. We needed to avoid
read-write hazards but didn’t want to stall the render loop, so we implemented
multi-buffering for vertex and index buffers (Figure 4.4). We keep as many copies of the
data as there are back buffers in the swap chain; if the application renders with
triple-buffering, then the mesh data also uses triple-buffering. This makes tracking the
region to update simple, as the frame number modulo the number of buffers can just be
passed to the update functions.
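
The region selection described above reduces to a modulo and a multiplication. A minimal sketch follows; `BufferRegion` and `regionForFrame` are hypothetical names, and all regions are assumed to have the same size.

```cpp
#include <cstddef>
#include <cstdint>

// Byte range inside a multi-buffered vertex or index buffer.
struct BufferRegion {
    std::size_t offset;
    std::size_t size;
};

// Select the region the CPU may write this frame. `bufferCount` is assumed
// to equal the number of back buffers in the swap chain.
BufferRegion regionForFrame(std::uint64_t frameNumber,
                            std::uint32_t bufferCount,
                            std::size_t regionSize) {
    const std::size_t slot = static_cast<std::size_t>(frameNumber % bufferCount);
    return {slot * regionSize, regionSize};
}
```

With triple-buffering, frames 0, 1, and 2 write to three distinct regions, and frame 3 reuses the region of frame 0 once the GPU is expected to have finished reading it.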
Our multi-buffering implementation is similar to unsynchronized multi-buffering
in OpenGL with persistent data mapping [Hrabcak and Masserann 2012], but we have
more control over the memory management and synchronization.

4.4.2 Normal Calculations per Frame with Smoothing Groups


Because the deformable meshes can have both topology changes and individual vertex
displacements, the normals and tangents must be able to be recomputed each frame.
Additionally, many of the geometries in surgery simulation tend to have organic
shapes, leading to vertex seams along the edges of the UV maps. These seams cause
lighting discontinuities which make surfaces incorrectly appear to have hard edges.
To handle these problems, on file import, we create a mapping of vertices that
belong to the same smoothing group. When the normals are recomputed, these mappings
are preserved. The final normal for each vertex in the group is calculated from
each neighboring triangle’s normal and each vertex that belongs to the group. The tan-
gents are calculated separately, however, because they are aligned to each vertex’s UV
coordinate, which differs for each vertex in the group.
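
A minimal CPU sketch of the normal recomputation is given below. The data layout is an assumption: `groupOfVertex` maps every vertex to a smoothing group built once at import, and vertices duplicated along a UV seam share a group. Summing unnormalized face normals, which weights each triangle by its area, is also an assumption; the text does not state the weighting.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;
using Triangle = std::array<std::size_t, 3>;

Vec3 sub(const Vec3& a, const Vec3& b) {
    return {a[0] - b[0], a[1] - b[1], a[2] - b[2]};
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

// Recompute one normal per vertex. All vertices of a smoothing group receive
// the same normal, accumulated from every triangle touching the group.
std::vector<Vec3> computeGroupNormals(const std::vector<Vec3>& positions,
                                      const std::vector<Triangle>& triangles,
                                      const std::vector<std::size_t>& groupOfVertex,
                                      std::size_t groupCount) {
    std::vector<Vec3> groupSum(groupCount, Vec3{0.0f, 0.0f, 0.0f});
    for (const Triangle& t : triangles) {
        const Vec3 n = cross(sub(positions[t[1]], positions[t[0]]),
                             sub(positions[t[2]], positions[t[0]]));
        for (std::size_t v : t) {
            Vec3& s = groupSum[groupOfVertex[v]];
            s[0] += n[0];
            s[1] += n[1];
            s[2] += n[2];
        }
    }
    std::vector<Vec3> normals(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i) {
        const Vec3& s = groupSum[groupOfVertex[i]];
        const float len = std::sqrt(s[0] * s[0] + s[1] * s[1] + s[2] * s[2]);
        normals[i] = len > 0.0f ? Vec3{s[0] / len, s[1] / len, s[2] / len}
                                : Vec3{0.0f, 0.0f, 0.0f};
    }
    return normals;
}
```

Because seam vertices share a group, they end up with identical normals even though their UV coordinates, and therefore their tangents, differ.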
One problem with this approach is that the tangents can diverge from the normals
since tangents depend on the UV coordinates which are likely unique to each vertex,
whereas normals can be shared across vertices. This produced shading artifacts that
were highlighted by our BRDFs, but a simple fix was to orthogonalize the tangent-
bitangent-normal basis in the vertex shader through the Gram-Schmidt process.
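
The Gram-Schmidt step can be sketched as follows. This is CPU code for illustration rather than the vertex shader itself, the names are hypothetical, and the bitangent is rebuilt with a cross product without a handedness sign, which a mesh with mirrored UVs would additionally need.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;

float dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

// Assumes a non-zero input vector.
Vec3 normalize(const Vec3& v) {
    const float len = std::sqrt(dot(v, v));
    return {v[0] / len, v[1] / len, v[2] / len};
}

struct TangentFrame {
    Vec3 tangent;
    Vec3 bitangent;
    Vec3 normal;
};

// Remove the component of the tangent that lies along the normal, then
// rebuild the bitangent so that the three vectors are mutually orthogonal.
// Assumes the tangent is not parallel to the normal.
TangentFrame orthogonalize(const Vec3& normal, const Vec3& tangent) {
    const Vec3 n = normalize(normal);
    const float d = dot(n, tangent);
    const Vec3 projected = {tangent[0] - n[0] * d,
                            tangent[1] - n[1] * d,
                            tangent[2] - n[2] * d};
    const Vec3 t = normalize(projected);
    const Vec3 b = cross(n, t);
    return {t, b, n};
}
```

The normal is kept fixed and only the tangent is adjusted, so the smoothed normals shared across a group are left untouched.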

4.5 Memory Management System


Unlike OpenGL, Vulkan gives users explicit control over where they can store data, but
the correct locations are often difficult to assess. In addition, memory backings for
resources such as images and buffers are not automatically allocated. In contrast,
behind the scenes, OpenGL drivers do additional work such as memory defragmentation
and suballocation. OpenGL abstracts the physical locations of all resources (such as
in system RAM or VRAM), which hides differences between different GPUs
and drivers. While this requires less development on the application side, it can
lead to inconsistent performance across platforms. In order to fill this gap in the
Vulkan API, we implemented our own memory allocator.

4.5.1 Custom Memory Allocator


The performance implications of different allocation strategies are not always obvious
and can differ depending on the vendor. Furthermore, certain allocation strategies
can be tailored to specific applications. For instance, in surgery simulation, it’s uncom-
mon for new geometry and resources to be added during the simulation. It is generally
known beforehand what resources are needed for a specific application. This allows us
to avoid implementing performance-sensitive features such as memory defragmenta-
tion.
Our memory manager separates different resources into different memory alloca-
tions (Figure 4.5). For instance, uniform buffers occupy a different memory allocation
than textures. The advantage of this approach is that certain resource types need to
be in certain areas of memory for optimal performance. For instance, staging buffers
need to be in host-visible memory, whereas images (e.g., textures) need to stay on the
GPU, so they reside in device-local memory.
Certain resources such as images, uniform buffers, and storage buffers have alignment
requirements. Keeping them in allocations separate from mesh data allows the mesh data
to be more compact, which is useful since mesh data might need to be transferred each
frame for dynamic objects.
With the exception of uniform buffers, a single VkBuffer occupies the whole un-
derlying memory allocation. Internally, our memory manager uses lightweight abstract
buffer objects that point to regions within the VkBuffer object. With uniform buffers
and VkImages, however, multiple uniform buffers and images can fill a single alloca-
tion. Uniform buffers are treated this way since the minimum uniform buffer size guar-
anteed by the specification is 64 KiB.

Figure 4.5. An overview of the memory manager.

The initial allocation size we chose was 16 MiB, and this allocation is used until
it runs out of space and then a new allocation is made. For images and buffers larger
than the allocation size, the allocation is expanded to account for these, so very large
resources can potentially have their own allocation.
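
The block strategy described above can be sketched with plain bookkeeping in place of real device memory. In this sketch each `Block` stands in for one Vulkan memory allocation, suballocation is a simple linear bump with alignment, and freeing is omitted because resources are assumed to be known up front. The class and member names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kDefaultBlockSize = 16u * 1024u * 1024u;  // 16 MiB

// Bookkeeping for one underlying memory allocation.
struct Block {
    std::size_t capacity;
    std::size_t used;
};

struct Suballocation {
    std::size_t blockIndex;
    std::size_t offset;
};

class LinearPool {
public:
    // Place a resource in the current block if it fits after alignment;
    // otherwise start a new block, enlarged for oversized resources.
    Suballocation allocate(std::size_t size, std::size_t alignment) {
        if (!blocks_.empty()) {
            Block& last = blocks_.back();
            const std::size_t offset = alignUp(last.used, alignment);
            if (offset + size <= last.capacity) {
                last.used = offset + size;
                return {blocks_.size() - 1, offset};
            }
        }
        const std::size_t capacity =
            size > kDefaultBlockSize ? size : kDefaultBlockSize;
        blocks_.push_back({capacity, size});
        return {blocks_.size() - 1, 0};
    }

    std::size_t blockCount() const { return blocks_.size(); }

private:
    static std::size_t alignUp(std::size_t value, std::size_t alignment) {
        return (value + alignment - 1) / alignment * alignment;
    }

    std::vector<Block> blocks_;
};
```

A separate pool per resource type, as in Figure 4.5, would let each pool use the memory type and alignment appropriate to that type.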

Mesh Data
Mesh data is handled differently from the other resource types. Deformable meshes
reside in host visible memory, whereas other meshes (e.g., rigid objects) are in device
local memory through staging. Other game engines such as Source 2 have followed a
similar approach for static/dynamic resources [McDonald 2016].
Deformable meshes also take up more space than rigid meshes, as they are
multi-buffered. When the mesh is initially allocated, this extra buffering space is also
allocated in the same location. For host-visible memory, this works out well since the
transfer to the GPU is done implicitly, avoiding the overhead of calling transfer functions.
In some operations, such as mesh cutting, additional triangles or vertices can be
added to or removed from the mesh. Adding new vertices or triangles would require
more buffer space. Because cutting is a high-frequency operation, we needed a way to
expand geometry repeatedly without allocating new memory and deallocating old
memory, which could be costly and cause memory fragmentation. To solve this problem,
we allow users to specify a load factor that sets a maximum size of the geometry
relative to the original mesh size. We allocate additional space within each buffer
subsection for each frame (Figure 4.6).

Figure 4.6. A buffer layout allowing for per-frame updates.

4.6 Performance and results


To analyze the performance of our memory management system within our ren-
derer, we devised a benchmark that tests the role of different default memory allocation
sizes. A size of zero MB represents a naive approach of creating one memory allocation
per vertex and index buffer. We found two areas that showed improvement based on
our test scenario: application load time (Figure 4.7) and frame time (Figure 4.8). We
tested on a Windows computer with an Intel i7 6850k CPU and an Nvidia GTX 1080.
There are optimal allocation sizes that meet both of these metrics. For example,
larger allocations (such as 256 MiB) have slightly longer load times while they have
comparable runtime performance to smaller allocations (4 MiB). Different default
allocation sizes have been proposed by vendors, such as 8 MB for mesh data and 128 MB
for textures [McDonald 2016] or 256 MiB [Sawicki 2018]. The best choice depends on the
hardware and the application's resource sizes to some extent, but there should be a fairly
large range of acceptable default allocation sizes, as demonstrated by our data. Unless
very large resources are used (e.g., tens of MB per resource), there is a point of
diminishing returns for runtime performance with larger allocation sizes. However, larger
allocation sizes slightly increase load time and waste more memory when they aren't
saturated.
For our tests, we compared rigid mesh data with deformable meshes. Rigid meshes
require two buffers, a staging (CPU) buffer and a device-local (GPU) buffer. Meanwhile,
the deformable meshes require larger buffers to account for multi-buffering, but they
remain on the CPU, resulting in much lower allocation times, particularly as the number
of allocations grows. With the naive allocation strategy, load time degraded more for
rigid meshes than for deformable meshes, which indicates that making many GPU
allocations can be much slower than making CPU allocations. At runtime, the dynamic
meshes require rewriting the data and transferring it to the GPU, which slows down
performance compared to rigid meshes.

Figure 4.7. 10,000 meshes, each made up of 100 lines.

Figure 4.8. 10,000 meshes, each made up of 100 lines.



4.7 Case Study: CCT


Surgeons perform the cricothyroidotomy (CCT) procedure as an emergency procedure
when patients have a restricted airway. The steps involved in the procedure are as
follows:

• Palpating the neck region to identify the locations of the thyroid and cricoid
cartilages, which are the landmark anatomies in this procedure.

• Making an incision along the midline of the neck through the skin and the fat
tissue to uncover the cricothyroid membrane underneath.

• Making an incision along the membrane to open an entrance to the trachea.

• Inserting an endotracheal tube inside the trachea through the new incision.

There are two main problems with the CCT simulation from a rendering perspective:
efficiently updating the large geometric models representing the fat and membrane tis-
sues and rendering the surface of each cut.

4.7.1 Dynamic Mesh Update


The CCT procedure requires at least two incisions to allow for intubation. This is ac-
complished on the CPU side through a mesh cutting algorithm which causes the mesh
to regenerate in order to incorporate topological changes. In addition, the mesh is de-
formed during each physics step, which causes the vertex positions to be displaced and
the normals and tangents to be recalculated. All of this data must be reuploaded to the
GPU each frame.
Initially, we had difficulty with transfer speeds for this use case because we used staging
and GPU buffers to handle all transfers: we wrote to a host-visible buffer and performed
a transfer operation, and we operated on a single queue. We quickly found that rendering
directly from host-visible memory, without the explicit transfer, substantially sped up
performance, so we switched to using host-visible memory.

4.7.2 Surface Cut Rendering


The rendering of the cut mesh introduced a few challenges with our current framework.
First, the outer surface, the skin for example, must be rendered with a different material
than the inner surface, which in our case is the fat tissue. On top of this, the UV
coordinates are generated during runtime as the new surface mesh is recreated.
We wanted to reduce the number of surface meshes to be updated each frame, but
our renderer doesn’t currently support multiple materials per mesh. Both the skin and
the fat tissue use the same shading logic, but they differ in texture sets. In order to
circumvent our single material per mesh limitation, we project the sides of the mesh
onto different regions of a texture atlas.
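The atlas workaround amounts to remapping each surface's local texture coordinates into its own sub-rectangle. The sketch below assumes a hypothetical atlas split into a left half for skin and a right half for fat; the region layout and names are illustrative only.

```python
# Hypothetical atlas regions as (x, y, width, height) in normalized UV space.
SKIN_REGION = (0.0, 0.0, 0.5, 1.0)
FAT_REGION = (0.5, 0.0, 0.5, 1.0)


def atlas_uv(u, v, region):
    """Map a local UV in [0, 1]^2 into the given atlas region."""
    x, y, width, height = region
    return x + u * width, y + v * height
```

With this remapping, both surfaces can share one material and one texture set while still sampling different texels.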

Figure 4.9. The CCT case study demonstrating cutting.

When performing the cut, we needed a visual cue (e.g., bleeding) for the progress
of the cut. We opted to use a pool of blood decals to display the cutting path while it
was being performed by the user. The decals automatically recycle after the pool hits
a certain maximum number so they can be reused for multiple cuts.
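A recycling pool of this kind is essentially a ring buffer. The following sketch shows the idea; the class and its interface are assumptions for illustration, not the renderer's API.

```python
class DecalPool:
    """Fixed-size pool that reuses the oldest decal once it is full."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_slot = 0

    def spawn(self, position):
        """Place a decal, overwriting the oldest one when the pool is full."""
        index = self.next_slot
        self.slots[index] = position
        self.next_slot = (index + 1) % len(self.slots)
        return index
```

Because the pool never grows, the number of decals drawn per frame stays bounded no matter how many cuts the user makes.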

4.8 Conclusion and Future Work


Rendering for surgery simulation is critical for creating a realistic immersive experi-
ence for training surgeons. The Vulkan API provided more flexibility, but also some
challenges with determining the optimal implementation strategies given our use cases.
We were able to create a rendering architecture that reduced the amount of shader
maintenance needed in the future and would give us more accurate visualization. Our
handling of deformable meshes allows us to efficiently render the output of various
CPU-based algorithms. Finally, our memory management system allows us to scale our
applications without worrying about introducing substantial overhead, and it can easily
be extended to support new types of resources.
In the future, we hope to expand these subsystems to further improve performance
and rendering capabilities. One area we would like to explore is using asynchronous
buffer transfers with multiple queues for mesh updates, as this could potentially
decrease the transfer times by using more bandwidth. Another area that could see
performance improvements is the recalculation of normals and tangents, possibly through
compute shaders, as this takes a considerable amount of time each frame depending on
the topology and number of triangles of the mesh. Finally, we would like to expand our
material system to allow for more complex and expressive materials.

Figure 4.10. A rendering of an internal organ using our renderer.

4.9 Source Code


The source code is available as part of iMSTK: https://www.imstk.org.

4.10 Acknowledgments
Research reported in this publication was supported by the National Institute of Biomedical
Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under Award
Number 2R01EB005807, 5R01EB010037, 1R01EB009362, 1R01EB014305; National Heart,
Lung, and Blood Institute (NHLBI) of NIH under Award Number 5R01HL119248; National
Cancer Institute (NCI) of NIH under Award Number 1R01CA197491 and NIH under Award
Number R44OD018334.

Bibliography
BAVOIL, L., SAINZ, M., AND DIMITROV, R. 2008. Image-space horizon-based ambient occlusion.
In ACM SIGGRAPH 2008 talks, p. 22.
GARAWANY, R. 2016. Deferred Lighting in Uncharted 4. In Advances in Real-Time Rendering
in Games: Part I.
HABLE, J. 2010. Uncharted 2: HDR lighting. In Game Developers Conference 2010.
HALIC, T., VENKATA, S., SANKARANARAYANAN, G., LU, Z., AHN, W., AND DE, S. 2011. A soft-
ware framework for multimodal interactive simulations (SoFMIS). In MMVR, pp. 213–217.
HRABCAK, L. AND MASSERANN, A. 2012. Asynchronous Buffer Transfer. CRC Press.
IMSTK. 2018. URL: http://www.imstk.org/.
JIMENEZ, J., ZSOLNAI, K., JARABO, A., FREUDE, C., AUZINGER, T., WU, X., PAHLEN, J., WIMMER,
M., AND GUTIERREZ, D. 2015. Separable Subsurface Scattering. In Computer Graphics Forum,
34:6, pp. 188–197.
KARIS, B. 2013. Real shading in Unreal Engine 4. In Proc. Physically Based Shading Theory
Practice, pp. 621–635.
MCDONALD, J. 2016. High Performance Vulkan: Lessons Learned from Source 2. In GPU
Technology Conference 2016.
OPENSURGSIM. 2017. URL: https://www.sofa-framework.org/.
PETTINEO, M. 2015. Rendering the alternate history in The Order 1886. In Advances in Real-
Time Rendering in Games: Part II.
REINHARD, E., STARK, M., SHIRLEY, P., AND FERWERDA, J. 2002. Photographic tone reproduc-
tion for digital images. In ACM transactions on graphics, 21:3, pp. 267–276.
SAWICKI, A. 2018. Memory Management in Vulkan and DX12. In Game Developers Conference
2018.
SOFA. 2018. URL: https://www.sofa-framework.org/.
SOUSA, T., KASYAN, N., AND SCHULZ, N. 2012. CryENGINE 3: Three Years of Work in Review.
CRC Press.
THE KHRONOS GROUP. 2018. Vulkan 1.1.81 – A Specification.
WILLEMS, S. 2018. GPUInfo. URL: https://www.gpuinfo.org/.
5
I

Skinned Decals
Hawar Doghramachi

5.1 Introduction
Decals that are dynamically added to a scene are a great way to increase interactivity
by allowing the player to change the game environment. An efficient way to do this for
static environments is the use of deferred decals [Krassnigg 2010], [Persson 2011].
However, in general deferred decals fail to deliver convincing results on top of skinned
meshes. A typical use case is shooting at a character and creating wounds at the impact
position of the projectiles. We will present a technique for dynamically adding decals
on top of arbitrary skinned meshes that is compatible with common rendering
architectures and easy to integrate into existing rendering engines.

5.2 Overview
Dynamically added deferred decals work pretty well as long as the target area is rigid,
i.e., only influenced by one bone. In this case it is possible to intersect a ray that repre-
sents the projection direction of the decal with the skinned mesh and record the triangle
and barycentric coordinates of the intersection. With this information, each frame the
position and normal of the intersection can be updated and a deferred decal applied
accordingly. However, this method fails to deliver convincing results when the target
area is influenced by several bones. Typical artifacts in such cases are decals that are
“swimming” on top of the target mesh, not stretching according to the underlying sur-
face, producing distracting overlaps and projecting on top of areas that were initially
not in the influence area of the decals.
A method that doesn't suffer from the aforementioned problems was mentioned
by [Bronx 2011]. The idea is to render the skinned mesh that receives a decal once
when a new decal is created. Instead of rendering the mesh itself into a render target,
it uses the mesh texture coordinates as vertex shader output position. In a pixel shader
the decal texture coordinates are calculated, similarly to deferred decals, and the cor-
responding texels are fetched from the decal texture and output into the mesh texture.

This implies that each mesh instance has its own texture that serves as render target
resource. Finally, when the skinned mesh is rendered in the shading pass, the updated
mesh texture is fetched at the mesh texture coordinates and the resulting decals are
automatically applied. Unfortunately, there are several drawbacks which make this
technique impractical for game production:

• Usually modern games use several textures for shading purposes. In general you
can expect to have a diffuse, normal, specular and roughness map and, in the case of
parallax occlusion mapping and displacement mapping, additionally a height
map. Several texture maps would therefore need to be duplicated per mesh
instance, which can have a large memory impact. Since compressing textures at
runtime is most likely too expensive, textures would need to be stored
uncompressed, which increases the memory requirements and the memory
bandwidth when reading from these textures.

• To avoid severe aliasing issues, mipmapping is required. However, generating
mipmaps for all required mesh textures each time a new decal is added can
have a significant performance impact.

• Once a decal is added, it can't be selectively removed from the mesh textures.

The proposed technique is based on the same base idea as the aforementioned
method, but overcomes several of its drawbacks. The idea is to output decal texture
coordinates instead of decal texture values when a new decal is created and the corre-
sponding skinned mesh is rendered. In this way neither does it matter how many texture
maps are utilized (diffuse, normal, specular, etc.) nor do we have to bother with gener-
ating mipmaps for all of these texture maps. Each mesh instance requires only one
additional texture, called hereafter decal lookup map, which holds the decal texture
coordinates, a fading value and an index for each decal added on top of the skinned
mesh. This decal lookup map is used during shading to fetch and apply the decal tex-
tures. In this way decals correctly stretch according to the underlying surface, don't
overlap under motion and don't project on top of mesh areas that were initially not in
the influence area of the decals.

5.3 Implementation
The presented system can be divided into three steps (Figure 5.1) that will be described
in more detail in the following subsections. All explanations assume a column-major
matrix layout, a right-handed coordinate system, an NDC depth range from 0.0 to 1.0,
and the top-left corner as the texture/screen-space origin. Furthermore, the explanations
assume the use of DirectX 12, but the system could also be implemented with, e.g.,
OpenGL or Vulkan.

Figure 5.1. Overview of the three steps involved in the proposed technique. While the third
step is executed each frame, the first two steps are only executed when a new decal is added.
The skinned model used in this image was authored by KatsBits [2014].

5.3.1 Decal setup


This step doesn't differ from the setup that is required for deferred decals. In this step
we have to construct a matrix that defines the position, orientation and size of a decal,
called hereafter decal matrix. In the example of creating a decal at the impact position
of a weapon projectile, we first need to find the closest intersection on the path of the
projectile. For this we intersect the corresponding ray with the front-facing triangles of
the target skinned mesh. This intersection test has to be performed either on the CPU or
the GPU, depending on where skinning is done and where we have access to the skinned
vertices. On the CPU this is straightforward: after getting the mesh of closest intersection
from the physics system, the corresponding skinned triangles are checked for
intersection. In the case of GPU skinning, the skinned triangles are tested in parallel on
compute shader threads, and the resulting decal matrix is constructed directly on the GPU
and passed via a GPU buffer to the subsequent step. This could be a good use case for
DirectX Raytracing, where the intersection part could be accelerated.
From the position and normal of the closest intersection we create a view matrix,
and from the extents of the decal an orthographic projection matrix. These matrices
are combined with a texture matrix, which maps texture coordinates from the [-1, 1]
range into the [0, 1] range, to form the final decal matrix. Listing 5.1 shows pseudocode
for calculating the decal matrix.
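To see what the decal matrix does to a surface point, the product can be unrolled by hand for the x and y components. The sketch below is an illustration under the conventions stated above, not the chapter's code: it builds the same tangent frame as Listing 5.1 and returns only the resulting decal UV, leaving depth aside. All function names are hypothetical.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def decal_uv(point, hit_position, hit_normal, decal_tangent, width, height):
    """Project a world-space point into decal texture space (u, v only)."""
    neg_normal = tuple(-c for c in hit_normal)
    bitangent = normalize(cross(decal_tangent, neg_normal))
    tangent = normalize(cross(neg_normal, bitangent))
    offset = tuple(p - h for p, h in zip(point, hit_position))
    # View transform (dot with frame axes), orthographic scale (2 / extent),
    # then the texture matrix (x * 0.5 + 0.5, y * -0.5 + 0.5), all folded together.
    u = 0.5 + dot(offset, tangent) / width
    v = 0.5 - dot(offset, bitangent) / height
    return u, v
```

The hit position itself lands at the center of the decal texture, and points half an extent away along the tangent or bitangent land on its edges.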

5.3.2 Update decal lookup map


Each time a new decal is added, the underlying skinned mesh is rendered into the decal
lookup map using the mesh texture coordinates as output position. The decal texture
coordinates are calculated in the same way as for deferred decals by multiplying the
decal matrix with the world space position of the mesh vertices. By setting the output
depth to the z component of the decal texture coordinates, and using custom clip planes
for the x and y components, pixels outside of the decal will be clipped away. Listing 5.2
shows the HLSL code for the corresponding vertex shader.
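The two outputs that drive this pass can be mimicked on the CPU to check the conventions. This is a simplified sketch: on the GPU the clip distances are interpolated and evaluated per fragment, whereas here a single point is tested. The function names are hypothetical.

```python
def lookup_map_vertex(mesh_uv, decal_tc):
    """Return (clip-space position, clip distances) for one vertex."""
    u, v = mesh_uv
    x, y, z = decal_tc
    # Mesh UV (top-left origin) becomes the raster position in NDC.
    position = (u * 2.0 - 1.0, (1.0 - v) * 2.0 - 1.0, z, 1.0)
    # A negative distance means the point lies outside the decal in x or y.
    clip_distances = (x, 1.0 - x, y, 1.0 - y)
    return position, clip_distances


def is_clipped(position, clip_distances):
    """True if the point falls outside the decal volume."""
    depth = position[2]
    return depth < 0.0 or depth > 1.0 or min(clip_distances) < 0.0
```

A point with decal coordinates inside the unit cube survives, while anything outside in x, y, or depth is rejected before the pixel shader runs.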
Since each texel of the decal lookup map can only store information for one decal,
overlapping decals are not supported. Therefore it has to be ensured that each decal
texel in the decal lookup map doesn't get overwritten. For this, the pixel shader of this
step writes its result via an unordered access view (UAV) into the decal lookup
map, which makes it possible to examine the target texel before overwriting it. When the

// decalTangent is a vector perpendicular to the decal projection
// direction and specifies the rotation of a decal on the target surface.
Vector3 viewPosition = hitNormal * decalDepth * 0.5 + hitPosition;

Vector3 bitangent = Normalize(CrossProduct(decalTangent, -hitNormal));
Vector3 tangent = Normalize(CrossProduct(-hitNormal, bitangent));

Matrix4x4 viewMatrix = Inverse(Matrix4x4(tangent, 0,
                                         bitangent, 0,
                                         hitNormal, 0,
                                         viewPosition, 1));

Matrix4x4 projMatrix(2/decalWidth, 0, 0, 0,
                     0, 2/decalHeight, 0, 0,
                     0, 0, -1/decalDepth, 0,
                     0, 0, 0, 1);

Matrix4x4 textureMatrix(0.5, 0, 0, 0,
                        0, -0.5, 0, 0,
                        0, 0, 0.5, 0,
                        0.5, 0.5, 0.5, 1);

Matrix4x4 decalMatrix = textureMatrix * projMatrix * viewMatrix;

Listing 5.1. Pseudo code for calculating decal matrix.



VS_Output main(VS_Input input)
{
    VS_Output output;

    // transform position from world to decal space
    float3 decalTC = mul(constBuffer.decalMatrix,
        float4(input.position, 1.0f)).xyz;
    output.decalTC = decalTC.xy;

    // Write rasterized fragments to output texture using the input
    // mesh texture coordinates and use z component of decal UV to
    // clip decal in z direction.
    output.position = float4(float2(input.texCoords.x, 1.0f
        - input.texCoords.y) * 2.0f - 1.0f, decalTC.z, 1.0f);

    // Clip decal in x and y directions (clipDistances declared
    // with SV_ClipDistance semantic).
    output.clipDistances = float4(decalTC.x, 1.0f - decalTC.x,
        decalTC.y, 1.0f - decalTC.y);

    output.normal = input.normal;
    return output;
}

Listing 5.2. Vertex shader for updating decal lookup map.

target texel already contains information for another valid decal, the newly added decal
texel will be discarded. To check whether the target texel already contains information
for a valid decal, and not for a decal that was itself found to overlap with existing
decals, a small GPU buffer is used, called hereafter decal validity buffer. Each entry
of this buffer corresponds to one decal on the target mesh and is initialized to 0, i.e.,
all decals are valid at the beginning. In case an overlapping decal texel is detected, the
corresponding entry in this buffer is set to 1, so that on the one hand this decal won't
prevent more recent decals from being added into the decal lookup map and on the other
hand such decals won't be rendered in the shading pass. Since we store the index of each
decal in 8 bits of a decal lookup map texel, this buffer needs only 256 entries. The decal
index is combined with the decal texture coordinates and a fading value and stored as
DXGI_FORMAT_R8G8B8A8_UNORM. If typed UAV reads of this texture format are not
supported, we can pack all values into a bitmask, using a 32-bit integer, and perform
manual bilinear filtering later on. The corresponding pixel shader is shown in
Listing 5.3. Since for the proposed system the texture coordinates of the target mesh have
to be unique, i.e., each triangle of the mesh has to map to a different texture area, we
never run into race conditions where different pixel shader threads write into the same
texel location.
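The 32-bit fallback mentioned above can be sketched as follows. The channel order (u, v, fade, index from the lowest byte upward) mirrors the RGBA layout of the typed format but is an assumption here, as are the function names.

```python
def to_unorm8(value):
    """Quantize a value in [0, 1] to 8 bits, as a UNORM format would."""
    return int(min(max(value, 0.0), 1.0) * 255.0 + 0.5)


def pack_texel(u, v, fade, decal_index):
    """Pack decal UV, fading value, and decal index into one 32-bit integer."""
    return (to_unorm8(u)
            | (to_unorm8(v) << 8)
            | (to_unorm8(fade) << 16)
            | ((decal_index & 0xFF) << 24))


def unpack_texel(packed):
    """Inverse of pack_texel; returns (u, v, fade, decal_index)."""
    return ((packed & 0xFF) / 255.0,
            ((packed >> 8) & 0xFF) / 255.0,
            ((packed >> 16) & 0xFF) / 255.0,
            packed >> 24)
```

With this layout the shading pass would have to unpack the four neighboring texels and interpolate the UV and fade values itself, since hardware filtering does not apply to an integer format.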

void main(VS_Output input)
{
    int2 outputPos = int2(input.position.xy);
    uint prevDecalIndex = uint(decalLookupMap.Load(outputPos).w
        * 255.0f + 0.0001f);

    // If target texel already contains decal information (index > 0)
    // that comes from a different, valid decal, current decal is
    // marked in decal validity buffer as invalid and the
    // corresponding texel is discarded.
    if ((prevDecalIndex > 0) && (prevDecalIndex !=
        constBuffer.decalIndex) &&
        (decalValidityBuffer[prevDecalIndex] == 0))
    {
        decalValidityBuffer[constBuffer.decalIndex] = 1;
        return;
    }

    // Fade out fragments based on their angle to the surface normal
    // to avoid projection artifacts; at 60 degrees or more they are
    // fully faded out.
    float cosAngle = dot(constBuffer.normal, normalize(input.normal));
    float angleFade = saturate(cosAngle - 0.5f) * 2.0f;

    // Fade out fragments towards near/far clip plane to avoid
    // clipping artifacts.
    float distFade = abs(0.5f - input.position.z) * 2.0f;
    distFade = 1.0f - pow(distFade, 4.0f);

    // Combine distance- and angle-based fading values.
    float fade = distFade * angleFade;

    // output into DXGI_FORMAT_R8G8B8A8_UNORM
    decalLookupMap[outputPos] = float4(input.decalTC.xy, fade,
        float(constBuffer.decalIndex) / 255.0f);
}

Listing 5.3. Pixel shader for updating decal lookup map.
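The two fading terms in Listing 5.3 are easy to evaluate by hand. The sketch below restates them in Python, assuming the same inputs as the shader: the cosine of the angle between the two normals, and a depth value in the range from 0 to 1.

```python
def saturate(x):
    return min(max(x, 0.0), 1.0)


def angle_fade(cos_angle):
    """1 when the normals agree, falling linearly in cosine to 0 at 60 degrees."""
    return saturate(cos_angle - 0.5) * 2.0


def distance_fade(depth):
    """1 at the middle of the depth range, 0 at the near and far planes."""
    return 1.0 - (abs(0.5 - depth) * 2.0) ** 4
```

The fourth power keeps the distance fade close to 1 over most of the depth range and only drops off sharply near the clip planes; for instance, a quarter of the way in from either plane the fade is still 0.9375.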

5.3.3 Apply skinned decals


With the help of the decal lookup map and decal validity buffer, skinned decals can be
applied in the case of forward rendering during regular shading or in the case of de-
ferred rendering when writing into the G-Buffers. The pixel shader code for this is
shown in Listing 5.4.

// fetch decal texture coordinates and fading value using
// bilinear texture filtering
float3 decalTCFade = decalLookupMap.
    SampleLevel(bilinearSampler, meshTC, 0.0f).xyz;

// fetch decal indices for corresponding 2x2 pixel area
float4 decalIndices = decalLookupMap.
    GatherAlpha(bilinearSampler, meshTC);

// calculate decal UV derivatives
float2 derivX = ddx(decalTCFade.xy);
float2 derivY = ddy(decalTCFade.xy);

// determine if valid decal info can be retrieved for
// processed fragment
uint decalIndex = uint((decalIndices[0] * 255.0f) + 0.0001f);
float valid = ((decalIndex > 0)
    && (decalIndices[0] == decalIndices[1])
    && (decalIndices[0] == decalIndices[2])
    && (decalIndices[0] == decalIndices[3])
    && (decalValidityBuffer[decalIndex] == 0)) ? 1.0f : 0.0f;

// only add decal when all fragments of processed quad have valid
// decal info to ensure correct UV derivatives
[branch]
if ((valid == 1.0f) && (ddx(valid) == 0.0f) && (ddy(valid) == 0.0f))
{
    // fetch decal diffuse texture
    uint diffuseIndex = constBuffer.decals[decalIndex].diffuseIndex;
    float4 decalDiffuse =
        decalDiffuseMaps[NonUniformResourceIndex(diffuseIndex)].
        SampleGrad(trilinearSampler, decalTCFade.xy, derivX, derivY);

    // baseAlphaMask comes from the alpha mask texture of the mesh,
    // which fades out decals in unsuitable mesh areas
    float decalAlphaMask = decalDiffuse.a;
    float decalFade = baseAlphaMask * decalAlphaMask * decalTCFade.z;

    // blend decal diffuse with base diffuse of mesh
    baseDiffuse = lerp(baseDiffuse, decalDiffuse, decalFade);

    // fetch and blend normal, specular, roughness, etc.,
    // textures accordingly
}

Listing 5.4. Pixel shader code for applying skinned decals.



To be able to select the correct mipmap from the decal textures, derivatives of the
decal texture coordinates are calculated. This has to be done outside of the branch
where valid decals are applied, since derivatives require valid values across the entire
2 × 2 processed pixel quad. A decal texel is only considered as valid and added on top
of the target mesh, when all decal indices in a 2 × 2 pixel area of the decal lookup map
are valid and equal. To ensure that decal texture coordinate derivatives for trilinear
filtering are valid as well, we need to check the validity of the entire 2 × 2 processed
pixel quad. For this the derivatives of the validity value are used.
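The per-fragment part of this test can be written out as a small predicate. The sketch below mirrors the gathered-index comparison of Listing 5.4 for a single fragment; it does not model the additional quad-level check done through the derivatives of the validity value, and the function name is hypothetical.

```python
def decal_sample_valid(gathered_alpha, validity_buffer):
    """Check the four gathered index values of one fragment.

    gathered_alpha: four UNORM alpha values from the 2 x 2 texel footprint.
    validity_buffer: 256 entries, 0 for valid decals and 1 for invalid ones.
    """
    first = gathered_alpha[0]
    index = int(first * 255.0 + 0.0001)
    return (index > 0
            and all(alpha == first for alpha in gathered_alpha)
            and validity_buffer[index] == 0)
```

A footprint that straddles the border between two decals, or between a decal and empty space, is rejected, which is why the decal edge cannot be bilinearly filtered.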
The decal index is used to index into a common GPU buffer and retrieve
decal-specific information. In the case of DirectX 12 this information can contain indices
into a common DirectX 12 shader-resource descriptor table to select the corresponding
decal textures. For platforms where textures can't be dynamically indexed, either a
texture array or a texture atlas can be used instead.
Since we can't perform bilinear texture filtering at the border of decals, artifacts in
the form of jaggies will be visible. Such artifacts can be avoided by adding a border that
is a few texels wide and has an alpha value of zero to the alpha mask texture of a decal.
Furthermore, not all areas on a skinned mesh are a good fit for a decal, especially areas
that contain a UV seam or are extremely deformed under skinning, as for example the
area around a character's elbow. Decals on top of such mesh areas can be easily faded
out by an alpha mask texture that is stored alongside the diffuse, normal, etc. mesh
textures. Mesh areas that don't have a unique texture coordinate mapping have to be
masked out as well.

5.4 Pros and Cons


The following lists give an overview of the pros and cons of the proposed system.

Pros
• Supports both forward and deferred rendering systems.
• Supports static, dynamic, skinned and transparent meshes.
• Additional geometry only has to be rendered once when adding a new decal.
• Each mesh instance requires only one additional decal lookup map texture,
regardless of how many textures the mesh uses.
• Textures don't need to be decompressed.
• Texture mipmaps don't need to be generated at runtime.
• Artifacts from wrong mipmap selection at geometry edges, as often visible with
deferred decals, are avoided.
• Supports arbitrary blending of all decal attributes with the underlying mesh.
• Decals can also be used to perform displacement mapping or to cut out holes.
• Decals can be selectively removed.
• Decal texture coordinates can be animated.

Cons
• One decal lookup map and decal validity buffer required per mesh instance.
• Overlapping decals not supported. However, automatically ensuring that decals
don't overlap can also be considered desirable, since it avoids decals stacking up on
each other and causing a performance impact.
• Texture coordinates of the target mesh need to be unique, i.e., each triangle of
the mesh has to map to a different texture area.

5.5 Results
In Figure 5.2, a decal was dynamically added on top of an animated, skinned mesh and
rendered once as a deferred decal and once with the proposed technique.

Figure 5.2. The images show a decal dynamically added on top of an animated, skinned model,
authored by KatsBits [2014]. The images on the left side show the animation frame where the
decal was added; the images on the right side show a pose a couple of animation frames later.
The images on the top were rendered with the proposed technique, the images on the bottom
with the deferred decal technique.

One can observe that while the deferred decal “swims” on top of the underlying
surface, not stretching accordingly, the skinned decal stretches and follows the
underlying surface correctly. Thus, in contrast to the skinned decal, the deferred decal
breaks the illusion of dynamically changing the target surface. It is also visible that
the skinned decal has the same rendering quality as the deferred decal.

5.6 Conclusion
We presented a system to efficiently add high quality, dynamic decals on top of skinned
meshes that correctly follow the underlying mesh surface. It supports common render-
ing systems and can be easily integrated into existing rendering engines.

Bibliography
BRONX. 2011. Deferred decals. Blog post, URL:
http://broniac.blogspot.ca/2011/06/deferred-decals.html.
KATSBITS. 2014. URL: http://www.katsbits.com/download/models/md5-example.php.
KRASSNIGG, J. 2010. A Deferred Decal Rendering Technique. In Game Engine Gems 1, Jones
and Bartlett, pp. 271–280.
PERSSON, E. 2011. Volume Decals. In GPU Pro 2, A K Peters, pp. 115–120.
II
Environmental
Effects

One of the coolest features of games nowadays are effects that are typically observed
in the environment. In general, they stand out as visually impressive simulations of
real-world occurrences.
This edition of GPU Zen has two articles in the Environmental Effects section. The first
article, “Real-Time Fluid Simulation in Shadow of the Tomb Raider” by Peter Sikachev,
Martin Palko, and Alexandre Chekroun, describes how the latest Tomb Raider installment
simulates and renders real-time fluids. The contributions of this article to the field
of real-time fluid simulation are:

• Fluid simulation solution for vast areas using scrollable textures.

• Easy and effective method for factoring in viscosity.

• Frame rate-independent inflow algorithm.

• PIC-based (particle-in-cell) numerically stable approach for algae simulation.

• Practical implementation, running on a 256 × 256 grid within 0.5 ms on Xbox
One.

The second article in this section, “Real-time Snow Deformation in Horizon Zero Dawn:
The Frozen Wilds” by Kevin Örtegren, describes the technique shown on the cover of this
book. This technique models interactions of dynamic characters and objects with the
environment while being persistent and scalable. The development requirements
were:

• We needed real-time snow deformation for any dynamic object.

• It had to work in a massive open world and on top of certain static objects, like
rocks and rooftops.

• It had to run very fast on the GPU, since the GPU frame was already laid out
and optimized for the base game.

• No major asset refactor could be done, to avoid adding to the expansion download
size and because we could not spare artists to go through all the content
manually and fine-tune it for a new system.

—Wolfgang Engel

1
II

Real-Time Fluid Simulation in


Shadow of the Tomb Raider
Peter Sikachev, Martin Palko,
and Alexandre Chekroun

1.1 Introduction
Fluid simulation has been a thoroughly investigated topic among academics and in the VFX
industry. However, in practice, fluid simulation has been implemented in very few major
AAA game titles.
In this chapter, we present our real-time fluid simulation implementation in the
Shadow of the Tomb Raider game. We simulate a dynamic interaction between charac-
ters and fluid substance: oil spots on water, floating algae, etc.
In particular, our contributions are:

• Fluid simulation solution for vast areas using scrollable textures (Section 1.3.3).

• Easy and effective method for factoring in viscosity (Section 1.3.6).

• Frame rate-independent inflow algorithm (Section 1.3.4).

• PIC-based (particle-in-cell) numerically stable approach for algae simulation
(Section 1.3.8).

• Practical implementation, running on a 256 × 256 grid within 0.5 ms on Xbox
One.

1.2 Related Work


As mentioned before, real-time fluid simulation has been widely researched outside
of game development. A good primer on fluid simulation can be found in
Bridson et al. [2006] and Bridson and Müller-Fischer [2007]. A very detailed,
beginner-friendly explanation of the mathematics behind the fluid equations is given by
Grinspun [2018].
The first GPU-accelerated implementation for fluid dynamics was done by Harris
[2006]. Our implementation is, by and large, based on his work.
A subsequent chapter in GPU Gems 3 [Crane et al. 2008] generalizes the simulation
to the 3D case and proposes handling for moving objects. A compute shader-based
implementation of fluid simulation and volumetric rendering for fire, with full source
code, can be found in Vlietinck [2009].
The key limitation of the abovementioned approaches is that they run fluid
simulation on very small volumes, which makes them impractical for a real use case. In
this chapter, we propose a method that builds on the results of these approaches,
overcomes this limitation, and is practically feasible for a game on current-gen (Xbox
One, PlayStation 4) game console hardware.

1.3 Simulation
In this section we describe how fluid simulation works in the Shadow of the Tomb
Raider engine. First, we give a brief primer on fluid simulation theory in
Section 1.3.1. After that, we show the simulation data flow as a block diagram and
discuss each step in Section 1.3.2. We discuss the details of particular features in the
subsequent sections.

1.3.1 Fluid Simulation 101


Fluid simulation is typically done in two distinct ways. In the Lagrangian approach,
the individual fluid particles and their properties are tracked. The Eulerian approach,
on the other hand, treats fluid as a scalar/vector field and tracks field values on a grid.
As in Harris [2004], we use the Eulerian approach for most of the simulation.
However, for algae, we use a mixed Eulerian/Lagrangian approach, as described in
Section 1.3.8.
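The cornerstone of fluid simulation is the famous Navier-Stokes equation:

$$\frac{\partial \mathbf{u}}{\partial t} = -\mathbf{u} \cdot \nabla \mathbf{u} - \frac{1}{\rho} \nabla p + \mathbf{g} + \nu \Delta \mathbf{u}$$

$$\nabla \cdot \mathbf{u} = 0$$

where $\partial \mathbf{u} / \partial t$ is the velocity time derivative, $\mathbf{u} \cdot \nabla \mathbf{u}$ is the advection term,
$\frac{1}{\rho} \nabla p$ is the pressure term, $\mathbf{g}$ stands for the acceleration from external forces (such as
obstacles moving inside a fluid), and $\nu \Delta \mathbf{u}$ is the viscosity term. $\nabla \cdot \mathbf{u} = 0$ is the
incompressibility condition (alternatively, one may say that the respective vector field is
divergence-free).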
We ignore the viscosity term, as this simplifies the calculations, and there is
a simpler (though not physically correct) way to take it into account. The Navier-Stokes
equation is solved numerically by decomposing it into the following main steps:

• Advection and external forces application.

• Solving the Poisson pressure equation.

• Projection (by subtracting the pressure gradient from the divergent velocity).

Please refer to Harris [2004] or Bridson and Müller-Fischer [2007] for a more detailed
description of these stages. In the rest of the chapter, we will focus on the specifics of
our implementation.
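
As a compact reminder of what the last two steps compute, the projection can be written
as follows. This is the standard formulation found in Harris's GPU Gems chapter; folding the
time step and density into the pressure and assuming unit grid spacing are simplifications
made for this sketch, not a statement about the shipped shaders. Here $\mathbf{w}$ denotes the
divergent velocity obtained after advection and external forces:

$$\nabla^2 p = \nabla \cdot \mathbf{w}, \qquad \mathbf{u} = \mathbf{w} - \nabla p$$

Discretizing the Laplacian on a grid with unit spacing gives the Jacobi update for
iteration $k + 1$:

$$p_{i,j}^{(k+1)} = \frac{1}{4} \left( p_{i-1,j}^{(k)} + p_{i+1,j}^{(k)} + p_{i,j-1}^{(k)} + p_{i,j+1}^{(k)} - (\nabla \cdot \mathbf{w})_{i,j} \right)$$

This has the same form as the last statement of the pressure shader in Listing 1.4.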

1.3.2 Simulation Data Flow


As in Harris [2004], fluid simulation in Shadow of the Tomb Raider is implemented as
a sequence of image processing operations. We use compute shaders to perform these
operations.
Figure 1.1 shows the main stages of the method and the input and output data for
each of them. The density and velocity textures are persistent, i.e., they are statically
allocated and preserved between frames. All other textures are allocated on the frame heap
and are not kept. When the first frame is rendered for a surface with fluid simulation, the
density and velocity textures are cleared. Figure 1.2 shows the end results alongside
different simulation stages. In the subsequent subsections, we examine in detail those
simulation steps that differ substantially from Crane et al. [2008].

1.3.3 Scrollable Grid


In order to cover large areas with simulated fluid at high resolution, we introduce a
scrollable grid. In a nutshell, instead of simulating the whole area, we run the simulation
only for an area around the main character. On the edges of the simulation area, the
density map is cross-faded into a static texture map. Figure 1.3 shows a debug view where
the simulation boundaries of the scrollable grid can be seen very clearly.
The following code snippet shows how the shift of the grid is calculated on each
frame on the CPU:

int2 newPos = int2(characterPos)
    * fluidSimGridResolution / int(simulationAreaSize);
int2 deltaPos = oldPos - newPos;

where characterPos is the character’s world-space xy position (assuming z points up),
fluidSimGridResolution is the resolution of the simulation grid, simulationAreaSize
is the size of the simulated area, and oldPos is the snapped position calculated on the
previous frame. Listing 1.1 shows the grid shifting code executed on the GPU. When the
character moves, we shift the grid accordingly, so that the character is always at its
center. As velocity and density are the only persistent textures, only they have to be
shifted. The grid is snapped to the world in order to avoid jittering when the character
moves. Figure 1.4 shows the simulation window within a bigger static mask map.
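
The CPU-side snapping and the GPU-side shift can be illustrated together with a small
CPU sketch in Python. It is only an illustration: the function names, the fill callback,
and the convention that the shift equals the new snapped position minus the old one
(standing in for g_vUVShift) are assumptions, not the engine's code.

```python
def snapped_cell(character_pos, grid_res, area_size):
    """Snap a world-space xy position to integer grid-cell units.

    int() truncates toward zero, mirroring the integer casts in the CPU snippet.
    """
    return tuple(int(int(c) * grid_res / int(area_size)) for c in character_pos)


def scroll_grid(grid, shift, fill):
    """Shift a persistent 2D grid; cells with no source texel come from fill(x, y)."""
    shift_x, shift_y = shift
    height, width = len(grid), len(grid[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            src_x, src_y = x + shift_x, y + shift_y
            if 0 <= src_x < width and 0 <= src_y < height:
                row.append(grid[src_y][src_x])
            else:
                # Newly exposed cell: take the value from the static map.
                row.append(fill(x, y))
        out.append(row)
    return out


old_cell = snapped_cell((0.0, 0.0), 4, 4)
new_cell = snapped_cell((1.0, 0.0), 4, 4)
# Assumed sign convention: content moves opposite to the character.
shift = (new_cell[0] - old_cell[0], new_cell[1] - old_cell[1])
grid = [[10 * y + x for x in range(4)] for y in range(4)]
scrolled = scroll_grid(grid, shift, lambda x, y: -1)
```

In this toy run the character moves one cell in +x, so every row slides one cell to the
left and the rightmost column is refilled by the callback, which plays the role of the
static density map.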

Figure 1.1. Fluid simulation data flow. Blue figures show algorithm steps. Green rectangles
show input texture data for each of the algorithm steps and orange ones show the output data.

Figure 1.2. Oil simulation. On the top, the auxiliary render targets are shown, cropped and
remapped for better readability. Maps, from left to right: velocity, obstacle, obstacle velocity,
divergence, pressure, and pressure gradient.

Figure 1.3. Static map (white) and simulated area (red) debug view.

float2 GetScrolledCoords(float2 coords)
{
    float2 tc = mul(float4(coords * g_vSimulationAreaSize
        + g_vPosition, 0, 0), g_mInverseTransform).xy;
    tc = (tc.xy + g_vProxyHalfSize) * g_vInverseProxySize;
    return tc;
}

RWTexture2D<float4> g_uavOutput : register(u0);
Texture2D<float4> g_texInput : register(t0);
#if EffectType == fxDensity
Texture2D<float4> g_texInflow : register(t1);
#endif

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
    // Source texel for this cell after the grid has been shifted.
    int2 texCoord = int2(dtID.xy) + g_vUVShift.xy;
#if EffectType == fxDensity
    // Cells scrolled in from outside the old window are filled
    // from the static density map.
    if (any(texCoord < 0) ||
        any(texCoord >= g_FluidSimGridRes))
    {
        float2 coords = (float2(dtID.xy) + 0.5f)
            * g_InvFluidSimGridRes;
        coords = GetScrolledCoords(coords);
        g_uavOutput[dtID.xy] = g_texInflow.SampleLevel(
            BilinearSampler, coords, 0.0f).rgrg;
    }
    else
#endif
    {
        g_uavOutput[dtID.xy] = g_texInput[texCoord];
    }
}

Listing 1.1. Grid scrolling shader. g_vPosition stands for the character’s position relative to
the object with fluid simulation, g_v(Inverse)Proxy(Half)Size stands for the size of the object
in world units, and g_texInflow is a texture with the static density map.

1.3.4 Density Inflow


If you take a volume with two differently colored fluids and stir it long enough,
eventually both fluids mix to the point where they become one uniformly colored fluid. At
this point, no matter how you interact with the fluid, you never see any effect because,
practically speaking, there is now only one fluid.

Figure 1.4. Static density map (white) and simulated density area (red).

Over time, the density, disturbed by fluid dynamics, should fade back into the
static texture. We call this process of fading back density to the default value density
inflow.
Listing 1.2 shows how the inflow is added into the density map. It introduces the
following constants:

• g_FadeoutStart is the point (in [0, 1] space on the simulation grid) at which
fadeout starts, and g_FadeoutLength = 1 / (1 − g_FadeoutStart).

• g_DensityExponent defines how inflow depends on the density itself.

• g_DissipationFactor is basically the speed at which the density map fades
back to the static texture map, multiplied by the time elapsed since the previous
frame.

There are a few tricks that deserve further explanation.
First, we use densityExponent in order to differentiate the speed at which high-
and low-density areas dissipate. This feature is used primarily for algae simulation: we
want to get the effect of algae ‘closing back’ behind the player, so the higher the density
is, the lower the dissipation speed is.
Second, dissipationFactor is made exponential in order to make the technique
frame rate-independent. As mentioned in the description of the constants above,
g_DissipationFactor incorporates the elapsed time. Since $e^{t_1} e^{t_2} = e^{t_1 + t_2}$, multiplying by
the exponential dissipation factor over one frame of 30 ms is equivalent to multiplying
twice by the exponential factor over two frames of 15 ms each, for instance.

RWTexture2D<float> g_uavOutput : register(u0);
Texture2D<float> g_texInput : register(t0);
Texture2D<float> g_texInflow : register(t1);

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
    float2 coords = (float2(dtID.xy) + 0.5f) * g_InvFluidSimGridRes;

    // 0 at the center of the simulation window, 1 at its edge.
    float proximityFactor = saturate(2.0f * length(coords - 0.5f));
    proximityFactor = 1.0f - saturate(proximityFactor
        - g_FadeoutStart) * g_FadeoutLength;

    coords = GetScrolledCoords(coords);

    float currentValue = g_texInput.Load(int3(dtID.xy, 0));
    float fetchedValue = g_texInflow.SampleLevel(
        BilinearSamplerClamp, coords, 0.0f);

    float densityExponent = pow(max(currentValue, FLT_MIN),
        g_DensityExponent);

    // Exponential in the elapsed time, hence frame rate-independent.
    float dissipationFactor = exp(-g_DissipationFactor
        * densityExponent);

    float result = lerp(fetchedValue, currentValue,
        dissipationFactor);

    // Keep the result within proximityFactor of the static map value.
    result = fetchedValue + sign(result - fetchedValue)
        * min(abs(result - fetchedValue), proximityFactor);

    g_uavOutput[dtID.xy] = result;
}

Listing 1.2. Density inflow shader.

Last, proximityFactor is there to cross-fade the density at the border. Initially, we
tried to handle this in the material shader, but that fails if the character starts to walk
backwards: when the simulated part at the border is scrolled to the center, a clear
border between the simulated and non-simulated areas becomes visible. What this code
does, essentially, is ‘shift’ values towards the static map values.
For example, a value on the simulation area boundary is always equal to the static map
value. If fadeout starts from the very center of the simulation area (i.e.,
g_FadeoutStart = 0), then any value halfway to the edge (e.g., at (0, 0.25) or (0.75, 0) of
the simulation grid) should be within 0.5 of the static map value. If it is not, it is
‘shifted’ towards it. This avoids discontinuities between the simulated and non-simulated
areas for any pattern of character locomotion.
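
The per-texel arithmetic of the inflow can be checked on the CPU with a short Python
sketch. It mirrors the structure of Listing 1.2 for a single value; the function name,
the parameter names, and the split of the dissipation constant into a rate and a time
step are assumptions made for illustration.

```python
import math


def inflow_step(current, static, rate, dt, density_exponent=0.0, proximity=1.0):
    """One inflow update for a single texel (scalar version of the shader logic)."""
    density_term = max(current, 1e-30) ** density_exponent
    keep = math.exp(-rate * dt * density_term)
    # lerp(static, current, keep)
    result = static + (current - static) * keep
    # Clamp the distance from the static value to the proximity factor.
    diff = result - static
    return static + math.copysign(min(abs(diff), proximity), diff)


# One 30 ms frame versus two 15 ms frames, with a density exponent of zero.
one_frame = inflow_step(1.0, 0.2, rate=2.0, dt=0.030)
two_frames = inflow_step(inflow_step(1.0, 0.2, rate=2.0, dt=0.015),
                         0.2, rate=2.0, dt=0.015)

# No dissipation, halfway to the edge: the value is pulled to within 0.5
# of the static value.
clamped = inflow_step(1.0, 0.2, rate=0.0, dt=0.030, proximity=0.5)
```

With a density exponent of zero the two results agree up to rounding. With a nonzero
exponent the density term changes between the two sub-steps, so the match is only
approximate.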

1.3.5 Obstacle Injection


Unlike Crane et al. [2008], we do not voxelize 3D meshes to get obstacle data, as this
would have been a highly impractical use of resources for a video game. Instead, we
reuse collision primitives that are already used for other purposes in the game engine.
In Shadow of the Tomb Raider, character collision is defined by an array of capsules.
We find the intersections of all capsules with the fluid plane.
The game code allows us to query the NPCs closest to the main character. We store
up to a maximum of 32 NPCs within a radius equal to the fluid object extents in a hash
table, where the character’s ID serves as the key. Characters from the previous frame that
are not returned by the query are evicted from the hash table.
Thus, we are able to track the motion of the characters’ collision capsules since the
previous frame and evaluate their velocities. We then inject obstacles and obstacle
velocities just as in Harris [2004].
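
The bookkeeping described above can be sketched in a few lines of Python. This is a
simplified illustration rather than the engine's implementation: it tracks one 2D position
per character instead of a set of capsules, and the names update_tracked and MAX_TRACKED
are hypothetical.

```python
MAX_TRACKED = 32  # upper bound on tracked NPCs, as stated in the text


def update_tracked(tracked, queried, dt):
    """Return (new_tracked, velocities) for this frame.

    tracked: dict mapping character ID -> (x, y) from the previous frame.
    queried: dict mapping character ID -> (x, y) returned by this frame's query.
    """
    kept = dict(list(queried.items())[:MAX_TRACKED])
    velocities = {}
    for char_id, pos in kept.items():
        prev = tracked.get(char_id)
        if prev is None:
            # First frame this character is seen: no history, assume it is at rest.
            velocities[char_id] = (0.0, 0.0)
        else:
            velocities[char_id] = ((pos[0] - prev[0]) / dt, (pos[1] - prev[1]) / dt)
    # Characters missing from the query are dropped, which is the eviction step.
    return kept, velocities


tracked = {1: (0.0, 0.0), 2: (5.0, 5.0)}
queried = {1: (1.0, 0.0), 3: (2.0, 2.0)}
tracked, velocities = update_tracked(tracked, queried, dt=0.5)
```

Character 2 is evicted because the query no longer returns it, character 1 gets a finite
difference velocity, and character 3 starts with zero velocity.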

1.3.6 Advection
Advection is the process of transferring fluid properties (including the velocity itself)
along the velocity field of the fluid. Listing 1.3 shows the advection shader. Besides
advection itself, the advection shader performs two other operations.
First, it adds velocity inflow from the static map. Second, it adds viscosity by
simply dampening the velocity map according to the viscosity values.
Different viscosities are used for the two fluids, and the density map from the
previous frame is used to blend between them. Although not physically correct,
this approach works very well in practice.
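
For readers who want to experiment with the idea on the CPU, the following Python sketch
shows backward-tracing (semi-Lagrangian) advection with bilinear sampling, plus the
viscosity blend expressed as a damping multiplier. It is a minimal sketch under stated
assumptions: velocities are in normalized texture units per second, sampling is clamped at
the edges, the velocity inflow term is omitted, and all function names are hypothetical.

```python
def saturate(x):
    return min(max(x, 0.0), 1.0)


def viscosity_damping(density, damping_water, damping_oil):
    """Blend the two damping coefficients by density, like the lerp in the shader."""
    return damping_water + (damping_oil - damping_water) * saturate(density)


def bilinear(field, u, v):
    """Sample a 2D list at normalized coords; texel centers sit at (i + 0.5) / n."""
    height, width = len(field), len(field[0])
    x = min(max(u * width - 0.5, 0.0), width - 1.0)
    y = min(max(v * height - 0.5, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = field[y0][x0] * (1 - fx) + field[y0][x1] * fx
    bottom = field[y1][x0] * (1 - fx) + field[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def advect(quantity, vel_x, vel_y, dt, damping=1.0):
    """Trace each cell center backwards along the velocity and sample there."""
    height, width = len(quantity), len(quantity[0])
    out = [[0.0] * width for _ in range(height)]
    for j in range(height):
        for i in range(width):
            u, v = (i + 0.5) / width, (j + 0.5) / height
            src_u = u - dt * bilinear(vel_x, u, v)
            src_v = v - dt * bilinear(vel_y, u, v)
            out[j][i] = damping * bilinear(quantity, src_u, src_v)
    return out


row = [[0.0, 1.0, 2.0, 3.0]]
moved = advect(row, [[0.25] * 4], [[0.0] * 4], dt=1.0)
damped = advect(row, [[0.25] * 4], [[0.0] * 4], dt=1.0, damping=0.5)
```

A uniform velocity of 0.25 over one second moves the content by exactly one texel on a
four-texel row, which makes the result easy to verify by hand.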

1.3.7 Poisson Pressure Jacobi Solver


The Poisson pressure equation in fluid simulation is solved using the Jacobi method,
an iterative method for solving systems of linear equations. One typically needs
around 20 iterations in order to get plausible results, which makes this part of
the algorithm a bottleneck. The pressure solver arithmetic is trivial, so the main cost is
the memory reads and writes.
While not much can be done about the memory writes, the reads can be optimized
quite a bit. Listing 1.4 shows the optimized Jacobi solver.

RWTexture2D<float4> g_uavOutput : register(u0);
Texture2D<float4> g_texInput : register(t0);
Texture2D<float4> g_texVelocity : register(t1);
Texture2D<float4> g_texObstacle : register(t2);
Texture2D<float4> g_texInflow : register(t3);
Texture2D<float4> g_texDensity : register(t4);

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
#if EffectType == fxVelocity
    // No velocity inside obstacles.
    if (g_texObstacle.Load(int3(dtID.xy, 0)).x > OBSTACLE_THRESHOLD)
    {
        g_uavOutput[dtID.xy] = 0.0f;
    }
    else
#endif
    {
        float2 coords = (float2(dtID.xy) + 0.5f)
            * g_InvFluidSimGridRes;

        float2 inflowCoords = GetScrolledCoords(coords);

        float fadeout = 1.0f - saturate(pow(dot(2 * coords - 1,
            2 * coords - 1), g_FadeoutPower));

        // Trace backwards along the velocity (plus the velocity inflow).
        float2 pos = coords - g_DeltaTime
            * (g_texVelocity.SampleLevel(BilinearSampler, coords,
            0.0f).xy + fadeout * g_Factor * (g_texInflow.SampleLevel(
            BilinearSampler, inflowCoords, 0.0f).xy - 0.5f));

        g_uavOutput[dtID.xy] =
#if EffectType == fxVelocity
            // Viscosity: dampen the velocity, blending by density.
            lerp(g_ViscosityWater, g_ViscosityOil, saturate(g_texDensity.
            SampleLevel(BilinearSampler, coords, 0.0f).x)) *
#endif
            g_texInput.SampleLevel(BilinearSampler, pos, 0.0f);
    }
}

Listing 1.3. Advection shader. g_FadeoutPower is a fadeout exponent for the velocity inflow
map, g_DeltaTime is the elapsed time from the previous frame, and g_ViscosityOil and
g_ViscosityWater are oil and water viscosity coefficients, respectively.

First, we use local data share (LDS) memory to prefetch from VRAM. Since most
fetches (except for the boundary ones) are shared between multiple threads, we
effectively reduce the bandwidth needed per thread.
Second, we exploit Gather instructions to optimize the fetches. We found that
this is highly beneficial even without LDS, when two Gather instructions are used
to fetch just four texels. In our case, when LDS is used, we can group together fetches
corresponding to different threads and thus minimize the waste. In addition, we keep all
transient textures in ESRAM to improve bandwidth.

RWTexture2D<float> g_uavOutput : register(u0);
Texture2D<float4> g_texVelocityDivergence : register(t0);
Texture2D<float4> g_texPressure : register(t1);
Texture2D<float4> g_texObstacle : register(t2);

groupshared float pressureLDS[GROUP_SIZE + 2][GROUP_SIZE + 2];
groupshared float obstacleLDS[GROUP_SIZE + 2][GROUP_SIZE + 2];

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID,
    uint3 grtID : SV_GroupThreadID)
{
    // Load data to LDS
    if ((grtID.x < (GROUP_SIZE + 2) / 2) &&
        (grtID.y < (GROUP_SIZE + 2) / 2))
    {
        float2 coordsNormalized = float2(dtID.xy) *
            g_InvFluidSimGridRes;
        coordsNormalized += float2(grtID.xy) * g_InvFluidSimGridRes;
        float4 pressureSample = g_texPressure.Gather(
            PointSampler, coordsNormalized);

        float topLeftPressure = pressureSample.w;
        float bottomLeftPressure = pressureSample.x;
        float topRightPressure = pressureSample.z;
        float bottomRightPressure = pressureSample.y;

        pressureLDS[grtID.x*2][grtID.y*2] = topLeftPressure;
        pressureLDS[grtID.x*2][grtID.y*2+1] = bottomLeftPressure;
        pressureLDS[grtID.x*2+1][grtID.y*2] = topRightPressure;
        pressureLDS[grtID.x*2+1][grtID.y*2+1] = bottomRightPressure;
        /* Do the same for obstacle */
    }

    GroupMemoryBarrierWithGroupSync();

    int3 coords = int3(dtID.xy, 0);
    int2 ldsID = (int2)grtID.xy + int2(1, 1);

    float left = pressureLDS[ldsID.x - 1][ldsID.y];
    float right = pressureLDS[ldsID.x + 1][ldsID.y];
    float bottom = pressureLDS[ldsID.x][ldsID.y + 1];
    float top = pressureLDS[ldsID.x][ldsID.y - 1];
    float center = pressureLDS[ldsID.x][ldsID.y];
    /* Do the same for obstacle */

    // If cell is solid, set pressure to center value instead
    if (obstacleLeft > OBSTACLE_THRESHOLD)
    {
        left = center;
    }
    /* Do the same for other sides */

    float velocityDivergence = g_texVelocityDivergence.Load(coords).x;

    g_uavOutput[dtID.xy] =
        0.25f * (left + right + bottom + top - velocityDivergence);
}

Listing 1.4. Poisson pressure shader. Some parts were edited in order to fit on the page.
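
Stripped of the LDS and Gather optimizations, the arithmetic of the solver is a plain
Jacobi relaxation. The Python sketch below is a CPU reference for that arithmetic only. It
assumes a zero initial pressure and treats cells outside the grid like solid cells (their
pressure is replaced by the center value); both choices, and the function name, are
assumptions of this sketch.

```python
def jacobi_pressure(divergence, iterations=20):
    """Run Jacobi iterations of p = 0.25 * (left + right + bottom + top - divergence)."""
    height, width = len(divergence), len(divergence[0])
    pressure = [[0.0] * width for _ in range(height)]

    def sample(field, x, y, center):
        # Out-of-grid neighbors fall back to the center value.
        if 0 <= x < width and 0 <= y < height:
            return field[y][x]
        return center

    for _ in range(iterations):
        # Jacobi reads only the previous iteration, so write to a second buffer.
        updated = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                center = pressure[y][x]
                left = sample(pressure, x - 1, y, center)
                right = sample(pressure, x + 1, y, center)
                top = sample(pressure, x, y - 1, center)
                bottom = sample(pressure, x, y + 1, center)
                updated[y][x] = 0.25 * (left + right + bottom + top - divergence[y][x])
        pressure = updated
    return pressure


divergence = [[0.0, 0.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.0, 0.0, 0.0]]
after_one = jacobi_pressure(divergence, iterations=1)
after_two = jacobi_pressure(divergence, iterations=2)
```

The double buffering corresponds to the ping-pong between the pressure input texture and
the output UAV across dispatches.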

1.3.8 Algae Simulation


Initially, our simulation method was designed for incompressible fluids, like oil or
water. After experimenting with it, the VFX artist asked if they could reuse the same
method for algae on the surface of a swamp. Figure 1.5 shows the final result for algae
simulation alongside the density maps.
The key problem with reusing fluid simulation directly for algae is that the algae
is, in essence, made of ‘macroparticles’. If you walk through algae, those particles
move out of your way and, effectively, clump together or overlap with each other.
If we map the number of particles per unit area to density, algae thus behaves as a
‘compressible fluid’. Practically speaking, if you walk through an area that is uniformly
covered with algae simulated using the Navier-Stokes equation, you will not see any
interaction.
Fortunately, the simulation of similar substances has already been researched in
academia [Zhu and Bridson 2005]. We used the idea of a mixed Eulerian-Lagrangian
simulation known as particle-in-cell (PIC). In a nutshell, every frame we convert density
into particles, advect them using the velocity field, and then convert them back into
density by accumulating the particles’ densities via additive blending.
Figure 1.6 shows how density is calculated for algae simulation (all other steps
stay the same). We discuss these stages (except clear, which is self-explanatory) in more
detail below.

Figure 1.5. Algae simulation. In the inset, the static density inflow map is shown (white) with
the simulated density area (red).

Figure 1.6. Algae fluid simulation data flow (density advection only).

Particle Density Accumulation. Listing 1.5 shows the shader that advects particles
and accumulates particle density. We allocate a one-channel 32-bit unsigned integer
accumulation texture. Its resolution is the original simulation grid resolution
multiplied by ALGAE_PARTICLE_GRID_MULTIPLIER.
Effectively, we quantize density into ALGAE_DENSITY_QUANTIZATION levels. We
add a 0.5 offset in order to avoid energy loss during conversions. This quantization is
needed because in Shader Model 5.0, atomic operations (such as InterlockedAdd)
work only with integer values.

Particle Density Resolution. Listing 1.6 shows the shader resolving particles back to
density. The denominator contains ALGAE_PARTICLE_GRID_MULTIPLIER to the power of two
because (before advection) each simulation grid cell contains
ALGAE_PARTICLE_GRID_MULTIPLIER × ALGAE_PARTICLE_GRID_MULTIPLIER particles.
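
The round trip from density to particles and back can be reproduced on the CPU. The
Python sketch below follows the structure of Listings 1.5 and 1.6 with a serial loop in
place of atomic adds. The function names, the velocity callback, and the explicit bounds
check (the shader relies on out-of-range UAV writes being discarded) are assumptions of
this sketch.

```python
QUANTIZATION = 65536  # integer scale used for the accumulated density
MULTIPLIER = 2        # particles per grid cell along each axis


def scatter(density, velocity, dt):
    """Advect MULTIPLIER x MULTIPLIER particles per cell and accumulate integers."""
    n = len(density)
    accum = [[0] * n for _ in range(n)]
    for py in range(n * MULTIPLIER):
        for px in range(n * MULTIPLIER):
            u = (px + 0.5) / (n * MULTIPLIER)
            v = (py + 0.5) / (n * MULTIPLIER)
            cell_density = density[py // MULTIPLIER][px // MULTIPLIER]
            vel_x, vel_y = velocity(u, v)
            u += dt * vel_x
            v += dt * vel_y
            if 0.0 <= u < 1.0 and 0.0 <= v < 1.0:
                # The 0.5 offset rounds to nearest instead of truncating.
                accum[int(v * n)][int(u * n)] += int(cell_density * QUANTIZATION + 0.5)
    return accum


def resolve(accum):
    """Convert accumulated integers back to a floating-point density."""
    scale = QUANTIZATION * MULTIPLIER * MULTIPLIER
    return [[value / scale for value in row] for row in accum]


density = [[0.25, 0.5],
           [0.75, 1.0]]
at_rest = resolve(scatter(density, lambda u, v: (0.0, 0.0), dt=1.0))
pushed = resolve(scatter(density, lambda u, v: (0.5, 0.0), dt=1.0))
```

With zero velocity the round trip returns the input exactly. With a uniform push to the
right, the left column lands in the right column and the former right column leaves the
grid.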

1.4 Engine Integration


In order to be useful, fluid simulation needs to be properly integrated in the game en-
gine. In the following sections, we discuss the way fluid simulation was made into a
component and how it is used within the Shadow of the Tomb Raider material system.

RWTexture2D<int> g_uavOutput : register(u0);
Texture2D<float4> g_texDensity : register(t0);
Texture2D<float4> g_texVelocity : register(t1);

#define ALGAE_DENSITY_QUANTIZATION 65536.0f
#define ALGAE_PARTICLE_GRID_MULTIPLIER 2

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
    float2 coords = (float2(dtID.xy) + 0.5f) *
        g_InvFluidSimGridRes / ALGAE_PARTICLE_GRID_MULTIPLIER;
    float density = g_texDensity.Load(
        int3(dtID.xy / ALGAE_PARTICLE_GRID_MULTIPLIER, 0)).r;
    coords += g_DeltaTime * g_texVelocity.SampleLevel(
        BilinearSampler, coords, 0.0f).xy;

    uint2 uCoords = uint2(coords * g_FluidSimGridRes);
    InterlockedAdd(g_uavOutput[uCoords],
        int(density * ALGAE_DENSITY_QUANTIZATION + 0.5f));
}

Listing 1.5. Particle density accumulation shader.



RWTexture2D<float4> g_uavOutput : register(u0);
Texture2D<int4> g_texInput : register(t0);

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
    g_uavOutput[dtID.xy] = (float4(g_texInput.Load(
        int3(dtID.xy, 0))) / (ALGAE_DENSITY_QUANTIZATION
        * float(ALGAE_PARTICLE_GRID_MULTIPLIER
        * ALGAE_PARTICLE_GRID_MULTIPLIER)));
}

Listing 1.6. Particle density resolution shader.

1.4.1 Fluid Component Architecture


In Shadow of the Tomb Raider (as in many other games), we utilize a component
architecture. Essentially, that means that every entity (or Instance, in engine terms)
in the game has an array of components of different types attached to it. For each
component type, there is usually a respective Manager class instance that performs
certain operations for all the components of the same type (e.g., drawing them all
together).
For the component types that must be rendered, there is usually an auxiliary Drawable
type. The drawable is created on the main thread, and then its Draw() virtual function
is called by the render thread. The drawables are allocated on the frame heap, so
they are deleted once the frame has been rendered. Thus, we can safely pass information
from the main thread to the render thread. We decided to implement fluid simulation
as a component within this architecture.
Figure 1.7 shows how fluid simulation is integrated into the Shadow of the Tomb
Raider engine architecture. When a fluid component is attached to an instance,
it is created and added to this instance’s array of components. Additionally, a reference
to this fluid component is added to the array of fluid components within the fluid
component manager.
When the Process() function of the manager is called from the main thread, it loops
through all of the visible instances that have fluid components attached and picks the one
that is closest to the player’s character. We use Equation (1.1) as the distance function,
where $(x, y)$ is the main character’s relative position within the object with the fluid
component, normalized to the object’s extents, with $(0, 0)$ being the center of the object.

$$d = \max(|x|, |y|) \tag{1.1}$$

Figure 1.7. Fluid component architecture. Blue boxes stand for classes, green boxes stand for
threads. Rhombus-ending lines stand for aggregation, and arrowhead-ending lines stand for
function calls. The main thread creates a fluid manager, components, and a drawable, while the
rendering thread calls the Draw() function of the fluid drawable.

After that, a fluid simulation drawable is created on the main thread and appended
to the drawables list. This drawable encapsulates all fluid simulation parameters (e.g.,
inflow texture maps, viscosity, dissipation, and fadeout factors) needed to perform
a fluid simulation step.
Finally, when the render thread flushes the drawables list, the Draw() function of
the fluid simulation drawable is called. This is the place where the simulation happens:
compute shaders for the respective algorithm stages are dispatched.
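
The selection performed in Process() amounts to a minimum search with the distance of
Equation (1.1). The Python sketch below shows that search in isolation. The dictionary
layout, the use of half extents for normalization, and the function names are illustrative
assumptions and not the engine's data model.

```python
def normalized_distance(char_pos, center, half_extents):
    """Equation (1.1): d = max(|x|, |y|) in the object's normalized space."""
    x = (char_pos[0] - center[0]) / half_extents[0]
    y = (char_pos[1] - center[1]) / half_extents[1]
    return max(abs(x), abs(y))


def pick_closest(char_pos, components):
    """Return the fluid component closest to the character, or None if there is none."""
    return min(
        components,
        key=lambda c: normalized_distance(char_pos, c["center"], c["half_extents"]),
        default=None,
    )


components = [
    {"name": "pond", "center": (0.0, 0.0), "half_extents": (10.0, 10.0)},
    {"name": "swamp", "center": (30.0, 0.0), "half_extents": (5.0, 5.0)},
]
chosen = pick_closest((24.0, 0.0), components)
```

Because the distance is normalized per object, a small object can win over a larger one
whose center is nearer in world units, as happens in this example.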

1.4.2 Integration with Material System


Having performed the simulation, we need to actually use it for a particular object. This
is where the Shadow of the Tomb Raider material system comes into play.
The Shadow of the Tomb Raider material system is based on shader nodes. Every
material is represented by a graph where vertices are shader nodes. Each shader node,
in its turn, is an entity that has one or more outputs and, usually, one or more inputs
(with constant nodes being a notable exception to the latter).
Figure 1.8 shows the fluid simulation node within the material graph. The inputs
(left) are UVOffset, which defines an additional artist-defined UV shift, and StaticMask,
a fetch from the inflow texture (this has to be exactly the same texture as the one
used for simulation). The outputs (right) are Color, a 4-component floating-point
vector containing the density value, velocity field, and pressure packed into its
components, and PressureGradient, a precomputed gradient of the scalar pressure value.

Figure 1.8. Fluid simulation node.

Listing 1.7 shows the implementation of the fluid simulation shader node. g_PlayerPos
and g_InvSimulationSize are global constants set by the fluid simulation
drawable. We use a simple binary fadeout, as there is already a fadeout in the simulation,
but a more complex fadeout function could be used. FluidSimulationTexture and
PressureGradientTexture are global textures to which the fluid simulation outputs its
results, and they are persistent between frames.

<input type="float2" name="UVOffset"/>
<input type="float" name="StaticMask"/>
<output type="vector4" name="Color"/>
<output type="vector2" name="PressureGradient"/>

float2 texCoord = g_PlayerPos.xy - v_WorldPosition.xy;
texCoord *= g_InvSimulationSize;
texCoord += UVOffset;
// Binary fadeout: 1 inside the simulated window, 0 outside.
float fadeout = all(abs(texCoord) < FADEOUT_THRESHOLD) ? 1 : 0;
texCoord += 0.5f;

Color = FluidSimulationTexture.SampleLevel(
    SamplerGenericAnisoBorder, float3(texCoord, 0), 0);
Color = lerp(float4(StaticMask.r, 0, 0, 0), Color, fadeout);
PressureGradient = PressureGradientTexture.SampleLevel(
    SamplerGenericAnisoBorder, float3(texCoord, 0), 0);

Listing 1.7. Fluid mask shader node. g_InvSimulationSize is the inverse of the size of the
simulation area (in world units). g_PlayerPos is the player’s character position, multiplied by
the size of the simulation area and divided by the grid resolution.

1.4.3 Oil Shading


Shadow of the Tomb Raider features realistic water rendering with reflection, refraction,
and light absorption effects. We use a simple linear interpolation between the water
material and the oil material, controlled by the mask output by the fluid simulation.
To add more visual interest, we add a subtle iridescent effect on top of the oil using a
look-up table (see Figure 1.9) controlled by an oil thickness parameter and a Fresnel
factor. See Listing 1.8 for more details.

Figure 1.9. 128 × 128 iridescence look-up table.

float t = Thickness;
float a = abs(dot(CameraVector, Normal));

float3 lutSample = IridescenceLUT_texture.SampleLevel(
    SamplerGenericBilinearClamp, float2(t, a), 0).rgb - 0.5;
float intensity = Intensity * 4 * (FresnelIn * (1 - FresnelIn));
FresnelOut = saturate(lutSample * intensity + FresnelIn);

Listing 1.8. Iridescence shader.

1.4.4 Algae Shading

Algae shading works in a similar way, but we put an emphasis on breaking up the
smooth edges produced by the fluid simulation mask with a high-frequency mask (see
Figure 1.10) that represents the microstructure of algae found in swamps. Listing 1.9
shows more details about our shader implementation.

Figure 1.10. Algae micro detail mask.

float algaeMicro; // Algae micro mask texture input
float fluidsim; // Fluid simulation mask input
float algaeMaskSpread; // Parameter for the spread of algae
float algaeMaskDarkness; // Parameter for algae mask color
float algaeTransSharp; // Parameter for transition sharpness

float mask = (algaeMicro * fluidsim * algaeMaskSpread) +
    (algaeMicro * (1 - fluidsim) * algaeMaskDarkness) + fluidsim;
float algaeMaskTransition = saturate(pow(mask, algaeTransSharp));

Listing 1.9. Algae transition mask shader.

1.5 Optimization
We have already shown how to optimize one of the bottlenecks: solving the Poisson
pressure equation. However, there are a few more issues that need to be addressed.

1.5.1 Managing Many Obstacles

Originally, we were planning to handle only the main character’s interaction with the
fluid. Therefore, when rendering obstacles and obstacle velocities, it was possible to
simply loop over all collision capsules in every simulation grid cell.
Later on, we decided to add support for multiple characters that are close to the
player. In this case, there could easily be too many capsules to evaluate for each grid
cell, so a better solution is needed.
We use an approach similar to 2D clustered lighting. We subdivide the simulation
grid into 32 × 32 tiles. Then, on the CPU, we calculate which capsules intersect which
tiles and store this information in a StructuredBuffer. We use a StructuredBuffer rather
than a ConstantBuffer because of the storage limitations of the latter. Finally, instead of
querying all the capsules in each grid cell, we only query the capsules from the respective
tile. In practice, this was good enough to handle a realistic number of NPCs around the
main character.
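
The CPU-side binning can be sketched as follows in Python. This is a simplified
illustration under stated assumptions: each capsule's footprint on the fluid plane is
approximated by a circle given in grid cells, the overlap test uses the circle's bounding
box (conservative), and the tile size of 32 cells in the example as well as the function
name are choices made for this sketch.

```python
def bin_capsules(capsules, grid_res, tile_size):
    """Return (tiles, tiles_per_side); tiles[ty * tiles_per_side + tx] lists capsule indices.

    capsules: list of (center_x, center_y, radius) in grid cells.
    """
    tiles_per_side = (grid_res + tile_size - 1) // tile_size
    tiles = [[] for _ in range(tiles_per_side * tiles_per_side)]
    for index, (cx, cy, radius) in enumerate(capsules):
        # Range of tiles touched by the bounding box, clamped to the grid.
        x0 = max(int((cx - radius) // tile_size), 0)
        x1 = min(int((cx + radius) // tile_size), tiles_per_side - 1)
        y0 = max(int((cy - radius) // tile_size), 0)
        y1 = min(int((cy + radius) // tile_size), tiles_per_side - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                tiles[ty * tiles_per_side + tx].append(index)
    return tiles, tiles_per_side


capsules = [(40, 40, 4), (64, 10, 5), (-100, 20, 3)]
tiles, tiles_per_side = bin_capsules(capsules, grid_res=256, tile_size=32)
```

A grid cell at (x, y) then only has to test the capsules listed in tile
(x // tile_size, y // tile_size). The third capsule lies entirely outside the grid and
ends up in no tile.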

1.5.2 Async Compute


Being independent of any graphics stage (e.g., the depth pre-pass or the lighting pass),
fluid simulation was a natural candidate to be moved to async compute. We can dispatch
it very early in the frame, and the results are needed relatively late in the frame (as
water objects are usually rendered after all opaque objects have been rendered and lit).
Therefore, we dispatch fluid simulation on the low-priority compute pipe early in
the frame. In order to ensure that the simulation is done by the time its results are
needed, we insert a fence at the end of the simulation and a wait on this fence before
the pass where the water is rendered.

1.6 Future Work


In the future, we would like to see more fluid simulation utilized in games.
We have demonstrated that 2D fluid simulation can be used effectively on
current-generation consoles.
However, 3D fluid simulation could bring in more interesting use cases. Effects
such as fire or smoke can be simulated very realistically, including interaction between
fluid and solid objects.
Hierarchical grid approaches (as in Wroński [2014]) could be interesting to explore.
They could potentially allow fluid simulation to run in a bigger area and to be
seamlessly injected into the volumetric fog 3D texture that many game rendering engines
use, saving on raycasting when rendering fluid effects.

Acknowledgments
We would like to thank several colleagues who made this feature possible. Maximilien Faubert
came up with the original idea to utilize fluid simulation for fluid on water and did the initial
prototype. Vincent Duboisdendien and Jonathan Bard did countless code reviews and provided
helpful comments on how to improve and accelerate the method. Finally, we would like to thank
the entire Shadow of the Tomb Raider team and the Eidos Montréal and Crystal Dynamics studios
for providing us with the opportunity to work together on this game and to make this
publication happen.

Bibliography
BRIDSON, R. AND MÜLLER-FISCHER, M. 2007. Fluid Simulation. In ACM SIGGRAPH 2007
Courses. URL: https://www.cs.ubc.ca/~rbridson/fluidsimulation/fluids_notes.pdf.
BRIDSON, R., MÜLLER-FISCHER, M., AND GUENDELMAN, E. 2006. Fluid Simulation. In ACM
SIGGRAPH 2006 Courses.
URL: https://www.cs.ubc.ca/~rbridson/fluidsimulation/2006/fluids_notes.pdf.
CRANE, K., LLAMAS, I., AND TARIQ, S. 2008. Real-Time Simulation and Rendering of 3D
Fluids. In GPU Gems 3, pp. 633–675.
URL: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch30.html.
GRINSPUN, E. 2018. Animation and CGI Motion. edX.
URL: https://www.edx.org/course/animation-cgi-motion-columbiax-csmm-104x.
HARRIS, M. 2004. Fast Fluid Dynamics Simulation on the GPU. In GPU Gems, pp. 637–665.
URL: http://developer.download.nvidia.com/books/HTML/gpugems/gpugems_ch38.html.
VLIETINCK, J. 2009. Fluid simulation (DX11/DirectCompute).
URL: http://users.skynet.be/fquake/.
WROŃSKI, B. 2014. Volumetric fog: Unified, compute shader based solution to atmospheric
scattering. In ACM SIGGRAPH, Advances in the Real-Time Rendering in 3D Graphics and
Games.
ZHU, Y. AND BRIDSON, R. 2005. Animating Sand as a Fluid. In ACM SIGGRAPH Papers, pp.
965–972.
2 Real-time Snow Deformation in Horizon Zero Dawn: The Frozen Wilds
Kevin Örtegren

2.1 Introduction
Having dynamic characters and objects interact with the environment makes the scene
more immersive and alive. Typically games will have some form of foliage interaction
with the character, where leaves and grass bend out of the way when the character
moves through. Another example is ground projected decals when walking on snow.
Both of these commonly used techniques lack persistence and scalability; the foliage
will not be permanently deformed or crushed and the footsteps in snow usually have
an upper limit to the number of decals active at the same time.
Horizon Zero Dawn: The Frozen Wilds is the expansion to Horizon Zero Dawn¹,
and it takes place in a new snowy mountain region. Snow covers most of the landscape,
so we needed believable snow that solved some of the shortcomings of existing
techniques and worked under the constraints and requirements of this particular
project. Figure 2.1 gives an overview of the results we achieved.

The requirements were:

• We needed real-time snow deformation for any dynamic object.

• It had to work in a massive open world and on top of certain static objects, like
rocks and rooftops.

• It had to run very fast on the GPU, since the GPU frame was already laid out
and optimized for the base game.

• No major asset refactor could be done, both to avoid adding to the expansion
download size and because we could not spare artists to go through all the content
manually and fine-tune it for a new system.

¹ Horizon Zero Dawn™ © 2017–2018 Sony Interactive Entertainment Europe. Developed by
Guerrilla. Horizon Zero Dawn is a trademark of Sony Interactive Entertainment Europe. All
rights reserved.

(a) (b)
Figure 2.1. The final applied result versus the deformation stored in the system.
(a) Example of snow trails caused by the deformation system. (b) Debug overlay mesh showing
the actual system deformation, colored by its normal map.

2.2 Related work


A few AAA games have implemented real-time snow deformation, with various ap-
proaches, prior to this. Some notable ones with presentations and articles on the subject
include Assassin’s Creed III [St-Amour 2013], Batman: Arkham Origins, and Rise of
the Tomb Raider.
Similar to Batman: Arkham Origins, our approach renders dynamic objects ortho-
graphically from below to determine the deformation [Barré-Brisebois 2014].
Rise of the Tomb Raider implements trail elevation [Michels and Sikachev 2016],
where the snow can be elevated above the base snow height. This is something which
our approach does not support in the height data, but is instead faked in the surface
shader of the snow using different diffuse textures and normal maps.

2.3 Implementation
This section goes through the different steps of the snow deformation algorithm.
An overview block diagram depicting the algorithm is shown in Figure 2.2.

Figure 2.2. Overview of the algorithm and the needed buffers.

The algorithm consists of two passes:

1. Write orthographic height. Render all dynamic objects (characters, robots,
debris from explosions etc.) from below into a depth buffer, using an
orthographic camera. (Vertex/Pixel Shader)

2. Deform & temporal filter. Take the rendered height as input and determine for
each pixel if deformation has to be applied while simultaneously performing
temporal filtering on the persistent deformation data. The output of this shader
is both an updated version of the persistent height data, written to one of the
ping-pong buffers to be read back next frame, as well as the packed calculated
normal and height data to the result buffer. (Compute shader)

After finishing the compute shader work, the deformation data may be read from
the result buffer by any shader. The needed buffers for the algorithm are (all are
1024 × 1024 in our case):

1. 2× persistent UNORM8 buffers containing relative deformation height, used as
ping-pong buffers across GPU frames for temporal filtering.

2. 1× 16-bit depth buffer containing linear depth for the orthographic camera, only
used within the same GPU frame (can thus be aliased with other render targets).

3. 1× result buffer, UNORM10.10.10.2, which stores the packed height and normal
for other shaders to read.

Figure 2.3. Side view visualization of the deformation height above the different height maps.
This also illustrates the use of an orthographic frustum to capture dynamic object depths locally
around the player, where the deformation occurs. (Illustration not to scale.)

The relative deformation height stored in the persistent buffers and in the result buffer
represents a height relative to the baked terrain/objects/water height. See Figure 2.3 for
a visualization of this.

2.3.1 Dynamic object representation


Most objects have a fairly high vertex count when viewed up close, so rendering them into
the depth buffer would be costly. To solve this, we used lower LOD versions of objects.
The LOD is automatically selected as if the object were 30 meters away from the camera.
See Figure 2.4 for examples of meshes.

2.3.2 Write orthographic height


The first pass will do a scene query of dynamic objects within an orthographic frustum
from below the minimum terrain height, centered on the player character. The output
of this pass is a simple 16-bit depth buffer containing linear depth and it serves as the
input to the next pass.

(a) Watcher (b) Character (c) Boar
Figure 2.4. Examples of low LOD skinned meshes used for rendering the dynamic object
heights.

To ensure that small objects have enough depth samples rendered, we render the
depth using nonuniform xy axes in NDC. The nonuniformity is achieved by the
equation $P = P_{xy}^{ndc}\,\lvert P_{xy}^{ndc}\rvert^{x}$, where $P_{xy}^{ndc}$ is the
original post-projected xy coordinate in $[-1, 1]$, and $x$ is the power of the
distribution, limited between 0.8 (extreme distortion) and 0.0 (linear). While the depth
is linear, the normalized device coordinate (NDC) xy axes are not. This acts as a
dynamic level-of-detail, where objects close to the u and v center axes in texture space
get more depth samples. To read back the depth texture, the same function must be
applied to the sampling coordinate; see Listing 2.1 for a helper function in HLSL.
Shown in Figure 2.5 is the comparison between uniform distribution and the dis-
tribution we used in The Frozen Wilds. Notice how in Figure 2.5(b) our main character
Aloy (in the center) is using many more samples in the depth texture than in Fig-
ure 2.5(a). Aloy’s feet would normally only be a few samples in the depth render, which
would not be enough for detailed deformation.

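float2 TransformToNonUniformUV(float2 uv, float x)
{
    // Map UV in [0, 1] to NDC in [-1, 1].
    float2 ndc_coord = uv * 2.0 - float2(1.0, 1.0);
    // Scale each axis by |p|^x; the sign of p is preserved.
    ndc_coord *= pow(abs(ndc_coord), x);
    // Map NDC back to UV in [0, 1].
    return ndc_coord * 0.5 + float2(0.5, 0.5);
}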

Listing 2.1. Shader function to convert from uniform normalized UV coordinates to non-uni-
form normalized UV coordinates.
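
To make the mapping easier to inspect off-GPU, the following is a minimal CPU-side sketch in Python of the function in Listing 2.1. The inverse mapping is not part of the original text; it is included only as an illustration and follows algebraically from p' = p|p|^x, which gives p = sign(p')|p'|^(1/(1+x)). All function names here are hypothetical.

```python
import math


def transform_to_nonuniform_uv(u, v, x):
    """Per-axis port of TransformToNonUniformUV: p -> p * |p|**x in NDC."""
    def warp(t):
        p = t * 2.0 - 1.0        # UV [0, 1] -> NDC [-1, 1]
        p *= abs(p) ** x         # nonuniform remap, sign of p preserved
        return p * 0.5 + 0.5     # NDC -> UV
    return warp(u), warp(v)


def inverse_nonuniform_uv(u, v, x):
    """Illustrative inverse (an assumption, not from the chapter)."""
    def unwarp(t):
        p = t * 2.0 - 1.0
        p = math.copysign(abs(p) ** (1.0 / (1.0 + x)), p)
        return p * 0.5 + 0.5
    return unwarp(u), unwarp(v)
```

The center and the corners of the texture are fixed points of the mapping for any exponent, and an exponent of 0.0 leaves every coordinate unchanged, which matches the "linear" case described above.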

(a) 0.0 exponent (b) 0.3 exponent


Figure 2.5. Non uniform distribution example.

2.3.3 Read object height, deform, and filter


This pass is run using a compute shader which does most of the heavy lifting of this
technique. First of all, each compute thread in a thread group reads back the value
from the previous frame and stores it in local data store (LDS). A one-texel skirt is
also read into LDS to allow for efficient neighborhood sampling. Temporal filtering
is performed by using a min-average 3 × 3 filter on the results in LDS. The filter is time
dependent and will converge towards a specified slope gradient value. A comparison
between a filtered and unfiltered deformation can be seen in Figure 2.6.
(a) Filtered deformation (b) Unfiltered deformation
Figure 2.6. Comparison between a filtered and unfiltered deformation, viewed from the side.
The deformation was made by a character standing still in 1 meter snow.

Once the filtered result has been calculated we can sample the depth of the dynamic
objects from the depth buffer. We can early out if the depth buffer contains a far
plane value (1.0), which means no object was written to that location. If the read depth
value is above the baked terrain/object/water height and below the current snow height,
we apply it as the new snow height. See Figure 2.3 for a visualization of this in action. The
normal is calculated using finite differences between samples in the 3 × 3 filter. The new
height value and normal are packed and written to the UNORM10.10.10.2 output
buffer. We pack the world space normal using Lambert Azimuthal Equal-Area Projection,
since we know that the Z component will always be positive (pointing up in world
space). The height stored represents the relative snow deformation, 0 meaning
untouched snow and 1 meaning fully deformed down to the height data below it (terrain,
rocks, water etc.). The new result is also written to the next persistent buffer for input
to the next frame, but before that is done we subtract the snow refill, which is based on
the current precipitation rate of the game.
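
The per-texel logic described above can be sketched on the CPU as follows. This is a simplified Python illustration under stated assumptions, not the shipped shader: the temporal min-average filter is omitted, the linear depth is assumed to map to world height through a hypothetical camera height and depth range, and the normal encoding shown is one common formulation of the Lambert azimuthal equal-area mapping (the exact packing used in the game is not given in the text).

```python
import math

FAR_PLANE = 1.0  # depth value meaning "no object was rendered here"


def deform_texel(prev_deformation, object_depth, base_height, snow_depth,
                 camera_height, depth_range, refill):
    """Return the new relative deformation (0 = untouched, 1 = fully deformed)."""
    deformation = prev_deformation
    if object_depth < FAR_PLANE:
        # Assumed mapping from linear depth to world height (camera looks up).
        object_height = camera_height + object_depth * depth_range
        snow_height = base_height + (1.0 - deformation) * snow_depth
        if base_height < object_height < snow_height:
            deformation = 1.0 - (object_height - base_height) / snow_depth
    # Snow refill moves the value back towards 0 (untouched snow).
    return min(1.0, max(0.0, deformation - refill))


def encode_normal(n):
    """Lambert azimuthal equal-area encoding of a unit normal to two values."""
    x, y, z = n
    f = math.sqrt(8.0 * z + 8.0)
    return x / f + 0.5, y / f + 0.5


def decode_normal(enc):
    """Inverse of encode_normal."""
    fx, fy = enc[0] * 4.0 - 2.0, enc[1] * 4.0 - 2.0
    f = fx * fx + fy * fy
    g = math.sqrt(1.0 - f / 4.0)
    return fx * g, fy * g, 1.0 - f / 2.0
```

With 1 m of snow above a base height of 0, an object whose underside is 0.25 m above the base yields a relative deformation of 0.75, and an object above the current snow surface leaves the texel unchanged.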

2.3.4 Sliding window


Since we needed the technique to work in an open world, we had to move the localized
system with the player and fade out the results at the edges. By trial and error we found
that a 64 × 64 m region around the player was enough to give a convincing persistence
to the deformation. The sliding is done by reading samples from the previous buffer
using an offset when the system has moved one or more texels. Figure 2.7 shows this
in action.
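
As a rough illustration of the offset read, the Python sketch below resamples a previous-frame buffer after the window has moved by a whole number of texels. It assumes the window origin is snapped to the texel grid and that texels entering the window start as untouched snow (0); the edge fade mentioned above is left out. The function names are hypothetical.

```python
import math


def texel_offset(old_center, new_center, texel_size):
    """Whole-texel movement of the window between two frames (snapped to the grid)."""
    return tuple(math.floor(n / texel_size) - math.floor(o / texel_size)
                 for o, n in zip(old_center, new_center))


def slide_previous(prev, offset_x, offset_y, fill=0.0):
    """Read the previous buffer with an offset; texels with no history get `fill`."""
    size = len(prev)
    out = [[fill] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            sx, sy = x + offset_x, y + offset_y
            if 0 <= sx < size and 0 <= sy < size:
                out[y][x] = prev[sy][sx]
    return out
```

For a 2 × 2 buffer and a move of one texel along x, the right column of the old data becomes the left column of the new data, and the new right column starts out as untouched snow.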

2.3.5 Applying results


The result buffer is read by the terrain shader, which deforms the terrain height and blends
in the normals. We created a shader graph node for sampling the deformation height
and normal in any artist-created vertex or pixel shader. This led to more than just the
terrain using the deformation. Snow lumps, as in Figure 2.8, and other
snowy assets sampled the deformation and responded to dynamic object interaction.
This is achieved by having the actual deformation height of the system differ from the
rendered snow height. In our case, the snow height of the deformation system is at 1
meter, but the visual snow layer is only about 30 cm deep. This allows shaders to know
about deformation which occurs above the snow height and to apply it as deformation
to snow lumps sticking out of the snow layer.

Figure 2.7. Illustration of the deformation system during a move. The current texture reads
back previous frame data using an offset, depending on how much the system moved. Textures
in this example are 2 × 2 texels.

Figure 2.8. Example of snow lumps and piles, which respond to the deformation, placed by our
procedural placement system.
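
One way to read the difference between the 1 m system height and the roughly 30 cm visual layer is sketched below in Python. This is an interpretation, not the game's shader graph: it assumes that a snowy surface is simply clamped to the height of the deformed system surface, so anything taller than that surface gets pushed down. The constants and the function name are hypothetical.

```python
SYSTEM_SNOW_DEPTH = 1.0   # height of the deformation system above the base, in meters
VISUAL_SNOW_DEPTH = 0.3   # approximate depth of the rendered snow layer, in meters


def deformed_surface_height(surface_height, deformation):
    """Clamp a snowy surface (height above base) to the deformed system surface.

    deformation is the relative value from the result buffer:
    0 = untouched, 1 = fully deformed down to the base.
    """
    system_surface = (1.0 - deformation) * SYSTEM_SNOW_DEPTH
    return min(surface_height, system_surface)
```

Under this reading, a deformation of 0.5 leaves the 30 cm snow layer untouched but lowers a 60 cm snow lump to 50 cm, which is the behavior the paragraph describes for assets sticking out of the snow.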
One interesting thing which spawned from this system was interactive thin
ice/snow slush on the surface of lakes, as shown in Figure 2.9. Using the shader node
to sample the deformation data, our artists could make this happen in the water surface
shader. Since objects below the terrain/object/water height do not contribute to the de-
formation, it was even possible to stealth swim below the surface without disturbing it.

2.4 Results
With a 6464 m region around the player, we achieve the desired quality using
1024 1024 buffers, which adds up to 8 MB of VRAM. This gives us a persistent de-
formation buffer with 6.25 cm resolution, which roughly matches our inner most ter-
rain LOD mesh resolution. An example of the result buffer can be seen in Figure 2.10.
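
The 8 MB and 6.25 cm figures follow directly from the buffer list in Section 2.3. The short Python calculation below reproduces them, assuming 1 byte per texel for UNORM8, 2 bytes for the 16-bit depth buffer, and 4 bytes for the UNORM10.10.10.2 result buffer.

```python
SIZE = 1024                      # buffer width and height in texels
REGION_METERS = 64.0             # side length of the region around the player

texels = SIZE * SIZE
persistent_bytes = 2 * texels * 1    # 2x UNORM8 ping-pong buffers
depth_bytes = texels * 2             # 1x 16-bit depth buffer
result_bytes = texels * 4            # 1x UNORM10.10.10.2 result buffer

total_bytes = persistent_bytes + depth_bytes + result_bytes
total_mib = total_bytes / (1024 * 1024)
texel_size_cm = REGION_METERS / SIZE * 100.0
```

The persistent buffers and the depth buffer contribute 2 MB each and the result buffer 4 MB, and each texel covers 64 m / 1024 of the region.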

Figure 2.9. Aloy swimming and interacting with the layer of snow slush on the surface of the
cold water.

Figure 2.10. An example of the result buffer after a battle with a few Watcher robots.

The GPU cost of this technique is broken down into three parts: clearing the depth
buffer, rendering the dynamic objects, and running the deformation compute shader. The
cost of rendering the dynamic objects depends on their number and complexity, but in
general it takes 1 µs per object. The timings for the other two passes can be found in
Table 2.1. The minimal cost of using the result of the deformation shader is one 32-bit
texture sample and the decoding of the height and normal.

            Clear depth    Deform compute shader
PS4         21 µs          210 µs
PS4 Pro     21 µs          105 µs

Table 2.1. GPU timings for the fixed cost shaders.

2.5 Conclusion and Discussion


All the initial requirements were met, but in doing so a few shortcuts had to be taken.
For instance, since we couldn’t do a major asset refactor, the input meshes to the system
were all low-resolution dynamic shadow casting LODs, many of which were auto-generated.
These meshes worked reasonably well, but having custom meshes, shapes, or points as
input would be preferable. To create deformations from invisible objects
(like melting snow with the flame thrower), artists had to create a dynamic shadow
casting mesh with a huge depth bias (so that it would not cast an actual shadow) and
slowly lower it into the ground to simulate melting snow. That was an example of a
creative solution to the fact that we were limited to using shadow casting meshes.
Using a local system around the player with a sliding window allowed us to seem-
ingly run the system anywhere in the world, satisfying our goal for it to work in a mas-
sive open world. Since the system runs on top of baked static objects like houses or
rocks, there would be overhangs without any snow deformation below them. These
cases would be solved by artists not allowing for deep snow under overhangs. Perfor-
mance was acceptable and fit well into our budget thanks to relying on temporal filter-
ing with the downside of slightly more memory usage and a frame delay in the
interaction.
Since the system outputs the results in world space and a shader graph node was
created for easy access to the data by artists, other use cases spawned from this, like
the interactive snow slush on lakes and general snow lump assets being deformed by
the system.
Going forward, it would be interesting to explore the introduction of trail elevation,
similar to what was done in Rise of the Tomb Raider, and other physical behaviors of
snow.
Having only one layer of deformable snow proved to be a limitation which would
be worth trying to solve, maybe by having multiple deformation layers which could
interact with each other. Snow from an upper layer (e.g., on a roof top) could fall down
and get added to a snow deformation layer below.
Even though the size of the deformation area gave a plausible result, it would be
interesting to explore a larger scale, more persistent solution using a combination of
“pre-deformed” data, either coming from streamed in data or from our procedural
placement system, and streaming out real-time deformation results to secondary stor-
age to later stream it back in.
One of the more interesting aspects going forwards is repurposing this technique
to allow for interaction with other environmental assets, like vegetation, sand, mud,
water. Rendering depth from above could allow a dynamic precipitation occlusion sys-
tem to be spawned from this.

Bibliography
BARRÉ-BRISEBOIS, C. 2014. Deformable Snow Rendering in Batman: Arkham Origins. In
Game Developers Conference 2014. URL: https://www.gdcvault.com/play/1020379/
Deformable-Snow-Rendering-in-Batman.
MICHELS, A. AND SIKACHEV, P. 2016. Deferred Snow Deformation in Rise of the Tomb Raider.
In GPU Pro 7, pp. 3–16. CRC Press.
ST-AMOUR, J. 2013. Rendering Assassin’s Creed III. In Game Developers Conference 2013.
URL: https://www.gdcvault.com/play/1017710/Rendering-Assassin-s-Creed.
III
Shadows

Shadows are the dark companions of lights, and although both can exist on their own,
they shouldn’t exist without each other in games. Achieving good visual results when
rendering shadows is considered one of the particularly difficult tasks of graphics
programmers.
The first article, “Soft Shadow Approximation for Dappled Light Sources” by
Mariano Merchante, proposes to mimic an effect called dappled light. This type
of effect occurs when, for example, light shines through leaves that are very close
together: a small patch of light can travel through them, projecting the sun shape
onto the shadow receiver.
The second article, “Parallax-Corrected Cached Shadow Maps” by Pavlo
Turchyn, is the successor to another great shadow map article in GPU Pro 2 by the same
author. This article describes a parallax correction algorithm for rendering sweeping
shadows from a dynamic light source using a static shadow map. The resulting
implementation uses Cascaded Shadow Maps up to a distance of 30 meters from the camera
and after that Adaptive Shadow Maps covering the next 500 meters range and updated
every 2500 frames. The use of the parallax correction algorithm enables a fairly
seamless transition between dynamic shadows rendered with these two methods.

—Mauricio Vives

1
III

Soft Shadow Approximation for Dappled Light Sources
Mariano Merchante

1.1 Introduction
Common shadow rendering techniques rely on solving the visibility problem through
buffers that contain information related to the distance to the light source. These buffers
are commonly referred to as shadow maps [Williams 1978], and although the results can
be filtered with a wide variety of algorithms to get smooth penumbra effects, filtering
is expensive and does not consider either the size or shape of the light source. Well
known examples of filtering approaches include percentage-closer filtering [Reeves et
al. 1987], percentage-closer soft shadows [Fernando 2005], variance shadow maps
[Donnelly and Lauritzen 2006] and moment shadow maps [Peters and Klein 2015]. A
more recent approach uses raytracing and denoising to estimate the penumbra gener-
ated by a polygonal area light [Heitz et al. 2018], but requires a complex rendering
pipeline setup that may not be available to most real-time engines.
Given the complexity of analytically approximating the penumbra generated by
area lights, these real-time techniques usually ignore the pinhole effect that certain
high frequency objects can generate when lit by such lights. This article proposes an
approximation of this effect, which can be seen working in Figure 1.1. In photography,
this phenomenon is called dappled light, and is very characteristic of tree shadows:
when leaves are very close together, a small patch of light can travel through them while
essentially projecting the sun shape onto the shadow receiver. Moreover, it is
particularly evident in the crescent shadows that appear during a solar eclipse, such as
those shown in Figure 1.2. Minnaert [1937] offers a comprehensive introduction to the
subject matter.
This phenomenon does not only happen on perfectly infinitesimal holes in depth-space,
as it is an artifact of complex visibility functions. However, the effect can be
approximated by just identifying these points and using subtractive masking on top of
any other conventional technique. This article will describe a simple and practical
algorithm that uses an arbitrary shape texture to represent the projected light shape.
Moreover, the majority of the examples presented here will be related to tree shadows,
given that it is one of the most prominent cases of this phenomenon while also being
very common in games and real-time engines. Other situations where this effect occurs
are metal gratings, woven patterns and certain furniture.

Figure 1.1. A sample scene using the proposed technique, inspired by Tufte’s [1997] sculptures.

Figure 1.2. Examples of dappled light in nature. Left: dappled light on a road. Right: Crescent
shadows during a solar eclipse. (Source: Wikipedia)

1.1.1 Algorithm Overview


The technique can be subdivided into the following steps, which we will discuss in the
following sections.
1. Store the scene’s shadow map.
2. For each pixel on the shadow map:
   a. Identify pinholes.
   b. Store pinholes in a uniform grid compute buffer.
3. For each grid cell in the compute buffer:
   a. Accumulate newly identified pinholes and merge with old.
   b. Flatten linked lists into grid cell arrays.
4. When rendering each shadow-receiving point:
   a. Iterate over the uniform grid.
   b. Iterate over every pinhole within a radius and accumulate contribution.
   c. Estimate a shadow factor computed by any common shadow filtering technique.
   d. Combine the accumulated contribution with the estimated shadow factor
      through subtraction.

1.2 Detecting Pinholes


We define pinholes as points in a shadow map that have a substantial depth
discontinuity with respect to their neighbors, i.e., they are outliers in a predefined
range in UV space. An additional constraint is that these points have to be further away
from the light than the neighbor cluster, or else they would become blockers.
Specifically, we can identify pinholes by calculating the depth mean and variance
in the shadow map, and storing points that differ wildly from the neighbor mean, with a
threshold proportional to its variance. A naive implementation can be seen in
Listing 1.1, and an example shadow map with detected pinholes can be found in Figure 1.3.
Note that it is important to store both the raw distance to the light and the average depth
of the neighbor pixels, as it will prove useful in the next sections.
A more robust approach for pinhole detection includes differentiating samples that
fall within an expected radius and computing statistical values to estimate the
probability that the points inside the radius represent a pinhole of that size, as shown in
Figure 1.4. A set of thresholds is used to bound the variances, and by making these
constraints stronger we can prevent finding false positives, such as points on the edge
of the projected shape. More specifically, a pinhole is identified if the following applies:

• The variance on the outer set, B, is smaller than a specified threshold.

• A percentage of the points in the inner set, A, have greater depth than the mean
of set B plus its standard deviation multiplied by some constant.

A smoother version of this idea can be implemented with a sample weight defined
by distance; this helps by giving more importance to the center samples, thus reducing
the total number of detected pinholes.
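
The two conditions above can be sketched on the CPU as follows. This Python example is only an illustration of the unweighted test: the depth map is a plain 2D list, the inner set A and outer set B are split by distance to the candidate texel, and every parameter name (`inner_r`, `outer_r`, `max_outer_var`, `k`, `min_fraction`) is a hypothetical stand-in for the thresholds mentioned in the text.

```python
import math
from statistics import mean, pvariance


def is_pinhole(depth, cx, cy, inner_r, outer_r, max_outer_var, k, min_fraction):
    """Test texel (cx, cy) of a 2D depth map using the inner/outer set criteria."""
    inner, outer = [], []
    for y in range(cy - outer_r, cy + outer_r + 1):
        for x in range(cx - outer_r, cx + outer_r + 1):
            if not (0 <= y < len(depth) and 0 <= x < len(depth[0])):
                continue
            d = math.hypot(x - cx, y - cy)
            if d <= inner_r:
                inner.append(depth[y][x])    # set A
            elif d <= outer_r:
                outer.append(depth[y][x])    # set B
    if not inner or not outer:
        return False
    var_b = pvariance(outer)
    if var_b >= max_outer_var:               # B must have low variance
        return False
    limit = mean(outer) + k * math.sqrt(var_b)
    fraction = sum(1 for v in inner if v > limit) / len(inner)
    return fraction >= min_fraction          # enough of A lies behind B
```

A single far texel surrounded by a flat, near neighborhood passes the test, while a flat region or a texel that is closer to the light than its neighbors (a blocker) does not.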

bool FindPinhole(float2 uv)
{
    float centerDepth = ShadowMap.SampleLevel(sampler, uv, lod).r;

    // Calculate average depth and std deviation of neighbors
    float averageDepth = ...
    float std = ...

    // Now calculate them again, including the center sample
    float stdCenter = ...
    float avgCenter = ...

    if (averageDepth > 0.0)
    {
        std /= averageDepth;
        stdCenter /= avgCenter;
    }

    return stdCenter > threshold && std < threshold;
}

Listing 1.1. A naive implementation that selects pinholes when neighbor pixels are very similar
and the center pixel is an outlier with a certain threshold.

Figure 1.3. Left: The core concept of this technique is finding pinholes a) and b) and their
estimated distances to the light, and then projecting textures based on the distance to the receiv-
ing surface. Right top: The original shadow map. Right bottom: The identified pinholes that will
project the light shape, shown here as crosshairs.

Figure 1.4. Left: The neighborhood segmentation based on radius. Right: An example of a
possible set of depth samples at a cross section of this neighborhood. Ideally, we would desire
the cavity to be close in radius to our expected radius r.

The outer neighborhood B has to have low variance so that we can then approxi-
mate the estimated pinhole depth by using the average depth of the neighborhood.
Cases where there’s too much noise on the outer neighborhood must be ignored, as
there’s no clear way to approximate the combination of light spills happening.

1.2.1 Scatter Versus Gather


In a similar approach to percentage-closer soft shadows (PCSS), we separate the tech-
nique into a searching step and a rendering step. However, in contrast to PCSS, it is
possible to search these points just once on the shadow map, and not per pixel being
shaded. Moreover, the shadow map can also be downsampled to accelerate the search,
although it might affect search accuracy. This search can also be masked in a way that
only updates sections of the shadow map that occlude visible surfaces in the screen, as
we will discuss in Section 1.4.3.
The projected light shape size is proportional to the distance from each shadow
receiver point to each pinhole. Thus, it is computationally complex to search for pin-
holes for each shaded point (i.e., a gather operation), as the projected size can be both
spatially incoherent and very big. It is desirable to first use a scatter approach, in which
we precompute all the pinholes and store them into a uniform grid, and then iterate
them efficiently as needed on the receiving shaders. An analogy can be made with
depth of field techniques that splatter sprites that represent the aperture [Pettineo and
de Rousiers 2012], as they solve a similar scatter vs. gather problem.
To achieve this there are multiple options, but a straightforward implementation
uses a compute shader to search pinholes and store them into concurrent linked lists
on a compute buffer. This approach is based on Yang et al. [2010], which is generally
used for order-independent transparency. It also exploits the fact that, in general,
pinholes are sparsely distributed, so these lists can be reasonably small for later inspection.
We can build a grid that uniformly subdivides the shadow map and contains these
lists of samples, as shown in Listing 1.2. However, the actual points are distributed
throughout a global buffer in a very noncoherent way. To improve coherency, an
intermediate compute shader flattens these lists into flat arrays on a separate compute buffer,
which the receiving shadow shader uses. Additional data can be stored, such as the
“intensity” of this pinhole, which we’ll discuss in Section 1.4.1. Finally, the actual
count of samples is stored in a separate buffer so that iterating through these is easier.
Figure 1.5 shows how the buffers are organized.

uniform RWStructuredBuffer<PinholeLink> g_PinholeLinkBuffer;
uniform RWByteAddressBuffer g_OffsetBuffer;

[numthreads(32,32,1)]
void ComputePinholes(uint2 id : SV_DispatchThreadID)
{
    float2 uv = ...
    bool foundPinhole = FindPinhole(uv, id, ...);
    if (foundPinhole)
    {
        uint newIndex = g_PinholeLinkBuffer.IncrementCounter();
        if (newIndex >= TotalMaxPinholeCount)
            return;

        // Get cell position from uvs
        uint2 cellPos = ...
        uint offset = cellPos.y * gridSize + cellPos.x;
        uint prevIndex;

        // Atomic swapping
        g_OffsetBuffer.InterlockedExchange(offset * 4,
            newIndex, prevIndex);

        // Store detected pinhole information
        PinholeLink link = ...;
        g_PinholeLinkBuffer[newIndex] = link;
    }
}

Listing 1.2. An example implementation of the linked list generation. This approach is very
similar to Yang et al. [2010].
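
To show how the linked lists and the flattening step fit together, here is a sequential CPU-side sketch in Python. It is not the GPU implementation: the atomic counter and the atomic exchange are replaced by ordinary list operations, the `END` sentinel and the payload layout are assumptions, and it presumes that each link stores the previous head index returned by the exchange.

```python
END = 0xFFFFFFFF  # assumed sentinel meaning "no further link"


def build_lists(pinholes, grid_size, max_links):
    """pinholes: iterable of (u, v, payload) with u, v in [0, 1)."""
    heads = [END] * (grid_size * grid_size)   # plays the role of g_OffsetBuffer
    links = []                                # plays the role of g_PinholeLinkBuffer
    for u, v, payload in pinholes:
        if len(links) >= max_links:
            break
        cell = int(v * grid_size) * grid_size + int(u * grid_size)
        new_index = len(links)                # stands in for IncrementCounter()
        prev_index = heads[cell]              # stands in for InterlockedExchange
        heads[cell] = new_index
        links.append((payload, prev_index))
    return heads, links


def flatten(heads, links, per_cell):
    """Copy each cell's list into a fixed-size array slice plus a count buffer."""
    cells = [None] * (len(heads) * per_cell)
    counts = [0] * len(heads)
    for cell, index in enumerate(heads):
        while index != END and counts[cell] < per_cell:
            payload, index = links[index]
            cells[cell * per_cell + counts[cell]] = payload
            counts[cell] += 1
    return cells, counts
```

After flattening, the receiving shader only needs the cell index to find a contiguous run of pinholes and its length, which is the coherency benefit described above.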

Figure 1.5. Left: The shadow map is subdivided into a global buffer with linked lists and a
separate buffer that indicates each cell’s first link. Right: The required buffers. From top to bot-
tom: 1) The global linked list data, which uses an atomic counter along with a 2) buffer that
maintains the list starting element. 3) The coherent uniform grid buffer; each cell has a maximum
amount of pinholes, but 4) for reducing the amount of iterations, a size buffer is used.

1.3 Shadow Rendering


When rendering the shadow-receiving points, a search is made on the uniform grid
buffer with a predefined maximum radius, and the contribution of each precomputed
pinhole to the shaded point is calculated. To do this, we can estimate that each projected
pinhole has a size proportional to the distance towards the receiver and the solid angle
subtended by the area light. If the receiver is in the pinhole’s projected range, we can
sample an arbitrary shape texture with the local offset as UVs. This shape texture can
easily be animated for interesting effects, such as an eclipse, and can also be mipmapped
if necessary. In the case of a distant light like the sun, it is possible to approximate
the pinhole size $S_p$ by just using a constant $S_L$ representing the size of the area
light, as Equation (1.1) shows. It is useful to use said constant to exaggerate the effect.

$$S_p = (d_r - d_p)\, S_L \tag{1.1}$$

Here, $d_r$ is the distance from the receiving point to the light, and $d_p$ is the estimated
average distance of the pinhole’s neighborhood to the light. $d_p$ is used to provide a good
estimate of the pinhole’s abstract position.
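
As a small worked example of Equation (1.1), the Python sketch below computes the projected size and then derives a lookup coordinate into the shape texture. The UV mapping is an assumption made for illustration: it treats $S_p$ as the full width of the projected shape in shadow-map UV units and centers the shape on the pinhole. The function names are hypothetical.

```python
def pinhole_size(d_r, d_p, s_l):
    """Equation (1.1): projected size from receiver depth, pinhole depth, light size."""
    return (d_r - d_p) * s_l


def shape_uv(receiver_uv, pinhole_uv, size):
    """Map a receiver position to the shape texture's [0, 1]^2, or None if outside."""
    if size <= 0.0:
        return None
    su = (receiver_uv[0] - pinhole_uv[0]) / size + 0.5
    sv = (receiver_uv[1] - pinhole_uv[1]) / size + 0.5
    if 0.0 <= su <= 1.0 and 0.0 <= sv <= 1.0:
        return su, sv
    return None
```

With $d_r = 0.75$, $d_p = 0.25$ and $S_L = 0.5$, the projected size is 0.25; a receiver directly behind the pinhole samples the center of the shape texture, and receivers further than half that size from the pinhole are outside its projection.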
Using a single shape texture for all pinholes may be limiting, because it implies
that the source light is infinitely far away and thus the projected shape is always similar,
regardless of size. Because of this, it is also possible to extend this method to use
differently shaped textures depending on the direction towards the light, in a similar way to
view-dependent impostors [Brucks 2018]. This can be useful if the light has an
unconventional shape (e.g., a star polyhedron) and is very close to the occluders.

Finally, if the light shape is simple enough, it is possible to define it with a function
of the pinhole’s local UV space. For example, a solar eclipse can be described with
two overlapping circles, and the subtracting inner circle’s position can be driven by a
function of time. This may reduce texture lookups drastically and improve performance
if the shape function is computationally cheap. The local pinhole UV can also be ro-
tated or distorted easily through a matrix multiplication or any kind of domain warping
for specific scenarios, e.g., for shadows seen through animated water.
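
The eclipse example can be written as a small procedural shape function. The Python sketch below is one possible formulation, with assumed radii and an assumed linear motion of the occluding circle across the sun; it returns 1.0 where light passes and 0.0 elsewhere, in the pinhole's local UV space.

```python
import math


def eclipse_shape(u, v, t, sun_radius=0.45, moon_radius=0.45, period=10.0):
    """Binary light shape: the sun disc minus a moon disc that slides across it."""
    in_sun = math.hypot(u - 0.5, v - 0.5) <= sun_radius
    # The moon center moves horizontally from fully left to fully right of the sun.
    phase = (t % period) / period
    moon_x = 0.5 + (phase * 2.0 - 1.0) * (sun_radius + moon_radius)
    in_moon = math.hypot(u - moon_x, v - 0.5) <= moon_radius
    return 1.0 if in_sun and not in_moon else 0.0
```

At the start of the period the full disc is lit, halfway through it is fully covered, and in between the function produces the crescent shapes seen in Figure 1.2 without any texture lookup.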

1.3.1 Handling Occlusion


Because the pinhole can still be occluded by other elements, we need to save the raw
depth at the pinhole position, regardless of its neighbor depth values. With this depth,
it is simple to check whether the current shadow receiver has direct visibility within a
predefined margin, and the result can be transitioned smoothly if necessary, as Figure 1.6
shows. Although the raw depth can be sampled from the unfiltered shadow map when
iterating through the pinhole buffer, it is more efficient to sample it when detecting the
pinholes, as it only has to be sampled once per pinhole. The resulting lookup function
that samples the texture can be seen in Listing 1.3.

Figure 1.6. Pinholes can leak through occluders. To prevent this, we must still store the actual
depth (not its estimated neighborhood average) at the pinhole’s position in the depth map,
so we can use that to prevent leaking. Left: No occlusion. Right: occlusion considered.

float CollectPinholes(uint2 cell, float2 uv, float depth)
{
    uint bufferIndex = cell.y * PinholeGridSize + cell.x;
    int count = g_PinholeCountBuffer.Load(bufferIndex);
    float accum = 0.0;

    int bufferOffset = bufferIndex * PinholesPerCell;

    for (int i = 0; i < count; ++i)
    {
        PinholeData pinhole = g_PinholeBuffer[bufferOffset + i];
        float occlusion = depth - pinhole.rawDepth;

        if (occlusion < DEPTH_PROXIMITY)
        {
            // d_r = depth
            // d_p = pinhole.meanDepth
            float CoC = abs(depth - pinhole.meanDepth);
            CoC = saturate(CoC) * _BokehSize;

            float d = distance(pinhole.position, uv);

            float2 shapeUV = CalculateUV(uv, pinhole.position, CoC);

            accum += SampleLightShape(shapeUV);
        }
    }

    return accum;
}

Listing 1.3. The pinhole rendering code. This method is called over every uniform grid cell
inside a maximum predefined radius.

1.4 Temporal Filtering


Since pinholes are very evident when rendering a shape that has high-frequency details
in light space, it is not surprising that the technique is highly susceptible to changes in
the shadow map. This can happen due to occluder or light animations and is, in a way,
similar to what happens in nature. But because the shadow map is just a discrete
representation of the depth towards the light, pinholes can appear and disappear
sporadically and suffer large jumps in position and size, generating strong aliasing artifacts.
Ignoring these details will generate a very noisy and distracting temporal pattern that
can take away from the desired effect, so it is essential to design a temporal filter that
at least mitigates the effect, unless the user desires to render just a still image or a static
scene.

1.4.1 Accumulation and Decay


Given that pinholes are an emergent feature of the visibility function of each scene, it
is hard to predict how they move in time. A very naive approach to solving this problem
is to keep track of the previous frame’s uniform grid and accumulate new points into it,
while decreasing the intensity of pinholes from previous frames. If a pinhole’s intensity
falls below a certain threshold, the pinhole is discarded. This requires tracking an
additional per-pinhole variable that represents intensity.
This works well visually, but has the side effect that if pinholes are moving fast in a
scene, the accumulation cannot catch up and the grid becomes saturated with decaying
points, decreasing performance because of the high number of iterations required per
shaded pixel.
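
The accumulate-and-decay step for a single grid cell can be sketched on the CPU as follows. This is a minimal illustration, not the chapter's implementation: the `Pinhole` record, the `decay`, `threshold`, and `capacity` parameters, and the policy of keeping the brightest entries when a cell overflows are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pinhole:
    x: float
    y: float
    intensity: float = 1.0  # hypothetical per-pinhole intensity used for decay


def accumulate_and_decay(previous, detected, decay=0.8, threshold=0.05, capacity=64):
    """Fade last frame's pinholes, drop the faint ones, then add new detections."""
    kept = []
    for p in previous:
        faded = p.intensity * decay
        if faded >= threshold:  # pinholes below the threshold are discarded
            kept.append(Pinhole(p.x, p.y, faded))
    merged = kept + [Pinhole(p.x, p.y, 1.0) for p in detected]
    # Assumed overflow policy: a cell has a fixed capacity, keep the brightest.
    merged.sort(key=lambda p: p.intensity, reverse=True)
    return merged[:capacity]
```

With fast-moving pinholes, most of the list consists of fading entries, which is the saturation problem described above; a lower `decay` or a higher `threshold` trades temporal smoothness for shorter lists.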

1.4.2 Merging
In addition to temporal accumulation, it is also desirable to merge any pair of points
that fall within a defined radius. When merging two pinholes, we combine their positions
using linear interpolation, letting the developer control the result by choosing a bias
between new and old points. Merging can be executed while flattening the linked list
generated from the pinhole detector, but it requires $O(N^2)$ iterations over the pinhole
buffer. This technique also forces the user to do more bookkeeping, as multiple buffers
with different structures need to be kept and traversed.
This filtering technique can cause pinholes to slide and swim through the shadow
map, which can visually break the phenomenon. Thus, it is important to define reason-
able parameters that limit the amount of movement a pinhole can have, and to carefully
merge with neighbor cells. If moving pinholes (Figure 1.7) are not properly merged,
there will be clear artifacts at the edges of cells where pinholes will stop moving and
just decay until discarded.
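
The pairwise merge can be sketched as below. This is an illustrative CPU version under stated assumptions: points are plain `(x, y)` tuples in shadow-map UV space, the function name and the `radius` and `bias` parameters are hypothetical, and each old point is merged with at most one new point.

```python
import math


def merge_pinholes(old_points, new_points, radius=0.02, bias=0.5):
    """Merge each new point with the closest unmerged old point within `radius`.

    bias = 0 keeps the old position, bias = 1 snaps to the new one.
    Returns the merged points and the old points that found no partner.
    """
    remaining = list(old_points)
    merged = []
    for nx, ny in new_points:
        best, best_dist = None, radius
        for i, (ox, oy) in enumerate(remaining):  # quadratic pair search
            dist = math.hypot(nx - ox, ny - oy)
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is None:
            merged.append((nx, ny))
        else:
            ox, oy = remaining.pop(best)
            # Linear interpolation between the old and new positions.
            merged.append((ox + (nx - ox) * bias, oy + (ny - oy) * bias))
    return merged, remaining
```

To avoid the cell-edge artifact discussed above, the old points passed in would have to include those of neighboring cells, not only the cell that the new pinhole falls into.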

Figure 1.7. A pinhole has moved from one cell to another in this frame, and should be merged.
If the merging does not occur, the previous pinhole will decay at the edge of the cell, generating
an unwanted artifact.

1.4.3 Screen-space Masking


It is also possible to do the search and filtering in screen space. For example, an initial
mask can be generated based on primary visibility, where only regions of the shadow
map that intersect with visible geometry are marked. Then we can search for pinholes
and merge/accumulate safely within that mask, possibly with some relaxation of its
boundaries so that pinholes don’t appear and disappear at the edges. This concept can
also be implemented without filtering, as it would help reduce the complexity of pin-
hole search for very big shadow maps or lists of cascades that don’t necessarily con-
tribute to the visible range.
This approach would also have to explore many of the solutions that spatiotem-
poral algorithms implement [Korein and Badler 1983, Karis 2014, Marrs et al. 2018],
given that any fast camera or object movement would leave trails on the screen, among
other possible artifacts. However, it is possible that this can be useful in very con-
strained scenarios where the developer has control over these circumstances, such as a
top-down real-time strategy game.

1.4.4 No Filtering
If the scene or the shadow casters responsible for pinhole generation are nearly static,
the technique is very effective without any temporal filtering. It can be used for baked
lighting, where temporal filtering is unnecessary but the shadow phenomenon is desired on
the static receivers.
Additionally, if the user desires to have the projected (but static) pinholes interact
with dynamic objects, the pinhole buffer can be calculated and stored offline. This
removes the per-frame cost of pinhole detection, but requires a bit more infrastructure
in the engine. The runtime cost of evaluating the projected textures is still necessary,
however.

1.5 Results
Runtime performance is highly dependent on the scene and the number of pinholes per
frame. Shadow map resolution and the number of shadow-receiving pixels being evaluated
on screen also contribute to performance. The maximum possible pinhole count and the
uniform grid subdivision count are the two biggest performance factors, as is usual
when iterating over uniform grids, as Table 1.1 shows. The technique is also fill-rate
bound, considering that the pinhole search is done in the fragment shader of each receiver.
Because the radius of the projected light shape is proportional to the distance from
the pinhole to the receiver, the number of neighbor cells iterated increases when objects
are far away, negatively impacting performance, as can be seen in Table 1.2. Having
either a maximum search size or transitioning to a default soft shadow model might be
enough to hide this limitation. At very big pinhole sizes the approximation breaks down
visually, so it is desirable to prevent this from happening anyway.
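
As a rough illustration of why the cost grows with the search radius, the sketch below computes an upper bound on the number of grid cells that a square search window can overlap. The function name, the square-window simplification, and the example shadow map and grid sizes are assumptions chosen for illustration; they are not the configuration behind Table 1.2.

```python
import math


def cells_to_visit(radius_px, shadow_map_size, grid_size):
    """Upper bound on grid cells overlapped by a square window of half-width radius_px."""
    cell_px = shadow_map_size / grid_size  # cell width in shadow-map pixels
    # A window of width 2r overlaps at most ceil(2r / cell) + 1 cells per axis,
    # and never more than the whole grid.
    span = min(grid_size, math.ceil(2 * radius_px / cell_px) + 1)
    return span * span
```

Because the count grows with the square of the radius until the window covers the whole grid, capping the search radius bounds the per-pixel work.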

Shadow map size   16×16 grid, 64 pinholes   32×32 grid, 32 pinholes   64×64 grid, 16 pinholes
512×512           0.08/0.12/1.11            0.08/0.05/0.95            0.08/0.04/0.95
1024×1024         0.29/0.98/1.71            0.29/0.21/1.19            0.29/0.06/1.07
2048×2048         1.11/1.16/2.73            1.1/0.38/1.93             1.1/0.13/1.53

Table 1.1. Performance measurements for different uniform grid cell subdivisions and shadow
map sizes. The values are in milliseconds, and correspond to pinhole detection, filtering, and
rendering. Note that the shadow map size implicitly drives how many pinholes can be detected,
and their size. All measurements were done on an Nvidia 1070 GTX graphics card running the
sample scene fullscreen at 1920×1080, and captured with Nvidia’s Nsight performance tool.
Shape textures are 128×128, and the number of pinholes per cell is adjusted with the grid size
to keep the pinhole count constant.

Search radius (px)      20     51     102    204     512
Total frame time (ms)   0.97   2.15   5.30   16.39   64.10

Table 1.2. Example measurements for different search radii when rendering a shadow receiver.
As the radius increases, the number of cells required to iterate increases, negatively impacting
performance. This maximum search radius also limits how big the projected pinholes can be.
Time values are in milliseconds, and radius is defined in terms of the shadow map pixels.

Sampling light shape textures is another bottleneck, as it must occur for each
pinhole in a specific neighborhood inside the uniform grid. Considering this limitation,
it is better to design shapes that can be procedurally defined, reducing memory
lookups. Finally, downsampling the shadow map also helps substantially, as both the
lookups necessary for pinhole detection and the number of pinholes found decrease.
This also benefits pinhole detection by offloading the filtering to the rasterizer.
In practice, we found that the simple 9-tap pinhole detector works best, coupled
with the simpler temporal filtering approaches. In a way, if the pinholes are moving too
much the result becomes stochastic and approximate, but at least it is not jarring to the
eye. Using the more robust pinhole detection with a radius has multiple problems. First,
the complexity of estimating the mean and variance of the segmented kernel makes it
very expensive. However, our implementation used a very basic local mean and vari-
ance estimator, and by using shared memory, most of the impact from texture lookups
can probably be mitigated. Additionally, selecting a predefined radius is not good for
arbitrary scenes with animations, and even if we sampled at different radii and averaged
the results, this would require even more samples, making it impractical.
Finally, a comparison with PCSS can prove useful. A naive implementation of
PCSS can generate similar results, but requires many samples to reduce noise and cap-
ture pinholes, as it has no particular search method for them. It also cannot approximate
arbitrary light shapes, although sampling a shape texture may be an interesting ap-
proach for future work. Figure 1.8 shows the resulting shadows. Overall, our method
can achieve similar quality of high-frequency shadows as PCSS with fewer filtering
samples.

Figure 1.8. Left: our pinhole approximation with a 64×64 uniform buffer and 16 pinholes per
cell. Right: naive PCSS with 32 blocker search samples and 256 PCF samples. Note that PCSS
has better shadow edges, but cannot approximate the same optical properties (in the case of
non-circular shapes), and requires many more samples for its computation.

1.6 Conclusion and Future Work


We propose a novel approximation technique to represent a common visual phenome-
non that arises when rendering shadows from shapes that have high-frequency details
in light space. It enables artists and developers to enhance the aesthetic of complex
shadows in these circumstances, such as tree shadows. Aside from real-life scenarios
where it is applicable, this effect also lets users be creative with the projected shapes,
bringing non-photorealistic rendering to shadows. Figures 1.9 through 1.11 show ex-
ample renders running in real time.
This technique for implementing shadow pinholes can be optimized in several
ways. For example, to improve the performance, one could use an advanced spatial
acceleration structure for the pinhole search, such as a quadtree. Clever use of
downsampling techniques can also reduce the lookup complexity in most cases. To improve
the pinhole detection, one could use a moment map for approximating the variance. To
improve the temporal filtering, one could implement a more robust clustering approach
that doesn’t consider just close pairs of samples. We could also combine our technique
with PCSS, either by extending the blocker search with information from the pinhole
buffer, or by applying the same shape texture concept based on the blocker search region.
Finally, partial occlusion of the projected pinhole shapes by close occluders can also
aid in simulating this effect.

Figure 1.9. Left: Common shadow mapping with simple filtering (Unity). Right: Using a
circular shape to estimate the light shape.

Figure 1.10. Shadows during a simulated eclipse, where the area light is occluded procedurally.

Figure 1.11. Non-photorealistic shape textures, which can be used for artistic control of a
scene, or simulating uncommon light shapes (for example, an LED array light will generate
shadows similar to those from the light shape in the second image).

Bibliography
BRUCKS, R. 2018. Realistic Foliage Imposter and Forest Rendering in UE4. Game Developers
Conference 2018.
DONNELLY, W. AND LAURITZEN, A. 2006. Variance Shadow Maps. In Proceedings of the 2006
Symposium on Interactive 3D Graphics and Games, pp. 161–165. URL: http://doi.acm.org/
10.1145/1111411.1111440.

FERNANDO, R. 2005. Percentage-closer Soft Shadows. In ACM SIGGRAPH 2005 Sketches.
URL: http://doi.acm.org/10.1145/1187112.1187153.
HEITZ, E., HILL, S., AND MCGUIRE, M. 2018. Combining Analytic Direct Illumination and Sto-
chastic Shadows. In Proceedings of the 2018 Symposium on Interactive 3D Graphics and
Games, pp. 2:1–2:11. URL: http://doi.acm.org/10.1145/3190834.3190852.
KARIS, B. 2014. High Quality Temporal Anti-Aliasing. SIGGRAPH 2014.
KOREIN, J. AND BADLER, N. 1983. Temporal Anti-aliasing in Computer Generated Animation.
In Proceedings of ACM SIGGRAPH ‘83, pp. 377–388. URL: http://doi.acm.org/10.1145/
800059.801168.
MARRS, A., SPJUT, J., GRUEN, H., SATHE, R., AND MCGUIRE, M. 2018. Adaptive Temporal
Antialiasing. In Proceedings of the 2018 Conference on High-Performance Graphics, pp. 1:1–
1:4. URL: http://doi.acm.org/10.1145/3231578.3231579.
MINNAERT, M. 1937. Light and Color in the Outdoors. Springer, 1937.
PETERS, C. AND KLEIN, R. 2015. Moment Shadow Mapping. In Proceedings of the 2015 Sym-
posium on Interactive 3D Graphics and Games, pp. 7–14. URL: http://doi.acm.org/10.1145/
2699276.2699277.
PETTINEO, M. AND DE ROUSIERS, C. 2012. Depth of Field with Bokeh Rendering.
REEVES, W., SALESIN, D., AND COOK, R. 1987. Rendering Antialiased Shadows with Depth
Maps. In Proceedings of ACM SIGGRAPH ‘87, pp. 283–291. URL: http://doi.acm.org/10.
1145/37401.37435.
TUFTE, E. 1997. Escaping Flatland, Sculpture. URL: https://www.edwardtufte.com/tufte/
sculpture.
WILLIAMS, L. 1978. Casting Curved Shadows on Curved Surfaces. In Proceedings of the ACM
SIGGRAPH ‘78, pp. 270–274. URL: http://doi.acm.org/10.1145/800248.807402.
YANG, J., HENSLEY, J., GRÜN, H., AND THIBIEROZ, N. 2010. Real-time Concurrent Linked List
Construction on the GPU. In Proceedings of the 2010 Eurographics Conference on Rendering,
pp. 1297–1304. URL: http://dx.doi.org/10.1111/j.1467-8659.2010.01725.x.
2 Parallax-Corrected Cached Shadow Maps
Pavlo Turchyn

2.1 Introduction
Rendering shadows over large viewing distances often requires processing a large number
of shadow-casting objects. Many game engines that use shadow maps for long-range
shadow rendering opt for caching schemes that distribute the cost of shadow map
rendering over several frames, exploiting frame-to-frame coherency and rendering only
a subset of shadow casters per frame, e.g., Schulz and Mader [2014] and Acton [2012].
Some game engines cache occlusion data derived from shadow maps rather than keeping
plain shadow maps, e.g., Valient [2012] and Gollent [2014].

Figure 2.1. Shadows from a moving directional light rendered using two shadow map cascades.
The first (near) cascade is updated every frame, and the second (far) cascade is cached and in-
validated infrequently. The left image shows a mismatch between the cascades since the cached
cascade is rendered with a light direction that was captured many frames ago. Parallax correction
fixes this divergence as shown on the right image.


However, caching is problematic when the shadow casting light is dynamic, e.g.,
in a game with a dynamic day-night cycle where the sun or moon is constantly moving
across the sky, thus making cached data inconsistent with the current light state. One
has to either invalidate the cache often to keep the divergence small, which makes
caching a less efficient optimization, or treat cached shadows as a very rough
approximation of actual shadows because the error is too apparent when viewed up close.
In this paper, we describe a parallax correction algorithm for rendering sweeping
shadows from a dynamic light source using a static shadow map. The use of parallax
correction in Far Cry 5 enabled a fairly seamless transition between dynamic shadows
rendered with two different techniques: near shadows with cascaded shadow maps
(CSM) updated every frame, and far shadows with adaptive shadow maps (ASM) covering
a 500-meter range and updated every 2500 frames. As a result, we are using the relatively
expensive CSM for rendering shadows only within 30 meters of the player’s
camera, which is quite a short range for an open world game.

2.2 Parallax Correction Algorithm


When rendering images of a static scene from different viewpoints, one can reproject
pixels seen from one viewpoint to another provided that the image depth buffer is avail-
able and we know all required camera transforms, e.g., Mark et al. [1997]. It is straight-
forward to apply this approach to shadow maps: given a shadow map rendered with a
shadow camera built for light direction $L_0$, one can reconstruct world space positions
of all shadow map texels, and then render the resulting set of points (as a point cloud
or a mesh) into a shadow map for a different light direction $L_1$. It’s possible to employ
a more sophisticated method of interpolation [Yang et al. 2011]. Unfortunately, these
reprojection procedures are not very practical since they operate with the entire shadow
map, which is often a high-resolution image. Processing a lot of texels may be compu-
tationally expensive even with relatively simple shaders. Moreover, doing reprojection
over a set of shadow maps arranged via some spatial subdivision scheme, such as cas-
caded shadow maps, isn't straightforward for the border texels, which might end up in
a different cascade when rendered with a new light camera. Here we attempt to approx-
imate the reprojection for a set of pixels on screen instead of warping the entire shadow
map.
Assume we have a shadow map generated for a dynamic directional light with
direction $L_0$. Consider a point $P_0$ in world space as illustrated in Figure 2.2(a). We can
reconstruct the world space position of its occluder $P_{occ}$ by computing the distance to
the occluder $d_0$ from the shadow map depth:

$$P_{occ} = P_0 + d_0 L_0. \tag{2.1}$$

Suppose the light is moving, and its new direction is $L_1$. Let’s project $P_{occ}$ along the new
direction $L_1$ to get a point $P_1$:

$$P_1 = P_{occ} - d_1 L_1. \tag{2.2}$$

Figure 2.2. Parallax correction algorithm. (a) The shadow map is rendered for a light direction
$L_0$. The point $P_{occ}$ starts to occlude $P_1$ rather than $P_0$ when the light changes its
direction from $L_0$ to $L_1$. Our idea is to sample shadows at $P_0$ using the shadow map computed
for light direction $L_0$, and take the resulting shadow factor as the shadow intensity at $P_1$.
(b) We walk along the direction $D$ starting from $P_1$ in small increments, sampling the shadow
map at each point $S_i$ and accumulating occluder distance values. We stop after a certain number
of iterations. We compute the average value of the accumulated occluder distances, which gives
us an approximation of $P_0$ via Equation (2.4).

The practical meaning of Equation (2.2) is that we can compute shadows at $P_1$, i.e.,
$P_1$ will be in shadow for any $d_1 > 0$. So far we were following this reprojection route:
take a point $P_0$, reconstruct its occluder from a shadow map, and then use the
reprojected occluder when shading the scene. However, we are really interested in doing
these steps in the reverse order. For any given light direction $L_1$ and a point $P_1$, we want
to find the corresponding $P_0$, so that we can sample shadows at $P_0$ using the shadow
map computed for light direction $L_0$ and then use the resulting shadow factor for
shadow intensity at $P_1$.
Our parallax correction algorithm can be briefly described as this: we want to get
to $P_0$ from $P_1$. For this, we need two things: a good guess of the direction from $P_1$ to $P_0$
and a good guess of the length of the path from $P_1$ to $P_0$. Let’s elaborate how to obtain
these values. Substituting Equation (2.1) into Equation (2.2), we get

$$P_0 = P_1 + d_1 L_1 - d_0 L_0. \tag{2.3}$$

We attempt to solve Equation (2.3) by assuming $d_1 = k d_0$, where $k$ is a constant. We
will discuss how we choose the value $k$ later in this section. Here we only note that this
assumption enforces a certain relation between $P_1$, $P_{occ}$, and $P_0$. This gets us

$$P_0 = P_1 + d_0 (k L_1 - L_0). \tag{2.4}$$

Thus, if we want to compute shadows for light direction $L_1$ at an arbitrary point $P_1$, we
can use Equation (2.4) to find the corresponding point $P_0$ provided that we can compute
the distance to the occluder $d_0$ and choose a reasonable value of $k$.
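
The round trip behind Equation (2.4) can be checked numerically. The sketch below assumes that light directions point from the surface toward the light, so Equation (2.1) adds $d_0 L_0$ and Equation (2.2) subtracts $d_1 L_1$; the helper names and the sample values are hypothetical and serve only to illustrate the algebra.

```python
def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def scale(v, s):
    return tuple(x * s for x in v)


def forward(P0, d0, L0, L1, k):
    """Equations (2.1) and (2.2) with the assumption d1 = k * d0."""
    P_occ = add(P0, scale(L0, d0))       # occluder above P0 along L0
    P1 = sub(P_occ, scale(L1, k * d0))   # occluder projected along the new light
    return P_occ, P1


def inverse(P1, d0, L0, L1, k):
    """Equation (2.4): recover P0 from P1."""
    D = sub(scale(L1, k), L0)
    return add(P1, scale(D, d0))


L0 = (0.0, 1.0, 0.0)   # cached light direction (overhead)
L1 = (0.6, 0.8, 0.0)   # current light direction (unit length)
P_occ, P1 = forward((1.0, 0.0, 2.0), 2.0, L0, L1, 1.5)
P0 = inverse(P1, 2.0, L0, L1, 1.5)  # approximately (1.0, 0.0, 2.0) again
```

In the shader, $d_0$ is not known in advance, which is why the occluder search is needed to estimate it.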

Occluder search. As follows from Equation (2.4), $P_0$ is located somewhere on the ray
starting from $P_1$ in the direction

$$D = k L_1 - L_0. \tag{2.5}$$

We search for an occluder by marching along this ray for a certain number of iterations,
computing the occluder depth at each step, and then taking the average for $d_0$. This
process is illustrated in Figure 2.2(b) and Figure 2.3(a).
The search distance $d_s$ is a scene-dependent parameter that accounts for the maximum
displacement of the shadows due to the parallax we are expecting. That is, we need a
larger search distance for a scene with long shadows and tall shadow casting structures.
Conversely, the search distance may be shorter for a scene with an overhead light and
small shadow casters. An increase in the difference between $L_0$ and $L_1$ also increases
shadow parallax and thus the search distance.

Choosing the parameter k. The point $P_0$ is located somewhere on the ray originating
at $P_1$, with the ray direction $D$ being controlled by the parameter $k$. Figure 2.3(b)
illustrates that changing the value of $k$ results in a different $D$, with values $k > 1$
corresponding to the point $P_0$ being closer to the occluder than $P_1$. Ideally, having $P_0$ on
the surface of an object containing $P_1$ would give the most accurate shadows, but this is
hardly possible in practice. In our example, $P_0$ will be either above the surface if we
choose $k = 2$ or $k = 3$, or even below the surface if we choose $k = 1.5$.

Figure 2.3. Occluder search. (a) Occluder search with 5 iterations and $k = 1.5$. We are sampling
shadow map depths at each step to compute the distance from a point on the ray to its occluder,
if there is any. (b) Effect of parameter $k$ on search vector $D$. The value $k = 1.5$ would give
incorrect results because some points on the search ray are located inside shadow casting
geometry.
The best value for $k$ depends on the scene. Consider a difficult situation when we
want to compute parallax correction at a point located on a concave surface. It’s quite
probable that the occluder search may be testing points under the surfaces of nearby
objects as illustrated in Figure 2.3(a), thus interpreting surrounding geometry as an
occluder. Choosing a larger value for $k$ can help prevent these errors in concavities, as
shown in Figure 2.3(b). However, larger values of $k$ can cause the method to miss
smaller occluders, thus producing inaccurate results. Due to this tradeoff, one needs to
pick the value of $k$ that produces the best results for a given scene.
We use the following empirical formula, where $k$ depends on the magnitude of the
difference between light directions:

$$k = 1 + \lVert L_1 - L_0 \rVert. \tag{2.6}$$

Finally, we can gather all the bits we have described so far into the sample code given
in Listing 2.1.
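
A small numeric sketch shows how Equations (2.5) and (2.6) behave. It assumes unit-length light directions given as tuples; the function name is hypothetical.

```python
import math


def search_direction(L0, L1):
    """Return (k, D) with k = 1 + |L1 - L0| and D = k * L1 - L0."""
    diff = [b - a for a, b in zip(L0, L1)]
    k = 1.0 + math.sqrt(sum(c * c for c in diff))
    D = tuple(k * b - a for a, b in zip(L0, L1))
    return k, D


# Identical directions: k is 1 and D vanishes, so P0 coincides with P1
# and no correction is applied.
k_same, D_same = search_direction((0.0, 1.0, 0.0), (0.0, 1.0, 0.0))

# Diverging directions: k grows with the difference, lengthening D.
k_moved, D_moved = search_direction((0.0, 1.0, 0.0), (0.6, 0.8, 0.0))
```

The vanishing search direction for identical light directions is a useful sanity check for any implementation of the correction.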

Implementation enhancements. The accuracy of shadows created with parallax correction
generally degrades as the angle between the current light direction and the original
light direction increases. It’s possible to reduce the divergence and thus the visual
deficiencies if we know the animation curve of the light $L(t)$, where $L$ is the light
direction and $t$ is time. Assume we want to update the shadow map at regular time
intervals with the time step $\Delta t$, i.e., if we first update the shadow map at $t_0$, then the
following updates will happen at $t_0 + n\Delta t$, where $n$ is a positive integer. The best
strategy when updating the shadow map at $t_0$ would be to take the light direction halfway
between directions $L(t_0)$ and $L(t_0 + \Delta t)$, or just take $L(t_0 + \Delta t / 2)$ if the light’s
animation speed is constant. This minimizes angles between the light directions used for
shading and the direction of the shadow map. Even though lookahead sampling requires
us to give the shadows subsystem access to the curve $L$ (thus making the implementation
more complicated), we found that the resulting improvement in quality is
worth the effort.
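
The benefit of lookahead sampling can be illustrated with a toy animation curve. In the sketch below, `sun_direction` is a hypothetical constant-speed curve standing in for $L(t)$, and the period and time step are arbitrary example values.

```python
import math


def sun_direction(t, period=1200.0):
    """Hypothetical light curve L(t): a sun circling in the x-y plane at constant speed."""
    a = 2.0 * math.pi * t / period
    return (math.cos(a), math.sin(a), 0.0)


def angle_between(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding


def worst_case_angle(bake_time, t0, dt, samples=100):
    """Largest angle between the baked direction and L(t) over [t0, t0 + dt]."""
    baked = sun_direction(bake_time)
    return max(angle_between(baked, sun_direction(t0 + dt * i / samples))
               for i in range(samples + 1))


no_lookahead = worst_case_angle(0.0, 0.0, 100.0)    # bake with L(t0)
with_lookahead = worst_case_angle(50.0, 0.0, 100.0)  # bake with L(t0 + dt/2)
```

For this constant-speed curve, baking at the midpoint halves the worst-case angular error over the update interval.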
Another way to improve parallax-corrected shadows is adding a cross-fade when
updating cached shadow maps to the most recent light state. We are assuming that two
shadow map textures will be available at this point. It makes sense to render the up-
dated shadow map into a separate texture because its rendering may take several frames
to complete (e.g., it takes more than 80 frames to update the cached shadow map in
Far Cry 5) and we still need to apply shadows to the scene while the new shadow map
is only partially rendered, and thus it can’t be used for shading. With both the old and
new shadow maps ready, we can perform a cross-fade between shadow factors sampled
from these textures, which is less visible than an immediate switch between the textures
in a single frame.
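
The cross-fade itself amounts to a linear blend of the two shadow factors. The sketch below is a minimal illustration; the function name and the 30-frame fade duration are assumptions, not values from the Far Cry 5 implementation.

```python
def crossfade_shadow(old_factor, new_factor, frames_since_swap, fade_frames=30):
    """Blend shadow factors sampled from the old and the new cached shadow maps."""
    w = min(1.0, max(0.0, frames_since_swap / fade_frames))  # fade progress in [0, 1]
    return old_factor * (1.0 - w) + new_factor * w
```

During the fade both textures must stay resident and both are sampled (each with its own light direction for parallax correction), so the blend window is best kept short.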

uniform float3 L0; // shadow map light direction
uniform float3 L1; // current light direction
uniform float4x3 searchParams; // contains search distance, etc.
uniform float4x3 worldSpaceToShadowMap;

float CalcShadow(float3 P1 /* a point in world space */)
{
#if ENABLE_PARALLAX_CORRECTION
    float k = 1.0 + length(L1 - L0); // Equation (2.6)
    float3 D = k * L1 - L0;          // Equation (2.5)

    // Occluder search: march along D and average the distances to the
    // occluders found in the cached shadow map.
    float3 S = P1;
    float3 dS = mul(float4(D, 1), searchParams);
    float sum = 0, cnt = 0;
    for (int i = 0; i < OCCLUDER_SEARCH_ITS; ++i)
    {
        float3 Sp = mul(float4(S, 1), worldSpaceToShadowMap);
        float occDepth = SampleShadowMapDepthTexture(Sp.xy);
        if (occDepth < Sp.z)
        {
            sum += Sp.z - occDepth;
            cnt += 1;
        }
        S += dS;
    }

    float d0 = cnt > 0 ? sum * rcp(cnt) : 0;

    float3 P0 = P1 + d0 * D; // Equation (2.4)
#else
    float3 P0 = P1;
#endif
    // Compute shadow factor at P0
    float3 Pp = mul(float4(P0, 1), worldSpaceToShadowMap);
    return step(Pp.z, SampleShadowMapDepthTexture(Pp.xy));
}

Listing 2.1. Example implementation of the parallax correction for a simple orthogonal shadow
map.

Algorithm limitations. While the algorithm works well as long as the parameters $d_s$
and $k$ are chosen appropriately, a high variation in depth of overlapping shadow casters
results in incorrect parallax correction. It is caused by the occluder search hitting the
furthest occluder and computing parallax correction using a biased occluder distance,
which may distort shadows from closer shadow casters overlapping with shadows from
more distant ones, as shown in Figure 2.4. This is similar to the occluder fusion problem
existing in some soft shadow algorithms, such as percentage-closer soft shadows
(PCSS) [Fernando 2005].

Figure 2.4. Defects occurring with high depth variation of overlapping shadow casters. From
left to right: a shadow from the cone overlaps with a shadow from the cylinder, which is
much taller and further away from the camera; the PCSS estimator gives an incorrect penumbra
size, resulting in a large penumbra near the cone base; our parallax correction algorithm also
produces incorrect results for the same reason (overestimation of the distance to the occluder),
resulting in shadow distortion; the penumbra of the parallax-corrected shadows is incorrect too.

2.3 Applications of Parallax Correction


Parallax correction may be applied to a number of algorithms that utilize shadow map
caching. Generally, a static shadow-casting algorithm can be used to generate the initial
shadow map. Parallax correction can then be applied to adjust the map as lights move.
However, the caches need to be invalidated from time to time since parallax correction
is an approximate technique.

Cached cascaded shadow maps. CSM caching implies that not all cascades are
updated within one frame. One way to do that is skipping cascade updates for a certain
small number of frames, either replacing cascade updates with another workload of
similar complexity or updating distant cascades in a round-robin manner. The system
keeps using matrices and shadow map textures cached from previous frames to apply
shadows to the current scene. We suppose that applying parallax correction in this sce-
nario doesn’t offer a lot of improvements since shadow maps are meant to be updated
quite frequently (every other frame or so), thus small discrete changes in light direction
aren’t too noticeable.
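
A round-robin schedule of this kind can be sketched as follows. The function name and the choice of keeping one near cascade fresh every frame are assumptions for illustration.

```python
def cascades_to_update(frame, num_cascades=4, always_fresh=1):
    """Update the nearest cascades every frame and one distant cascade per frame."""
    near = list(range(always_fresh))
    distant = num_cascades - always_fresh
    if distant <= 0:
        return near
    # Cycle through the distant cascades, one per frame.
    return near + [always_fresh + frame % distant]
```

With four cascades, each distant cascade is refreshed every third frame, so its cached light direction is at most a few frames old.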

The work by Schulz and Mader [2014] employs a different approach to caching,
with a single shadow map containing only static objects replacing the last two cascades.
Shadows cover a 1.4 km range, so the full update of this shadow map takes 10–15 ms
distributed over many frames. In this scenario, parallax correction can reduce the visual
discontinuity between the cascades that is caused by the lengthy shadow map update
process, similar to what is demonstrated in Figure 2.1. One should just update the
static shadow map periodically rather than only at certain preset points in game levels.
Acton [2012] utilizes a toroidal update scheme, also found in other algorithms
such as clipmaps [Asirvatham and Hoppe 2005], to minimize the number of shadow
casting objects rendered into cascades. Their main observation is that moving the
player’s camera only changes the cascade’s frustum translation, but not its size or ori-
entation, as long as the shadow casting light is static. Typically there’s only a small
difference between the current frustum and the frustum from the previous frame. Thus,
the contents of the shadow map would be nearly the same save for a few small regions. One
can perform a toroidal update reusing a large portion of the previously rendered shadow
map, and only rendering objects falling into the parts near the shadow map border that
were invisible previously. Parallax correction improves consistency between the cas-
cades updated every frame and the cascades updated via toroidal update, thus allowing
cached data to be reused for a longer period before the cached cascades need to be
rebuilt with a new light direction.
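
The bookkeeping behind a toroidal update can be sketched in one dimension. The code below is an illustration under stated assumptions, not the scheme from Acton [2012]: origins are integer texel indices of the shadow map's left edge in world-aligned texel space, and texels are stored at their world index modulo the map size.

```python
def toroidal_columns_to_render(old_origin, new_origin, size):
    """Wrapped texel columns that must be re-rendered after a horizontal scroll."""
    shift = new_origin - old_origin
    if abs(shift) >= size:
        return list(range(size))  # scrolled too far, nothing can be reused
    if shift > 0:
        world_cols = range(old_origin + size, new_origin + size)  # newly exposed on the right
    else:
        world_cols = range(new_origin, old_origin)                # newly exposed on the left
    # Wrap-around addressing: newly exposed columns overwrite the ones that scrolled out.
    return sorted(c % size for c in world_cols)
```

Only the returned columns need shadow casters rendered into them; all other texels are reused as-is, which is valid only while the light direction used for the cached map stays fixed.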

Adaptive shadow maps. Shadow map caching is an essential part of the adaptive
shadow maps algorithm, e.g., Turchyn [2011]. It’s possible to discretize light direction
movements and build a separate hierarchy of tiles for each quantized light direction.
Aside from possibly noticeable steps in the light directions, this also implies that we
have to maintain two hierarchies whenever we want to update the shadow maps. One
of the hierarchies is used for shading the current frame, and the other one is in the
process of construction. Having two hierarchies at the same time means we need to
double the size of the tile cache, and we also pay extra costs to render the tiles. Parallax
correction addresses these issues.
We can start using the tiles rendered with the new light direction as they become
available, rather than waiting until the full tile hierarchy is ready. This way shadows
are sampled from a mix of old and new tiles with the parallax correction ensuring
shadow consistency as shown in Figure 2.5. Therefore, we can start discarding old tiles
as the new tiles become available, thus reducing tile cache memory requirements and
improving cache utilization.

2.4 Results
A major challenge in the development of Far Cry 5 was the addition of new rendering
techniques, such as screen-space reflections, that were not present in the engine
previously. The existing subsystems had to become faster to accommodate the new tech,
thus the shadow rendering budget was reduced from 6 ms to 4.5 ms. The Far Cry series
has a long history of using cached shadow maps [Valient 2012]. However, cached shadows
were always treated as a low-quality solution for objects further away from the
player’s camera, hence cascaded shadow maps used to cover quite a large viewing distance.
Adding parallax correction and improving cached shadow map filtering quality
allowed having cached shadows closer to the camera, thus reducing the range covered
by CSM from 80 to 30 meters. Table 2.1 shows examples of the resulting performance
improvements.

Figure 2.5. Parallax correction not only enables smooth sweeping shadows with the adaptive
shadow maps algorithm, but also makes it possible to start evicting old tiles from the cache
before the update is fully finished without having the discontinuities visible in the left image.
(a) ASM shadows from a mix of new and old tiles. (b) The same set of tiles with parallax
correction applied.
We are using adaptive shadow maps for shadows covering the range from 30
to 500 meters from the camera. A typical cost of rendering a single ASM tile is around
0.5 ms on the GPU and up to 1.5 ms on the CPU on PS4. An ASM light direction

Test   Number of objects in CSM     CSM GPU render time, ms
       80 meters    30 meters       80 meters    30 meters
a      310          104             3.8          2.3
b      132          68              3.9          2.8
Table 2.1. Reducing the range of cascaded shadow maps was a large win performance-wise.
Far Cry 4 used CSM with three cascades covering 80 meters range from the player’s camera.
We have reduced the range of three cascades down to 30 meters in Far Cry 5, since parallax-
corrected adaptive shadow maps are good enough to be used at ranges closer than 80 meters. A
great side effect of this range reduction was the increase of CSM texel density around the player.

update is triggered every 1–2 minutes of normal gameplay time, so we’re avoiding the
update costs in the vast majority of frames. Our implementation of the parallax correc-
tion is relatively lightweight, with the typical GPU cost being 40–70 µs on PS4. We are
performing the occluder search with 7 steps over a low-res downsampled depth map
generated using a min-depth kernel over the shadow map (see depth extent map in
Turchyn [2011]), which improves search accuracy while keeping the number of itera-
tions low.
The importance of consistency between cached and dynamic shadows is clearer in
motion than in static screenshots such as Figure 2.6. Without parallax correction, our very
infrequently updated cached shadows often resulted in completely different lighting
when transitioning between dynamic and cached shadow maps. In motion,
this change between the two types of shadows was perceived as a cross-fade
between unrelated images rather than a change in a shadow’s details. See this book’s
online sample code for a demonstration of parallax correction in motion.

Bibliography
ACTON, M. 2012. CSM Scrolling, An Acceleration Technique for The Rendering of Cascaded
Shadow Maps. Advances in Real-Time Rendering in Games course, SIGGRAPH ’12.
ASIRVATHAM, A. AND HOPPE, H. 2005. Terrain rendering using GPU-based geometry clipmaps.
In GPU Gems 2, pp. 27–45. Addison-Wesley.
FERNANDO, R. 2005. Percentage-closer soft shadows. In ACM SIGGRAPH 2005 Sketches.
URL: http://doi.acm.org/10.1145/1187112.1187153.
GOLLENT, M. 2014. Landscape Creation and Rendering in REDengine 3. Game Developers
Conference 2014.
MARK, W., MCMILLAN, L., AND BISHOP, G. 1997. Post-rendering 3D Warping. In Proceedings
of the 1997 Symposium on Interactive 3D Graphics, pp. 7–ff.
SCHULZ, N. AND MADER, T. 2014. Rendering Techniques in Ryse: Son of Rome. ACM
SIGGRAPH ’14.
TURCHYN, P. 2011. Fast Soft Shadows via Adaptive Shadow Maps. In GPU Pro 2, pp. 215–
224, A K Peters.
VALIENT, M. 2012. Shadows in Games—Practical Considerations. Real-Time Shadows course,
SIGGRAPH ’12.
YANG, L., TSE, Y., SANDER, P., LAWRENCE, J., NEHAB, D., HOPPE, H., AND WILKINS, C. 2011.
Image-based Bidirectional Scene Reprojection. In ACM Trans. Graph., 30:6, pp. 150:1–
150:10.

Figure 2.6. Parallax correction improves the transition between cascaded shadow maps at the
foreground and adaptive shadow maps at the background in Far Cry 5. (a) A mismatch between
long-range and dynamic shadows due to a slow update of cached long-range shadows. (b) Paral-
lax correction fixes the mismatch so that the long-range shadows are perceived as a level of
details of dynamic shadows rather than something unrelated.
IV
3D Engine Design

Welcome to the 3D Engine Design section of GPU Zen’s second volume. The five
chapters presented here are a reflection of the latest trends in modern 3D engine design,
as shown through advances in realism, material synthesis, ray-tracing, as well as tar-
geting the latest graphics API standards.
The section starts with Sergey Makeev’s chapter “Real-Time Layered Materials
Compositing Using Spatial Clustering Encoding”. Sergey presents an algorithm that
mimics the Allegorithmic Substance texture pipeline as closely as possible, but in real
time. It uses a layered materials method which allows us to create composite materials
using a large number of layers. This technique was successfully applied in the rendering
of armored vehicles in the published action multiplayer tank game “Armored
Warfare”.
Next, Thomas Deliot and Eric Heitz’s chapter “Procedural Stochastic Textures by
Tiling and Blending” describes a production-ready algorithm that synthesizes infi-
nitely-tiling stochastic textures from small input texture examples. The technique runs
in a fragment shader and requires no more than 4 texture fetches and a few computa-
tions.
The third chapter in this section is “A Ray Casting Technique for Baked Texture
Generation” by Alain Galvan and Jeff Russell. This chapter shows how to bake high-
polygon geometry to textures meant to be used by low-polygon geometry using GPU
ray-casting. Computation times are reduced drastically compared to classical CPU-
based baking tools. The chapter shows example shaders to bake various types of tex-
tures, as well as highlighting a number of potential pitfalls inherent in the process.
In the fourth chapter “Writing an efficient Vulkan renderer”, Arseny Kapoulkine
explores key topics for implementing Vulkan in modern 3D engines. The topics include
memory allocation, descriptor set management, command buffer recording, pipeline
barriers, and render passes. The chapter also discusses ways to optimize CPU and GPU
performance of production desktop/mobile Vulkan renderers today, and looks at
what a future-looking Vulkan renderer could do differently.
The fifth chapter “glTF - Runtime 3D Asset Delivery” by Marco Hutter explains
Khronos Group’s glTF – a transmission and delivery format for 3D assets. The chapter
starts with the goals and features that are achieved with glTF and their technical im-
plementation. Then the role of glTF in the 3D content creation workflow is laid out,
showing the tools and libraries that are available to support each step of the content


creation process, and how glTF may open up new application areas that rely on the
efficient transfer and rendering of high-quality 3D content.
I hope you enjoy learning from this section’s authors’ experiences, and do not
hesitate to share with us your latest findings and experiences around 3D Engine Design!

Welcome!

—Wessam Bahnassi
1
Real-Time Layered Materials Compositing Using Spatial Clustering Encoding
Sergey Makeev

1.1 Introduction
Most modern rendering engines use a library of simple, well-known materials and a
layered material representation to author detailed, high-quality in-game materials.
Popular tools used in texturing pipelines today (e.g., Allegorithmic Substance Painter
and Quixel NDO Painter) are also based on the concept
of layered materials [Neubelt and Pettineo 2013, Deguy et al. 2016, Karis 2013].
In this chapter, we present an algorithm that uses a layered materials method to
create composite materials from a large number of layers in real time. Our
algorithm is designed to mimic the Allegorithmic Substance texture pipeline as closely as
possible, but in real time. The proposed technique is based on the blending of multiple
well-known materials, where a shared materials library defines the surface properties
for each material used in compositing.
Using our method, each mesh can have one unique UV set and several unique
texture blend masks where each blend mask defines the per-pixel blending weights for
the material from the library. Each material from the library can use a detail textures
technique to improve surface details resolution. Using the materials with detail textures
for the composition has the advantage of breaking the texture resolution barrier and
allows us to produce a final composition at a very high resolution. Having high-resolu-
tion in-game materials is especially crucial in the 4K era.
Our method supports the replacement of a library of materials and transparency
modifications for the texture blend masks at runtime. Material replacement at runtime
leads to a different visual appearance of the resulting composited material which is
especially important for games supporting User Generated Content or in-game


Customization. The presented technique is used for rendering armored vehicles in Ar-
mored Warfare, an action multiplayer tank game published by My.com.

1.2 Overview of Current Techniques


A popular method for compositing is to pre-bake a multilayered material into a set of
textures (albedo, normal, roughness, etc.) that are used by the game engine. The resulting
texture set is designed for a specific mesh and cannot be shared between
different meshes. We call this method “Static Material Layering” [Neubelt and
Pettineo 2013, Deguy et al. 2016]. This approach gives good results but requires a lot of
GPU memory to store the final high-resolution textures.
To break this limitation, some modern rendering engines use a technique that is
similar to “Texture Splatting” [Bloom 2000], a method that has already been proven
for rendering terrain. To produce the final composited texture,
these engines blend several textures using a pixel shader and a set of texture blend
masks which define the transparency of the blending. We call this method “Dynamic
Material Layering” [Inside Unreal 2013, Noguer 2016]. This approach works well, as
shown in Figure 1.1, but is limited to a small number of simultaneous material layers
due to memory and performance limitations.

Figure 1.1. An example of Dynamic Material Layering. This example mesh uses three texture
blend masks to define the blending transparency and three library materials to define the surface
properties.

1.3 Introduced Terms


Since different game engines and material authoring pipelines use different terms, here
are definitions which are used in this article.

• Material Template. One single well-known material such as gold, steel, wood,
  etc. A Material Template can use a tiled detail texture to give the illusion of greater
  detail for a material. Material Templates are used as basic blocks to create complex
  multi-layered materials.

• Material Mask. A grayscale texture which is used for defining transparency
  while compositing different Material Templates. Usually these textures are created
  by modern texturing tools like Allegorithmic Substance Painter and Quixel
  NDO Painter using a semi-procedural approach.

• Color ID. A color-coded texture which defines areas of UVs that belong to different
  opaque materials. The opaque material does not have a blend mask associated
  with it, and it is always used as a bottom layer in our composition. A
  color-coded representation where each unique color represents a single material
  is used to simplify the content pipeline and reduce the number of required textures.
  Each color-coded texture can represent several opaque materials as shown
  in Figure 1.2.

• Layered Material. This is a material definition which is used to build the final
  composite material. Each Layered Material has a single Color ID associated
  with it and an ordered set of Material Masks which define the composition order
  and the blending weights of the materials. A Layered Material also has a set of
  Material Templates associated with it to define the visual appearance of each
  material used in a composition.

1.4 Algorithm Overview


Existing solutions for Dynamic Material Layering [Inside Unreal 2013, Noguer 2016]
store the transparency for different materials in the RGBA channels of a texture.
When each material mask covers only a small area of the texture, such solutions
are inefficient in terms of memory consumption: much of the texture space is not used for
the composition and is wasted.
We observe that blend masks are usually coherent in the texture space and only
partially overlap each other. Using this observation, we propose storing the different
non-overlapping blend masks in the same texture channels. To achieve this, we group
several Material Masks into sets that we call Clusters. We build the clusters based on
the connectivity between texels in texture space. This allows us to use the texture space
more effectively since the different clusters can store their blend masks in the same

Figure 1.2. Several opaque material masks combined into a single Color ID map and applied
to the mesh.

shared texture channels. At runtime we use this clustered representation and the set of
the material templates to make the final material composition as shown in Figure 1.3.
Since we are storing the different blend masks in the same texture channels, it is
critical to take into account texture filtering boundaries between different clusters.
Texture filtering across different blend masks leads to errors during the composition stage due
to the leaking of texture blend masks from one material into another. While building
the material clusters, we determine which neighboring pixels are involved in texture
filtering and use this information to create the clusters.
To create the initial partitioning into the clusters, we perform a connectivity anal-
ysis for the set of Material Masks. Connectivity analysis classifies all the texels which
are used for texture filtering as connected. If the texels are classified as connected, they
will belong to the same material cluster. When we perform a connectivity analysis, we
should also take mipmap texture filtering into account. At the same time, we should
limit the number of supported mipmap levels; otherwise, at the very last mipmap level
all the texels will be classified as connected. For our implementation, we decided to
support only the first four mipmap levels. Smaller mipmap levels are not handled by
our implementation and are discarded. Supporting only the first four mipmap levels is
enough to preserve good texture filtering quality and keep the number of
resulting clusters small. An incomplete mipmap chain might lead to aliasing, but its
level is acceptable [Mittring 2008]. In practice, the resulting aliasing is usually barely
visible and is effectively removed by most modern anti-aliasing algorithms.
Each resulting cluster should not contain more than a limited number of materials
where the number of materials depends on how many per pixel material layers we need
to support. In practice, the number of materials used in the cluster is usually equal to

Figure 1.3. An example of the use of the presented technique. Several texture blend masks
encoded as a single RGB weights texture and a single cluster indirection texture. Encoded mate-
rial blend masks and material templates from the library are composited to get the final image.

four or five since we store the cluster blend weights in the BC1 or BC3 texture format.
A practically unlimited total number of materials and five per-pixel materials is enough
to represent even a very complex layered material.
Constructing the clusters with a limited number of materials is not always possible,
since a group of connected materials can be larger than the maximum allowed number of
materials per cluster. As a result, we can find a cluster which uses more than the
maximum allowed number of materials. We can split such clusters into several smaller ones
that meet our initial requirements. This leads to a texture filtering error for the texels
shared between the cluster edges: the texture unit filters between blend masks from
different clusters, in which different materials
are encoded (see Figure 1.4). We propose a solution that minimizes leaking of the
texture blend masks while splitting such clusters. For more details see Section 1.5.7.

1.4.1 Spatial Clustering Encoding Representation


At the preprocessing stage, we build a clustered representation using a single Color ID
to define all the opaque materials and an ordered set of Material Masks to define all

Figure 1.4. The texture unit uses blend masks from different Clusters during texture filtering.
The result of such filtering is a leaking of the material boundaries which leads to visual artifacts
in the composition stage.

transparent materials. As a result of preprocessing, we get a dataset which is used for
the runtime composition and contains three different types of data.

• Cluster Indirection. An indirection texture that defines the current cluster ID
  for a specified texel. The cluster ID is stored using an integer texture format and
  defines which set of materials should be used for a given texel. Cluster Indirection
  is stored in a texture that has a lower resolution than the source Material Masks.
  Neighboring texels usually use the same set of materials, and the set of used
  materials rarely changes, which allows us to use a lower-resolution texture to store
  this data. Since this texture contains integer data, it cannot use any texture
  filtering. Cluster Indirection is stored in a texture without mipmaps and fetched
  using a POINT texture filtering mode.

• Cluster Weights. The weight texture defines the blending weight for each material
  in the set of materials which is specified by the cluster ID. We support up to
  five different material masks per pixel where the weights are stored using the
  BC3 texture format. Cluster Weights are stored in a texture that has the same
  resolution as the input Material Masks: although the set of used materials rarely
  varies, neighboring texels of a blend mask can differ significantly. Cluster Weights
  can be correctly filtered inside the same material cluster. Texture data is stored
  with mipmaps and fetched using a TRILINEAR or ANISOTROPIC texture filtering
  mode.

• Cluster Properties. Defines material surface properties such as albedo, roughness,
  metalness, etc. which are used for the final composition. We decided to
  use a Structured Buffer to store the Cluster Properties. Depending on the
  implementation, Cluster Properties can also be stored in a Constant Buffer.

Figure 1.5. An example representation and usage of the encoded data.

At the composition stage, we obtain the cluster ID for each fragment which defines
a set of used materials and the blending weights. Then using the Cluster Properties, we
obtain the surface properties for each material which are used in the current cluster.
Afterwards, we use the blend weights and the surface properties for the final composi-
tion of the surface properties for a given fragment. See Figure 1.5 for more details.

1.4.2 Order-independent Representation for Blend Masks


For the final composition, we need to blend materials in the correct order, as defined
in the input data. The most common way of doing this is to perform alpha blending
and composite the fragments in a back-to-front order using the following equation:

\[ C_{\mathrm{final}} = C_{\mathrm{src}}\,\alpha + C_{\mathrm{dst}}\,(1 - \alpha). \]



Then we repeat this operation for all the blend masks used for blending:

\[ C_{\mathrm{final}} = C_n \alpha_n + (1 - \alpha_n)\bigl(\cdots\bigl(C_2 \alpha_2 + (1 - \alpha_2)\bigl(C_1 \alpha_1 + (1 - \alpha_1)\,C_0\bigr)\bigr)\cdots\bigr). \]

This approach depends on the order of operations and instead of using alpha-blending
we can rewrite the blending equation in weighted form:

\[ C_{\mathrm{final}} = w_0 C_0 + w_1 C_1 + w_2 C_2 + \cdots + w_n C_n, \]

where:

\[ w_k = (1 - \alpha_n)(1 - \alpha_{n-1}) \cdots (1 - \alpha_{k+1})\,\alpha_k. \]

Here \(\alpha_0 = 1\), because the bottom layer \(C_0\) is opaque. Since the resulting
weights are normalized, we know that the sum of all weights is
always equal to one. We can use this property to reconstruct one of the weights inside
a compositing pixel shader instead of storing this weight in the texture channel:

\[ w_n = 1 - w_0 - w_1 - w_2 - \cdots - w_{n-1}. \]

For our implementation, we decided to use the order-independent weighted representation
for the texture blend masks. The order-independent representation allows us to
swap the texture channels inside the cluster freely. This property can be very useful for
several further optimizations.
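
The equivalence between back-to-front alpha blending and the weighted form can be checked numerically. The following is a minimal Python sketch, not the chapter's implementation: the function names and the scalar stand-in "colors" are illustrative assumptions, and the opaque base layer is modeled with an alpha of one.

```python
def alpha_blend(colors, alphas):
    """Back-to-front alpha blending; colors[0] is the opaque base layer."""
    result = colors[0]
    for color, alpha in zip(colors[1:], alphas[1:]):
        result = color * alpha + result * (1.0 - alpha)
    return result


def blend_weights(alphas):
    """Weights w_k = alpha_k * prod_{j > k} (1 - alpha_j), with alphas[0] = 1."""
    weights = [0.0] * len(alphas)
    accum = 1.0  # fraction still visible through the layers above
    for k in range(len(alphas) - 1, -1, -1):  # top layer first
        weights[k] = alphas[k] * accum
        accum *= 1.0 - alphas[k]
    return weights


# Hypothetical data: scalar "colors" keep the example short.
colors = [0.2, 0.9, 0.5, 0.1]
alphas = [1.0, 0.3, 0.0, 0.6]  # alphas[0] = 1 for the opaque base layer
weights = blend_weights(alphas)
weighted = sum(w * c for w, c in zip(weights, colors))
```

Because `weighted` is a plain sum of products, the layers can be visited in any order, which is what makes it possible to swap channels inside a cluster.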

1.5 Algorithm Implementation

1.5.1 Extract Background Materials


First, we define an opaque material for each texel. The opaque material is also used to
determine which texels are inside the UV mapping and which ones are not. Then we
determine which texture resolution we should use for the last supported mipmap
level. For each texel inside this mip level, we generate a Color ID value from the original
high-resolution Color ID texture. To avoid situations where the resulting mipmap texel
uses several different opaque materials, we forbid using different Color IDs in
neighboring texels. This natural limitation allows us to find potential clustering
issues at a very early stage of the art pipeline and helps create Color ID maps which
can be efficiently clustered. It also allows us to skip the connectivity analysis for all
the opaque materials, since different opaque materials never share neighboring
texels.
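
The step above can be sketched as a block-wise reduction of the Color ID map. This is an illustrative Python sketch under stated assumptions: the map is a list of rows, `None` marks texels outside the UV mapping, `block` is the footprint of one low-resolution texel (for example 8 if four mipmap levels are supported), and the per-block check is only an approximation of the "no different Color IDs in neighboring texels" rule.

```python
def downsample_color_id(color_id, block):
    """Reduce a Color ID map so each output texel covers a block x block area.

    Raises ValueError when one block mixes different opaque materials.
    """
    height, width = len(color_id), len(color_id[0])
    result = []
    for by in range(0, height, block):
        row = []
        for bx in range(0, width, block):
            ids = {
                color_id[y][x]
                for y in range(by, min(by + block, height))
                for x in range(bx, min(bx + block, width))
                if color_id[y][x] is not None
            }
            if len(ids) > 1:
                raise ValueError(f"block ({bx}, {by}) mixes Color IDs {sorted(ids)}")
            row.append(ids.pop() if ids else None)
        result.append(row)
    return result


# Hypothetical 4x4 map with three opaque materials and some unused texels.
example_map = [
    [1, 1, 2, 2],
    [1, None, 2, 2],
    [None, None, 3, 3],
    [None, None, 3, 3],
]
low_res = downsample_color_id(example_map, 2)
```

Reporting the offending block early is one way to surface clustering problems to artists before any further preprocessing runs.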

1.5.2 Material Layers


At this step, we have the ordered set of texture blend masks called Material layers. Each
material layer defines the transparency of the blending for each texel. We downsample

each material layer using a MAX filter to the resolution corresponding to the last
supported mipmap level. If the resulting texel was marked as unused in the previous step,
this texel is located outside of the valid UV mapping and will not be used in the
composition.
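
A MAX-filter downsample can be sketched as follows. This is a minimal Python illustration, assuming the mask is a list of rows of floats and that `used` is a low-resolution boolean map derived from the previous step; both names are hypothetical. A MAX filter is conservative: any coverage inside a block survives the reduction, so a layer is presumably never missed by the later connectivity analysis.

```python
def downsample_max(mask, block, used):
    """MAX-filter a blend mask down to one texel per block x block area."""
    height, width = len(mask), len(mask[0])
    result = []
    for by in range(0, height, block):
        row = []
        for bx in range(0, width, block):
            if not used[by // block][bx // block]:
                row.append(0.0)  # outside the valid UV mapping
                continue
            row.append(max(
                mask[y][x]
                for y in range(by, min(by + block, height))
                for x in range(bx, min(bx + block, width))
            ))
        result.append(row)
    return result


# Hypothetical 4x4 blend mask; the lower-left block lies outside the UV mapping.
example_mask = [
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.9, 0.1],
    [0.0, 0.0, 0.0, 0.3],
]
example_used = [[True, True], [False, True]]
low_res_mask = downsample_max(example_mask, 2, example_used)
```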

1.5.3 Weighted Sum Representation


At this step, we have several downsampled texture blend masks defined in the specified
order used for the alpha blending. We transform the texture blend masks to a normal-
ized weighted form using Algorithm 1.1. The normalized weighted representation also
helps us to discard texels that are entirely covered by other materials and do not con-
tribute to the final composition.

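For (every texel(X,Y) in opaque layer)
{
    If (texel(X,Y) is empty in opaque layer)
    {
        skip texel
    }
    // accum is the fraction still visible through the layers above
    accum = 1.0
    For (every input texture blend mask, from the top-most layer to the bottom-most)
    {
        alpha = blend_mask(X,Y)
        layer_weight(X,Y) = alpha * accum
        accum = accum * (1.0 - alpha)
    }
    // The remaining visibility is the weight of the opaque layer
    opaque_weight(X,Y) = accum
}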

Algorithm 1.1. Converting the texture blend mask to a normalized weighted form.

1.5.4 Undirected Graph Representation


At this step, we move from a bitmap representation of input data to an undirected graph
representation. The advantage of a graph representation in comparison with a bitmap
representation is that we can use graph theory for analyzing and building clusters with
specific characteristics. For each texel, we find all the texture blend masks that have
contributed to a given texel, and assign a unique texel identifier that corresponds to the
unique combination of blend masks used. Then we find the connected area using an
algorithm similar to a flood-fill algorithm and make a separate graph vertex from each
unique combination. The result produced by the algorithm is shown in Figure 1.6. For
each resulting graph vertex, we store an assigned identifier that encodes which blend
masks were used to build this graph vertex.
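
The conversion from bitmap data to graph vertices can be sketched with a breadth-first flood fill. The Python below is an illustration only: it assumes 4-connectivity, encodes the combination of contributing blend masks as a bitmask, and leaves texels with no contributing mask unlabeled; all names are hypothetical.

```python
from collections import deque


def build_graph_vertices(layer_weights):
    """Label connected regions that share the same combination of blend masks.

    layer_weights[k][y][x] is the low-resolution weight of layer k.
    Returns (labels, identifiers): labels[y][x] is a vertex index (-1 when no
    mask contributes) and identifiers[v] is the bitmask of layers used by v.
    """
    height, width = len(layer_weights[0]), len(layer_weights[0][0])
    combo = [
        [sum(1 << k for k, layer in enumerate(layer_weights) if layer[y][x] > 0.0)
         for x in range(width)]
        for y in range(height)
    ]
    labels = [[-1] * width for _ in range(height)]
    identifiers = []
    for sy in range(height):
        for sx in range(width):
            if combo[sy][sx] == 0 or labels[sy][sx] != -1:
                continue
            vertex = len(identifiers)
            identifiers.append(combo[sy][sx])
            labels[sy][sx] = vertex
            queue = deque([(sx, sy)])
            while queue:  # flood fill over texels with the same combination
                x, y = queue.popleft()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and labels[ny][nx] == -1
                            and combo[ny][nx] == combo[sy][sx]):
                        labels[ny][nx] = vertex
                        queue.append((nx, ny))
    return labels, identifiers


# Two hypothetical layers on a 1x4 strip that overlap in the second texel.
strip_layers = [[[1.0, 1.0, 0.0, 0.0]], [[0.0, 1.0, 1.0, 0.0]]]
strip_labels, strip_ids = build_graph_vertices(strip_layers)
```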

Figure 1.6. Splitting bitmap data to the graph vertices. Red, Green and Blue circles represent
areas covered by the different texture blend masks.

1.5.5 Texture Filtering Requirement Analysis


At this step, we build edges between graph vertices according to the following rule:
if two graph vertices have adjacent pixels that are used for texture filtering, these
vertices are connected.
As a result, we build an undirected graph G = (V, E) from the source
bitmap data. V represents the areas affected by the different combinations of input
texture blend masks. E represents the texture filtering relationships between these areas as
shown in Figure 1.7. The number of texels used for texture filtering determines the
edge weight and will be used later for the graph partitioning.
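
Edge construction can be sketched by counting adjacent texel pairs that belong to different graph vertices. This Python sketch assumes a label map in which `-1` marks unlabeled texels and only considers horizontal and vertical neighbors; a real bilinear footprint also touches diagonal neighbors, so treat the neighborhood as a simplifying assumption.

```python
from collections import Counter


def build_edges(labels):
    """Count adjacent texel pairs between different graph vertices."""
    height, width = len(labels), len(labels[0])
    edges = Counter()
    for y in range(height):
        for x in range(width):
            a = labels[y][x]
            # Looking only right and down visits every neighbor pair once.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < width and ny < height:
                    b = labels[ny][nx]
                    if a != -1 and b != -1 and a != b:
                        edges[(min(a, b), max(a, b))] += 1
    return edges


strip_edges = build_edges([[0, 1, 2, -1]])
stacked_edges = build_edges([[0, 0], [1, 1]])
```

The counts play the role of the edge weights: the more texel pairs are filtered across two regions, the more expensive it is to separate them later.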

Figure 1.7. Resulting undirected graph. Arrows represent edges which indicate the texture fil-
tering relationships between vertices.

1.5.6 Finding the Number of Connected Components


The resulting undirected graph usually has several connected components. Our goal is
to find the number of connected components and split the graph into several subgraphs
between which no filtering is required (see Figure 1.8). Each subgraph is processed as
an independent graph for the next algorithm steps. If the resulting graph already fits
our initial requirements and does not exceed the maximum allowed number of
materials, then the next step is redundant and can be skipped. Otherwise, the next
algorithm step splits the graph into several subgraphs with specific properties defined by
our initial requirements.
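
Finding the connected components is a standard graph operation. The sketch below uses union-find in Python; the function name and the edge-list input are assumptions made for illustration, and any breadth-first or depth-first traversal would serve equally well.

```python
def connected_components(vertex_count, edges):
    """Group graph vertices into connected components with union-find."""
    parent = list(range(vertex_count))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    components = {}
    for v in range(vertex_count):
        components.setdefault(find(v), []).append(v)
    return sorted(components.values())


# Hypothetical graph: vertices 0-2 are chained, vertices 3-4 form a second part.
parts = connected_components(5, [(0, 1), (1, 2), (3, 4)])
```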

Figure 1.8. A resulting graph with two connected components.

1.5.7 Solving the Graph Partitioning Problem


At this step, we need to solve the graph partitioning problem and split a graph
G = (V, E), where V is the set of vertices and E is the set of edges, into smaller components
with specific properties. Graph partitioning problems are typically NP-hard, so
we use heuristics and approximations to solve them. To
find a good solution, we use an iterative greedy algorithm that looks for a set of edges
with minimal weight to cut. Our solution is inspired by the heuristic graph
partitioning algorithm proposed by Kernighan and Lin [1970].
Our goal is to divide the graph into subsets A and B where subset A satisfies the initial
requirements, and the sum of edge weights from A to B is minimized. Since the weight
of the edge is the number of pixels used for the texture filtering, by minimizing the sum
of edge weights from A to B we reduce the resulting filtering error. Our multi-pass
algorithm maintains and improves a partition, using a greedy algorithm in each pass to
pair up vertices of A with vertices of B, so that moving the paired vertices from one
side of the partition to the other improves the partitioning.

Next, our algorithm chooses the best solution from all solutions that have been
tried. Thus our algorithm attempts to find a subset A which has a minimum
sum of edge weights to cut. See Algorithm 1.2 for implementation details and
Figure 1.9 for one iteration step of the algorithm.

1.5.8 Generating the Final Data


At the final step, we have a set of undirected graphs, where each graph fits our initial
requirements. In practice, many resulting graphs use fewer materials than the maxi-
mum allowable number. To reduce the resulting number of clusters, we combine such
graphs into larger ones as long as the merged results satisfy the initial requirements.
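
The merging step can be sketched as a bin-packing style pass over the material sets of the resulting graphs. The Python below uses a first-fit-decreasing heuristic, which is an assumption made for illustration; the text does not state which merging strategy the implementation uses, and `merge_clusters` is a hypothetical name.

```python
def merge_clusters(material_sets, max_materials):
    """Greedily merge material sets while each union stays within the limit."""
    merged = []
    for materials in sorted(material_sets, key=len, reverse=True):
        for cluster in merged:
            if len(cluster | materials) <= max_materials:
                cluster |= materials  # reuse an existing cluster
                break
        else:
            merged.append(set(materials))  # start a new cluster
    return merged


# Four hypothetical graphs, each described by the materials it uses.
clusters = merge_clusters([{0, 1}, {2}, {1, 3}, {4, 5, 6, 7}], 4)
```

Merging graphs that come from different connected components does not introduce filtering errors, because no texture filtering is required between them.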

ForEach (source layers identifier existing in the graph G(V,E))
{
    Begin (split graph into initial subsets A and B)
    {
        Add all graph vertices with same identifier to subset A.
        Add all other graph vertices into subset B.
    }
    Loop
    {
        Calculate sum of edge weights between subsets A and B
        and store as solution.

        Find a vertex inside subset B that has the largest sum of
        edges crossing between subsets and can be moved to
        subset A without violating our constraints.

        If (such vertex found)
        {
            Move all vertices with same identifier as the found vertex
            from subset B to subset A.
        }
        Else
        {
            break
        }
    }
}

Return solution with the minimal weight between subsets.

Algorithm 1.2. Graph partitioning algorithm.
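
Algorithm 1.2 can be sketched in Python as shown below. This is an interpretation under stated assumptions rather than the shipped implementation: identifiers are bitmasks of the blend masks used by each vertex, the constraint is the number of distinct materials in subset A, and ties are resolved in favor of the first solution found. All names are hypothetical.

```python
def partition_graph(identifiers, edges, max_materials):
    """Greedy split in the spirit of Algorithm 1.2.

    identifiers[v]: bitmask of the blend masks used by graph vertex v.
    edges: {(a, b): weight}, the number of texels filtered across a and b.
    Returns (cut_weight, subset_a) for the cheapest split that was tried.
    """
    vertices = range(len(identifiers))

    def material_count(subset):
        mask = 0
        for v in subset:
            mask |= identifiers[v]
        return bin(mask).count("1")

    def cut_weight(subset_a):
        return sum(w for (a, b), w in edges.items()
                   if (a in subset_a) != (b in subset_a))

    def crossing_weight(v, subset_a):
        return sum(w for (a, b), w in edges.items()
                   if (a == v and b in subset_a) or (b == v and a in subset_a))

    best = None
    for seed in sorted(set(identifiers)):
        subset_a = {v for v in vertices if identifiers[v] == seed}
        if material_count(subset_a) > max_materials:
            continue  # a single identifier must already fit into one cluster
        while len(subset_a) < len(identifiers):
            weight = cut_weight(subset_a)
            if best is None or weight < best[0]:
                best = (weight, frozenset(subset_a))
            # Pick the vertex in B most strongly tied to A that still fits.
            group, gain = None, 0
            for v in vertices:
                if v in subset_a:
                    continue
                same_id = {u for u in vertices if identifiers[u] == identifiers[v]}
                if material_count(subset_a | same_id) > max_materials:
                    continue
                tie = crossing_weight(v, subset_a)
                if tie > gain:
                    group, gain = same_id, tie
            if group is None:
                break
            subset_a |= group
    return best


# Hypothetical chain of five vertices over three blend masks, two per cluster.
example_ids = [0b001, 0b011, 0b010, 0b110, 0b100]
example_edges = {(0, 1): 5, (1, 2): 4, (2, 3): 1, (3, 4): 6}
weight, subset_a = partition_graph(example_ids, example_edges, 2)
```

In this example the cheapest cut is the single weight-1 edge between vertices 2 and 3. The remaining vertices form subset B, which would be processed again if it still exceeded the limit.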



Figure 1.9. One step of the graph partitioning algorithm. The vertex V5 is moved from subset B
into subset A.

Next, we build a normalized weighted representation as described in Section 1.5.3,
but this time for the full-resolution texture data. For each resulting graph, we copy all texels
used by the graph vertices from the weight textures into separate channels of the
Cluster Weights texture. Then we store the index of the graph used into a Cluster
Indirection texture using the R8_UINT format. Next, for the Cluster Weights, we generate a partial
mipmap chain which is limited by the number of supported mipmaps and then
compress the resulting texture using the BC1 or BC3 texture format. See Figure 1.10 for
an example set of resulting textures.

1.5.9 Runtime Composition


For the final composition of the material at runtime, we use the following approach:

• Fetch the encoded cluster ID from the indirection texture.



Figure 1.10. Final data generated by our implementation. Cluster weight texture (left) and In-
direction texture (right). Indirection texture is colorized and upscaled by 8 times for demonstra-
tion purposes. Image courtesy of Mail.Ru Group.

• Fetch material blend weights from the weight texture.
• Read the surface properties stored in the Structured Buffer using the fetched
  cluster ID.
• Make the final composition using the obtained surface parameters and material
  blend weights. See Listing 1.1 for an example of a basic compositing shader.
Since we use a normalized weighted representation for storing the blend weights,
we can change the blend weight of any individual material layer and renormalize the
total sum of the weights. This allows us to change the transparency of individual layers
at runtime.
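
Changing the transparency of one layer can be sketched as a scale followed by a renormalization. The Python below is illustrative only: `scale_layer` is a hypothetical helper, and falling back to the original weights when everything is scaled to zero is an assumption, not something the text prescribes.

```python
def scale_layer(weights, layer, factor):
    """Scale one layer's weight and renormalize so the weights sum to one."""
    scaled = list(weights)
    scaled[layer] *= factor
    total = sum(scaled)
    if total == 0.0:
        return list(weights)  # assumption: keep the original weights
    return [w / total for w in scaled]


# Fully hide layer 0; the remaining layers share the freed weight.
hidden = scale_layer([0.5, 0.25, 0.25], 0, 0.0)
```

In a shader the same two steps would run per fragment after the weights have been fetched and reconstructed.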
Textures used for the composition usually have insufficient resolution. To increase
the final resolution of the composition, we use detail textures stored in texture
arrays as described by Hamilton and Brown [2016]. To generate a UV set for detail
textures, we multiply the original UV set by a tiling factor. The tiling factor for a detail UV
set can be specified per material. We use two sets of texture arrays: the first for the surface
parameters (albedo, roughness, metallic) and the second for the normal maps. Each
material used in the composition can use an arbitrary detail map for the surface
properties and an arbitrary detail map for the surface normals. See Figure 1.11 for examples
of the composition with and without detail textures.
For the normal maps blending, we use a weighted blending of partial derivatives
as described in “Blending in Detail” [Barré-Brisebois and Hill 2012]. Additionally, we
can blend only two detail maps with the highest contribution weights at a medium
distance and completely disable detail maps at a long distance to improve the compo-
sition performance.
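
The distance-based simplification of detail maps can be sketched as a small selection function. This Python sketch rests on assumptions that the text does not specify: the two distance thresholds are hypothetical parameters, and the two strongest weights are renormalized so that they still sum to one.

```python
def select_detail_layers(weights, distance, medium_distance, far_distance):
    """Return {layer: weight} for the detail maps to blend at this distance."""
    if distance >= far_distance:
        return {}  # detail maps are disabled at long distance
    layers = {k: w for k, w in enumerate(weights) if w > 0.0}
    if distance >= medium_distance:
        # Keep only the two layers with the highest contribution weights.
        top = sorted(layers, key=layers.get, reverse=True)[:2]
        total = sum(layers[k] for k in top)
        layers = {k: layers[k] / total for k in top}
    return layers


example_weights = [0.5, 0.1, 0.4, 0.0]
near = select_detail_layers(example_weights, 10.0, 30.0, 100.0)
medium = select_detail_layers(example_weights, 50.0, 30.0, 100.0)
far = select_detail_layers(example_weights, 150.0, 30.0, 100.0)
```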

struct SurfaceParameters
{
    float3 albedo;
};

struct ClusterParameters
{
    SurfaceParameters layer0;
    SurfaceParameters layer1;
    SurfaceParameters layer2;
    SurfaceParameters layer3;
};

// Weights texture
Texture2D cWeights;
// Indirection texture
Texture2D cIndirection;
// Material parameters (stored per cluster)
StructuredBuffer<ClusterParameters> clusterParameters;

// Samplers: trilinear for the weights, point for the indirection texture
SamplerState samplerTrilinear;
SamplerState samplerPoint;

float4 DecodeAndComposition(float2 uv : TEXCOORD0) : SV_Target0
{
    float4 weights;

    // Fetch the three stored weights
    weights.xyz = cWeights.Sample(samplerTrilinear, uv).rgb;

    // Reconstruct the fourth weight (all weights sum to one)
    weights.w = 1.0 - weights.x - weights.y - weights.z;

    // Fetch the cluster index
    uint clusterIndex = cIndirection.Sample(samplerPoint, uv).r;

    // Get the material parameters of this cluster
    ClusterParameters params = clusterParameters[clusterIndex];

    // Use the material parameters and weights
    // for the final composition
    float3 albedo = params.layer0.albedo * weights.x +
                    params.layer1.albedo * weights.y +
                    params.layer2.albedo * weights.z +
                    params.layer3.albedo * weights.w;

    return float4(albedo, 1.0);
}

Listing 1.1. An example of cluster decoding.



Figure 1.11. Rendering using detail maps (left) and without detail maps (right). Image courtesy
of Mail.Ru Group.

1.5.10 Source Code


For demonstration purposes, we implemented our method using C# for preprocessing
and the Unity game engine by Unity Technologies for the runtime materials composi-
tion. Full source code can be found in the supplemental materials of this book.
Source code is also available at https://github.com/SergeyMakeev/GpuZen2.

1.6 Results
We used the approach described in this article to render the armored vehicles in
the Armored Warfare game. In our game, users can customize the coloring and materials
used for rendering the armored vehicles. The presented technique minimizes
the number of textures stored on disk while supporting high-quality textures and
customization of the visual appearance. Figure 1.12 shows some results of
our technique. Table 1.1 compares the number of instructions generated by our
technique with the number generated by the Unreal Engine 4 material layering
technique. Table 1.2 shows the build times and the number of resulting material
clusters for the armored vehicle.

1.7 Conclusion and Future Work


The method described in this chapter makes it possible to store and use, efficiently, more
texture blend masks than existing methods allow, subject to some natural limitations.

Figure 1.12. An example of composited material and some material templates used in compo-
sition (Top) and the same composited material with different material templates applied (Bot-
tom). Image courtesy of Mail.Ru Group.

Technique                                Per-pixel   Total    Per-pixel         ALU   TEX
                                         layers      layers   detail textures
Unreal Engine 4 – MatLayerBlendSimple    4           4        4                 82    14
Unreal Engine 4 – Custom material        3           3        0                 23    2
Spatial Clustering Encoding – High       5           9+       5                 140   13
Spatial Clustering Encoding – Standard   4           9+       4                 104   11
Spatial Clustering Encoding – Fast       4           9+       1                 42    4
Spatial Clustering Encoding – Fastest    3           9+       0                 26    3
Table 1.1. Number of instructions resulting for different layered material techniques.

Asset    Input material   Build   Mip     Graph vertex   Cluster   Memory used for
         count            time    count   count          count     cluster parameters
Turret   9                1.9 s   3       1791           5         560 bytes
Hull     7                1.4 s   3       3937           3         336 bytes
Cannon   7                0.5 s   3       262            4         448 bytes
Wheels   6                0.9 s   3       1422           3         336 bytes
Tracks   5                0.2 s   3       268            1         112 bytes
Table 1.2. Build time and resulting cluster statistics for example asset.

Also, the proposed method supports real-time material recomposition using the proposed
data representation. For the most effective use of our method, the texture-space
connectivity of the different texture blend masks must be taken into account at
the earliest stages of the art pipeline. At the same time, the proposed method is suitable
for any existing art assets without additional preparation, at the cost of some tolerable
texture filtering errors. We continue to develop and refine the proposed technique. Here
are some areas for further development:

• Nonlinear blending for the material composition, as proposed by Hardy and
McRoberts [2006].

• Reducing the texture filtering errors when dividing clusters. Since the texture
blend masks are order independent, we can swap the texture channels inside the
cluster. Using the least squares minimization technique along the “seams”
boundary, as proposed by Iwanicki [2013], we can reduce the texture filtering
error almost to zero.

• Composition after evaluating the BRDF instead of composing the surface properties.
Using this approach, we can create accurate multi-layered materials with
multiple specular lobes.

• Using the vertex color as a blend weight modifier for local dynamic material
recomposition (dynamic dirt, scratches, etc.).

Acknowledgments
First, I would like to thank Vladimir Egorov, my friend and colleague, for his suggestions and
early feedback on this article. I also thank Peter Sikachev, Vadim Slyusarev, Bonifacio
Costiniano, and Alexandre Chekroun for their feedback on this article. In addition, I would
like to thank all Allods Team members.

Bibliography
BARRÉ-BRISEBOIS, C. AND HILL, S. 2012. Blending in Detail. URL:
http://blog.selfshadow.com/publications/blending-in-detail.
BLOOM, C. 2000. Terrain Texture Compositing by Blending in the Frame-Buffer. URL:
http://www.cbloom.com/3d/techdocs/splatting.txt.
DEGUY, S., OLGUIN, R., AND SMITH, B. 2016. Texturing Uncharted 4: a matter of Substance.
Game Developers Conference 2016.
HAMILTON, A. AND BROWN, K. 2016. Photogrammetry and Star Wars Battlefront. Game De-
velopers Conference 2016.
HARDY, A. AND MCROBERTS, D. 2006. Blend maps: enhanced terrain texturing. In SAICSIT
2006.
INSIDE UNREAL. 2013. A Look at Unreal Engine 4 Layered Materials. URL:
https://www.unrealengine.com/news/look-at-unreal-engine-4-layered-materials.
IWANICKI, M. 2013. Lighting technology of The Last of Us. SIGGRAPH ’13.
KARIS, B. 2013. Real Shading in Unreal Engine 4 : Physically Based Shading in Theory and
Practice. SIGGRAPH ’13.
KERNIGHAN, B. AND LIN, S. 1970. An efficient heuristic procedure for partitioning graphs. In
The Bell System Technical Journal, 49, pp. 291–307.
MITTRING, M. 2008. Advanced Virtual Texture Topics. SIGGRAPH ’08.

NEUBELT, D. AND PETTINEO, M. 2013. Crafting a Next-Gen Material Pipeline for The Order:
1886. SIGGRAPH ’13.
NOGUER, J. 2016. The Next Frontier of Texturing Workflows. URL:
https://www.allegorithmic.com/blog/next-frontier-texturing-workflows.
2

Procedural Stochastic Textures by Tiling and Blending
Thomas Deliot and Eric Heitz

2.1 Introduction
Heitz and Neyret [2018] recently introduced a new by-example procedural texturing
method for stochastic textures, typically natural textures such as moss, granite, sand,
bark, etc. Their algorithm takes as input a small texture example and synthesizes an
infinite output with the same appearance, as in Figure 2.1. The algorithm is a simple

Figure 2.1. Procedural stochastic textures by tiling and blending. Our algorithm runs in a frag-
ment shader that requires no more than 4 texture fetches and a few computations. It can be effi-
ciently integrated into a rendering engine.


tiling-and-blending scheme augmented by a histogram-preserving blending operator
that prevents the visual artifacts caused by linear blending. The cornerstone of the
implementation is thus this new blending operator, which requires dedicated
precomputations. In this chapter, we investigate the details of a practical implementation
of this algorithm, with some improvements over the original article. The chapter
comes with a C++ OpenGL demo, from which the code snippets are extracted.

2.2 Tiling and Blending


The fragment shader of our tiling-and-blending algorithm is illustrated in Figure 2.2.
We partition the uv space on a triangle grid and compute the local triangle and the
barycentric coordinates inside the triangle. We use a hash function to associate a ran-
dom offset with each vertex of the triangle grid and use this random offset to fetch the
example texture. Finally, we blend the result using the barycentric coordinates as blend-
ing weights. This method is fast because each pixel requires only a few computations
and 3 texture fetches. The implementation is provided in Listing 2.1.

2.2.1 Tiling
In this section, we provide the implementation of the functions required for the tiling
part of the algorithm in Listing 2.1.

Triangle grid. We use the equilateral-triangle lattice introduced in Simplex Noise [Per-
lin 2001]. Listing 2.2 provides the function that, for a given point in uv space, computes

(a) Example (b) Tiling and blending


Figure 2.2. Tiling and blending. Each pixel is obtained by blending three tiles from the
example.

(a) (b) (c)


Figure 2.3. Results of tiling and blending. The tricky part of the algorithm is the blending
operator. (a) Example image. (b) Linear tiling and blending. (c) Histogram-preserving tiling and
blending.

// Example texture ("input" is a reserved word in GLSL, hence the name)
uniform sampler2D inputTex;

vec3 ProceduralTilingAndBlending(vec2 uv)
{
    // Get triangle info
    float w1, w2, w3;
    ivec2 vertex1, vertex2, vertex3;
    TriangleGrid(uv, w1, w2, w3, vertex1, vertex2, vertex3);

    // Assign random offset to each triangle vertex
    vec2 uv1 = uv + hash(vec2(vertex1));
    vec2 uv2 = uv + hash(vec2(vertex2));
    vec2 uv3 = uv + hash(vec2(vertex3));

    // Precompute UV derivatives
    vec2 duvdx = dFdx(uv);
    vec2 duvdy = dFdy(uv);

    // Fetch input
    vec3 I1 = textureGrad(inputTex, uv1, duvdx, duvdy).rgb;
    vec3 I2 = textureGrad(inputTex, uv2, duvdx, duvdy).rgb;
    vec3 I3 = textureGrad(inputTex, uv3, duvdx, duvdy).rgb;

    // Linear blending
    vec3 color = w1 * I1 + w2 * I2 + w3 * I3;

    return color;
}

Listing 2.1. Tiling and blending.



the vertices of its containing triangle and its barycentric coordinates $w_1$, $w_2$, $w_3$ inside
this triangle. With this partitioning of the uv space, each vertex is associated with a
hexagonal tile chosen randomly in the input image such that each point is covered by
exactly 3 tiles and each tile is weighted by a function falling to 0 at the borders and
such that the sum of the weights equals 1 everywhere ($w_1 + w_2 + w_3 = 1$). Note that
the constant $2\sqrt{3}$ controls the size of the input with respect to the size of the tiles. With
this value, the height of a hexagonal tile is half the size of the input texture, which
works well in general. This parameter can be adjusted depending on the input. Using
larger tiles (decreasing the constant) captures more large-scale features but is more
prone to visible repetitions. Using smaller tiles (increasing the constant) increases the
variety of the tiles but misses large-scale features.

// Compute local triangle barycentric coordinates and vertex IDs


void TriangleGrid(vec2 uv,
out float w1, out float w2, out float w3,
out ivec2 vertex1, out ivec2 vertex2, out ivec2 vertex3)
{
// Scaling of the input
uv *= 3.464; // 2 * sqrt(3)

// Skew input space into simplex triangle grid


const mat2 gridToSkewedGrid = mat2(1.0, 0.0,
-0.57735027, 1.15470054);
vec2 skewedCoord = gridToSkewedGrid * uv;

// Compute local triangle vertex IDs and local
// barycentric coordinates
ivec2 baseId = ivec2(floor(skewedCoord));
vec3 temp = vec3(fract(skewedCoord), 0);
temp.z = 1.0 - temp.x - temp.y;
if (temp.z > 0.0)
{
w1 = temp.z;
w2 = temp.y;
w3 = temp.x;
vertex1 = baseId;
vertex2 = baseId + ivec2(0, 1);
vertex3 = baseId + ivec2(1, 0);
}
else
{
w1 = -temp.z;
w2 = 1.0 - temp.y;
w3 = 1.0 - temp.x;

vertex1 = baseId + ivec2(1, 1);


vertex2 = baseId + ivec2(1, 0);
vertex3 = baseId + ivec2(0, 1);
}
}

Listing 2.2. Computing the local triangle vertices and barycentric coordinates.
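For checking the triangle-grid logic outside a shader, Listing 2.2 can be ported to the CPU. The following is a minimal sketch under the assumption that a plain C++ port is useful for testing; the names TriangleInfo and TriangleGridCpu are hypothetical and not part of the demo.

```cpp
#include <cmath>

struct TriangleInfo
{
    float w1, w2, w3;                       // barycentric weights
    int vertex1[2], vertex2[2], vertex3[2]; // integer vertex IDs
};

// CPU port of Listing 2.2 (hypothetical helper for testing).
TriangleInfo TriangleGridCpu(float u, float v)
{
    // Scaling of the input, 2 * sqrt(3)
    u *= 3.464f;
    v *= 3.464f;

    // Same product as the column-major GLSL matrix
    // mat2(1.0, 0.0, -0.57735027, 1.15470054) * uv
    const float sx = u - 0.57735027f * v;
    const float sy = 1.15470054f * v;

    const float fx = std::floor(sx), fy = std::floor(sy);
    const int bx = static_cast<int>(fx), by = static_cast<int>(fy);
    const float tx = sx - fx, ty = sy - fy; // fract(skewedCoord)
    const float tz = 1.0f - tx - ty;

    TriangleInfo t;
    if (tz > 0.0f)
    {
        t.w1 = tz;  t.w2 = ty;  t.w3 = tx;
        t.vertex1[0] = bx;     t.vertex1[1] = by;
        t.vertex2[0] = bx;     t.vertex2[1] = by + 1;
        t.vertex3[0] = bx + 1; t.vertex3[1] = by;
    }
    else
    {
        t.w1 = -tz;  t.w2 = 1.0f - ty;  t.w3 = 1.0f - tx;
        t.vertex1[0] = bx + 1; t.vertex1[1] = by + 1;
        t.vertex2[0] = bx + 1; t.vertex2[1] = by;
        t.vertex3[0] = bx;     t.vertex3[1] = by + 1;
    }
    return t;
}
```

In both branches the three weights are non-negative and sum to 1, which is the partition-of-unity property the blending step relies on.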

Hash function. We use the hash function given in Listing 2.3 to associate a random
offset with each vertex of the triangle grid and use it to fetch the example texture. The
choice of the hash function does not really matter as long as it provides enough ran-
domness and does not introduce visible correlations between neighboring tiles.

vec2 hash(vec2 p)
{
return fract(sin((p) * mat2(127.1, 311.7, 269.5, 183.3))
* 43758.5453);
}

Listing 2.3. The hash function used to randomize the tiles.
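The hash of Listing 2.3 can also be evaluated on the CPU, for instance to inspect the offsets it produces. This is a sketch only: Vec2, Fract, and HashCpu are hypothetical names, and the low-order digits may differ from a GPU because sin is not evaluated identically on every platform.

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// GLSL fract(): fractional part, x - floor(x)
static float Fract(float x) { return x - std::floor(x); }

// CPU port of Listing 2.3 (hypothetical helper).
Vec2 HashCpu(float px, float py)
{
    // In GLSL, p * mat2(a, b, c, d) treats p as a row vector:
    // result.x = dot(p, (a, b)) and result.y = dot(p, (c, d)).
    const float hx = px * 127.1f + py * 311.7f;
    const float hy = px * 269.5f + py * 183.3f;
    return { Fract(std::sin(hx) * 43758.5453f),
             Fract(std::sin(hy) * 43758.5453f) };
}
```

Note that the vertex (0, 0) always maps to the offset (0, 0) with this hash, since sin(0) = 0; every other vertex receives a pseudo-random offset in the unit square.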

Fetching the example texture. We fetch the input texture with mipmapping and
anisotropic filtering like a conventional texture. Note that the hardware uses screen-space
derivatives to compute the mipmap level and parameterize its anisotropic filter.
Typically, these derivatives are computed with finite differences between neighboring
pixels of the uv positions passed as arguments to the texture function. In our case,
these screen-space derivatives are broken by the random offsets if neighboring pixels
are not in the same triangle. To avoid this problem, in Listing 2.1 we compute the
uv derivatives before adding the random offsets and we pass them explicitly to the
textureGrad function.

2.2.2 Blending
In this section, we address the blending part of the algorithm in Listing 2.1.

The problem of linear blending. Listing 2.1 implements a classic linear blending
operator:

$$I = w_1 I_1 + w_2 I_2 + w_3 I_3. \tag{2.1}$$

Unfortunately, it does not yield satisfying results, as shown in Figure 2.3(b). The result
has heterogeneous contrast and exhibits a grid-revealing pattern. Heitz and Neyret
explain that the problem of linear blending is that it does not preserve the statistical
properties of the input, i.e., its histogram. The problem is thus to find a blending operator
that preserves the histogram.

Variance-preserving blending. Heitz and Neyret notice that in the special case where
the input has a Gaussian histogram, variance-preserving blending preserves the
Gaussian histogram. The expression of this operator is

$$G = \frac{w_1 G_1 + w_2 G_2 + w_3 G_3 - \mathbb{E}[G]}{\sqrt{w_1^2 + w_2^2 + w_3^2}} + \mathbb{E}[G], \tag{2.2}$$

where the expectation $\mathbb{E}[G]$ is the average color of the Gaussian input.
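The role of the normalization in Equation (2.2) can be seen from a short derivation. It relies on an idealization that is our assumption here, not a statement from the chapter: the three fetched values $G_1$, $G_2$, $G_3$ are treated as independent and as sharing the variance $\sigma^2$ of the Gaussian input.

```latex
\operatorname{Var}\left(w_1 G_1 + w_2 G_2 + w_3 G_3\right)
  = \left(w_1^2 + w_2^2 + w_3^2\right)\sigma^2
  \le \sigma^2
  \quad \text{since } w_i \ge 0 \text{ and } w_1 + w_2 + w_3 = 1 .
```

At a triangle vertex (one weight equal to 1) the linear blend keeps the full variance $\sigma^2$, while at the triangle center ($w_i = 1/3$) it drops to $\sigma^2/3$. This spatially varying loss of variance is consistent with the heterogeneous contrast of linear blending, and dividing the centered blend by $\sqrt{w_1^2 + w_2^2 + w_3^2}$ restores the variance $\sigma^2$ everywhere under the stated assumption.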

Histogram-preserving blending. To generalize this idea to arbitrary non-Gaussian
inputs, Heitz and Neyret use a histogram transformation $T$ that makes the input
Gaussian, blend with the variance-preserving blending of Equation (2.2), and finally apply
the inverse histogram transformation $T^{-1}$. The overview of this algorithm is provided
in Figure 2.4. This operator provides better results than linear blending, as shown in
Figure 2.3(c). The remainder of this section is dedicated to the implementation of this
operator in the tiling-and-blending algorithm. For more details on histogram-preserving
blending, we refer the reader to the original article [Heitz and Neyret 2018].

Precomputations. The histogram-preserving version of the tiling-and-blending
algorithm requires the Gaussian version of the input $T(I)$ and the inverse histogram
transformation $T^{-1}$. We pass them to the fragment shader as textures in Listing 2.4.
Section 2.3 is dedicated to the precomputation of these textures.

uniform sampler2D Tinput; // Gaussian input T(I)


uniform sampler2D invT; // Inverse histogram transformation T^{-1}

Listing 2.4. Textures for histogram-preserving blending.

Fragment shader. We update the blending step of Listing 2.1 with the instructions
provided in Listing 2.5. Instead of sampling the original input, we sample the Gaussian
input stored in texture Tinput and we use the variance-preserving blending operator
of Equation (2.2). Finally, we apply the inverse histogram transformation by fetching
the precomputed look-up table stored in texture invT.

Figure 2.4. Tiling and blending with histogram-preserving blending.



// Sample Gaussian values from transformed input


vec3 G1 = textureGrad(Tinput, uv1, duvdx, duvdy).rgb;
vec3 G2 = textureGrad(Tinput, uv2, duvdx, duvdy).rgb;
vec3 G3 = textureGrad(Tinput, uv3, duvdx, duvdy).rgb;

// Variance-preserving blending
vec3 G = w1 * G1 + w2 * G2 + w3 * G3;
G = G - vec3(0.5);
G = G * inversesqrt(w1 * w1 + w2 * w2 + w3 * w3);
G = G + vec3(0.5);

// Fetch LUT
vec3 color;
color.r = texture(invT, vec2(G.r, 0)).r;
color.g = texture(invT, vec2(G.g, 0)).g;
color.b = texture(invT, vec2(G.b, 0)).b;

return color;
}

Listing 2.5. Implementation of histogram-preserving blending in Listing 2.1.
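The variance-preserving step of Listing 2.5 can be checked in isolation with a scalar CPU version. This is a minimal sketch for one channel; the function name VariancePreservingBlend and the default mean of 0.5 (the average of the target Gaussian used in the listing) are stated assumptions.

```cpp
#include <cmath>

// Scalar version of the variance-preserving blend of Equation (2.2).
// 'mean' is the average of the Gaussian input (0.5 in Listing 2.5).
float VariancePreservingBlend(float g1, float g2, float g3,
                              float w1, float w2, float w3,
                              float mean = 0.5f)
{
    const float linear = w1 * g1 + w2 * g2 + w3 * g3;
    const float norm = std::sqrt(w1 * w1 + w2 * w2 + w3 * w3);
    return (linear - mean) / norm + mean;
}
```

At a triangle vertex, where one weight is 1 and the others are 0, the operator returns the corresponding input value unchanged. Away from the vertices, deviations from the mean are amplified, which compensates for the contrast loss of the linear blend.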

2.3 Precomputing the Histogram Transformations


This section is dedicated to the C++ precomputation of the histogram transformation
$T$ applied to the input image and of the inverse histogram transformation $T^{-1}$ stored
in a look-up table. Both are passed to the fragment shader in Listing 2.4.

2.3.1 Target Gaussian Distribution


As shown in Figure 2.4, $T$ is a histogram transformation that makes the input
distributed as a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ whose Probability Density Function
(PDF) is

$$\mathrm{PDF}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \tag{2.3}$$

To do this, we need to choose the parameters of the Gaussian distribution we will be
using and recall some of its properties.

Parameters. We choose the target Gaussian distribution of parameters $\mu = 1/2$ and
$\sigma^2 = (1/6)^2$. With these parameters, the interval $[0, 1]$ is exactly $\mu \pm 3\sigma$, so the
distribution fits well in $[0, 1]$ and can be stored with 8-bit precision.

Cumulative Distribution Function. The histogram transformation $T$ in Section 2.3.2
requires the Cumulative Distribution Function (CDF) of the Gaussian distribution.
It is the function that computes the quantile values of the distribution at a given
position $x$:

$$\mathrm{CDF}(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right). \tag{2.4}$$

A quantile value $U = \mathrm{CDF}(x)$ is the integral of the distribution below $x$. For instance,
if $U = 0.30$ it means that 30% of the integral is below $x$ and 70% is above.

float CDF(float x, float mu, float sigma)


{
float U = 0.5f * (1 + Erf((x - mu) / (sigma * sqrtf(2.0f))));
return U;
}

Listing 2.6. Cumulative Distribution Function (CDF) of a Gaussian.

Inverse Cumulative Distribution Function. The inverse histogram transformation
$T^{-1}$ in Section 2.3.3 requires the inverse CDF:

$$\mathrm{CDF}^{-1}(U) = \mu + \sigma\sqrt{2}\,\mathrm{erf}^{-1}(2U - 1). \tag{2.5}$$

It computes the quantile $x = \mathrm{CDF}^{-1}(U)$ of a given value $U \in [0, 1]$.

float invCDF(float U, float mu, float sigma)


{
float x = sigma * sqrtf(2.0f) * ErfInv(2.0f * U - 1.0f) + mu;
return x;
}

Listing 2.7. Inverse Cumulative Distribution Function (ICDF) of a Gaussian.
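Listings 2.6 and 2.7 call Erf and ErfInv, which are not shown in this chapter. The C++ standard library provides std::erf but no inverse error function, so a self-contained experiment needs a substitute. The sketch below uses std::erf for the CDF and, as a swapped-in technique, inverts the CDF by bisection instead of evaluating ErfInv; the function names and the search interval of 8 standard deviations are assumptions.

```cpp
#include <cmath>

// Gaussian CDF of Equation (2.4), using the standard library erf.
double GaussianCdf(double x, double mu, double sigma)
{
    return 0.5 * (1.0 + std::erf((x - mu) / (sigma * std::sqrt(2.0))));
}

// Inverse CDF by bisection on the monotonic CDF. This replaces the
// closed form of Equation (2.5) and is not the demo's ErfInv.
double GaussianInvCdf(double U, double mu, double sigma)
{
    double lo = mu - 8.0 * sigma; // search interval (assumption)
    double hi = mu + 8.0 * sigma;
    for (int i = 0; i < 60; ++i)
    {
        const double mid = 0.5 * (lo + hi);
        if (GaussianCdf(mid, mu, sigma) < U)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

Bisection is slower than a closed-form approximation of the inverse error function, but the precomputation runs offline, and the round trip through GaussianCdf makes the result easy to validate.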

2.3.2 Applying the Histogram Transformation T on the Input


In this section, we show how to apply the histogram transformation T on the input
(Step 1 in Figure 2.4). Our algorithm makes each color channel of the input distributed
as the target Gaussian chosen in Section 2.3.1.

Figure 2.5. Histogram transformation of the input. We sort the pixel values I of the input and
we map them to sorted values G from the target Gaussian distribution.

Algorithm. A discrete 1D histogram transformation T is typically done by replacing


sorted values I from the input by the same number of sorted values G from the target
histogram, as shown in Figure 2.5.

Implementation. In Listing 2.8, we start by sorting the values of the input image. For
this purpose, we use a structure PixelSortStruct that stores the coordinates and the
value of a pixel. Then, we go through the sorted list of pixel values and for the $i$-th
element we compute its quantile value $U = (i + 1/2)/N$, where $N$ is the number of pixels.
It means that a fraction $U$ of the list is before this element and a fraction $1 - U$ is after.
We replace the pixel value by the same quantile in the Gaussian distribution using the
inverse CDF of Equation (2.5): $G = \mathrm{CDF}^{-1}(U)$.

void ComputeTinput(TextureDataFloat& input,


TextureDataFloat& Tinput, int channel)
{
// Sort pixels of example image
vector<PixelSortStruct> sortedInputValues;
sortedInputValues.resize(input.width * input.height);

for (int y = 0; y < input.height; y++)


for (int x = 0; x < input.width; x++)
{
sortedInputValues[y * input.width + x].x = x;
sortedInputValues[y * input.width + x].y = y;
sortedInputValues[y * input.width + x].value =
input.GetPixel(x, y, channel);
}

sort(sortedInputValues.begin(), sortedInputValues.end());

// Assign Gaussian value to each pixel


for (unsigned int i = 0; i < sortedInputValues.size() ; i++)
{
// Pixel coordinates
int x = sortedInputValues[i].x;
int y = sortedInputValues[i].y;

// Input quantile (given by its order in the sorting)


float U = (i + 0.5f) / sortedInputValues.size();
// Gaussian quantile
float G = invCDF(U, GAUSSIAN_AVERAGE, GAUSSIAN_STD);
// Store
Tinput.SetPixel(x, y, channel, G);
}
}

Listing 2.8. Applying the histogram transformation T on the input.
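The rank-to-quantile mapping of Listing 2.8 can be tried on a plain array of values. This is a sketch under stated assumptions: it works on a 1D std::vector instead of the demo's TextureDataFloat, the names Gaussianize and InvCdfBisect are hypothetical, and the inverse CDF is computed by bisection on std::erf rather than with the demo's invCDF.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Inverse Gaussian CDF by bisection (stand-in for invCDF).
static double InvCdfBisect(double U, double mu, double sigma)
{
    double lo = mu - 8.0 * sigma, hi = mu + 8.0 * sigma;
    for (int i = 0; i < 60; ++i)
    {
        const double mid = 0.5 * (lo + hi);
        const double cdf =
            0.5 * (1.0 + std::erf((mid - mu) / (sigma * std::sqrt(2.0))));
        if (cdf < U) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// Replaces each value by the Gaussian quantile of its rank.
std::vector<double> Gaussianize(const std::vector<double>& values,
                                double mu = 0.5, double sigma = 1.0 / 6.0)
{
    const std::size_t n = values.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
        [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

    std::vector<double> out(n);
    for (std::size_t rank = 0; rank < n; ++rank)
    {
        const double U = (rank + 0.5) / n; // input quantile
        out[order[rank]] = InvCdfBisect(U, mu, sigma);
    }
    return out;
}
```

For the three values {0.9, 0.1, 0.5}, the middle-ranked value 0.5 has quantile U = 1/2 and maps to the Gaussian mean, while the other two map to values placed symmetrically around it. The ordering of the input is preserved.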

2.3.3 Precomputing the Inverse Histogram Transformation T−1


In this section, we show how to compute the inverse histogram transformation $T^{-1}$ that
maps Gaussian values to values from the input and store it in a look-up table (Step 3 in
Figure 2.4).

Algorithm. The algorithm consists of mapping sorted values, as in the previous section
(Figure 2.5). However, the computation of the values is different. Since we use a
Gaussian distribution that can be well represented in the interval $[0, 1]$, we
parameterize the look-up table on this interval and we associate quantiles of the Gaussian
distribution in $[0, 1]$ with quantiles of the pixel values.

Implementation. In Listing 2.9, we start by sorting the values of the input image. Note
that an optimized implementation could reuse the sorting step of Listing 2.8. Then, we
go through the texels of the look-up table that parameterizes the interval $[0, 1]$ such that
the $i$-th of $N$ texels is associated with the position $x = (i + 1/2)/N$. We compute the
Gaussian quantile value at this position using Equation (2.4): $U = \mathrm{CDF}(x)$, and we pick
the same quantile in the sorted pixel values, i.e., we fetch the $\lfloor U \cdot M \rfloor$-th element in the
sorted list if it has $M$ entries. This is the value that we store in the look-up table.

void ComputeinvT(TextureDataFloat& input,


TextureDataFloat& invT, int channel)
{
// Sort pixels of example image
vector<float> sortedInputValues;
sortedInputValues.resize(input.width * input.height);
for (int y = 0; y < input.height; y++)
for (int x = 0; x < input.width; x++)
{
sortedInputValues[y * input.width + x] =
input.GetPixel(x, y, channel);
}

sort(sortedInputValues.begin(), sortedInputValues.end());

// Generate invT look-up table


for (int i = 0; i < invT.width; i++)
{
// Gaussian value in [0, 1]
float G = (i + 0.5f) / (invT.width);
// Quantile value
float U = CDF(G, GAUSSIAN_AVERAGE, GAUSSIAN_STD);
// Find quantile in sorted pixel values
int index = (int) floor(U * sortedInputValues.size());
// Get input value
float I = sortedInputValues[index];
// Store in LUT
invT.SetPixel(i, 0, channel, I);
}
}

Listing 2.9. Precomputing the inverse histogram transformation $T^{-1}$ and storing it in a look-up
table.
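The look-up table construction of Listing 2.9 can be reproduced on a small array to see which entries are picked. This is a sketch, not the demo code: BuildInverseLut is a hypothetical name, it works on a std::vector for a single channel, and the clamp of the index is an added safeguard for the case where U rounds up to 1, which Listing 2.9 does not include.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds a 1D inverse-transformation LUT from a non-empty list of values.
std::vector<float> BuildInverseLut(std::vector<float> values, int lutWidth,
                                   float mu = 0.5f, float sigma = 1.0f / 6.0f)
{
    std::sort(values.begin(), values.end());
    std::vector<float> lut(lutWidth);
    for (int i = 0; i < lutWidth; ++i)
    {
        // Gaussian value in [0, 1] at the texel center
        const float G = (i + 0.5f) / lutWidth;
        // Quantile of G in the target Gaussian, Equation (2.4)
        const float U =
            0.5f * (1.0f + std::erf((G - mu) / (sigma * std::sqrt(2.0f))));
        // Same quantile in the sorted values (clamped, see lead-in)
        const std::size_t index = std::min(
            values.size() - 1,
            static_cast<std::size_t>(std::floor(U * values.size())));
        lut[i] = values[index];
    }
    return lut;
}
```

With the four values {0.5, 0.0, 1.0, 0.25} and a table of width 4, the texel centers 0.125, 0.375, 0.625, and 0.875 have Gaussian quantiles of roughly 0.012, 0.227, 0.773, and 0.988, so the table contains {0, 0, 1, 1}. A table this small is only illustrative; it shows that the table is monotonic and that the tails of the Gaussian select the extreme input values.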

2.3.4 Discussion
With the fragment shader of Section 2.2 and the precomputations of Section 2.3 we
already have a standalone implementation. However, this implementation has two
shortcomings that cause color problems with some inputs: the histogram transformations
are computed separately for each channel, and mipmapping is incompatible with the use
of a look-up table. Sections 2.4 and 2.5 are dedicated to overcoming
these shortcomings.

2.4 Improvement: Using a Decorrelated Color Space


Our method, as described so far, occasionally produces procedural textures that exhibit
colors that were not present in the example texture, as in Figure 2.6(b). In this section,
we show how to reduce this problem by using a decorrelated color space, as done by
Heeger and Bergen [1995].

2.4.1 The Problem with Color-space Correlations


In Section 2.3, we computed histogram transformations for each color channel sepa-
rately, which occasionally produces wrong colors in the output. Indeed, the histogram
of an RGB image is not the composition of three 1D functions but rather one 3D func-
tion or a 3D point cloud, as shown in Figure 2.6(a). This 3D histogram might have

(a) Input (b) Procedural (RGB space) (c) Procedural (decorrelated space)
Figure 2.6. Improvement: using a decorrelated color space. If the color channels are correlated,
processing them separately might introduce wrong colors (b) that were not present in the input
(a). We reduce this problem by using a color space in which the channels are not correlated (c).

inter-channel correlations, and transforming the channels separately does not preserve
these correlations. For instance, the result of Figure 2.6(b) has the same 1D histogram
as the input for each channel. However, since the inter-channel correlations are not
preserved, the 3D shape of this histogram is not preserved and wrong colors appear in
the result. We obtained the result of Figure 2.6(c) by using a color space in which the
channels are not correlated, such that processing them separately is less prone to this
problem.

2.4.2 Decorrelating the Color Space


We precompute the color space transformation before Step 1 in Figure 2.4 and revert it
at the end of the fragment shader after Step 3 in Figure 2.4.

Precomputation. We start by computing the covariance matrix of the input histogram


and extracting its eigenvectors, which means extracting the principal axes of the point
cloud given by the pixel's colors in the RGB space, as shown in Figure 2.7(a). Along
these eigenvectors, the coordinates of the points are statistically decorrelated. Then, we
compute the bounding box of the point cloud aligned with these eigenvectors and we
find the coordinates of the points in this bounding box, as shown in Figure 2.7(b). With
this parameterization, all the points are parameterized by

$$P = O + v_1 V_1 + v_2 V_2 + v_3 V_3 \quad \text{with} \quad (v_1, v_2, v_3) \in [0, 1]^3, \tag{2.6}$$

where the bounding box is defined by its corner $O$ and its orthogonal axes $V_1$, $V_2$, and
$V_3$.

(a) Eigenvectors (b) Parameterization


Figure 2.7. Parameterization of the decorrelated color space.

Fragment shader. Before returning, the fragment shader transforms the result back
to the original color space with Equation (2.6). This is done in the function
ReturnToOriginalColorSpace provided in Listing 2.10.

uniform vec3 colorSpaceVector1;


uniform vec3 colorSpaceVector2;
uniform vec3 colorSpaceVector3;
uniform vec3 colorSpaceOrigin;

vec3 ReturnToOriginalColorSpace(vec3 color)


{
vec3 result =
colorSpaceOrigin +
colorSpaceVector1 * color.r +
colorSpaceVector2 * color.g +
colorSpaceVector3 * color.b;
return result;
}

Listing 2.10. Return to the original color space in the fragment shader.

2.5 Improvement: Prefiltering the Look-up Table


Our method, as described so far, uses mipmap levels to fetch the Gaussian input in
Step 2 of Figure 2.4. However, when the lower levels of detail are used, comparing the
appearance of the procedural texture to a regular tiling of the input reveals a color
deviation, as shown in Figure 2.8. In this section, we show how to solve this problem by
prefiltering the look-up table.

2.5.1 The Problem with Texture Filtering and Look-up Tables


Classic texture filtering. To understand the problem when filtering our procedural
texture, we look into the equation of classic texture filtering. We define $\mathrm{texture}(uv)$ as

(a) Input (repeat) (b) Procedural (c) Procedural (prefiltered LUT)
Figure 2.8. Improvement: prefiltering the look-up table. The procedural texture uses a look-up
table on top of the mipmapped input. This results in a noticeable color shift as we zoom out (b)
compared to the input (a). We solve this problem by prefiltering the look-up table (c).

Figure 2.9. Classic texture filtering.

the color of the input texture at a position $uv$ and $P$ the domain covered by the pixel
footprint. Figure 2.9 illustrates that the filtered color is the integral of the texture over
the pixel footprint:

$$\int_P \mathrm{texture}(uv) \, \mathrm{d}uv. \tag{2.7}$$

Texture mipmapping (with anisotropic filtering for more accuracy) provides a fast way
to evaluate this integral.

Procedural texture filtering (reference). Our tiling-and-blending method computes
a procedural texture that is the composition of the Gaussian input texture and a
look-up table (LUT) that contains the inverse histogram transformation:

$$\text{procedural texture}(uv) = \mathrm{LUT}(\mathrm{texture}(uv)). \tag{2.8}$$

If we apply Equation (2.7) to this formulation, we obtain the following filtering
equation:

$$\text{filtered procedural texture} = \int_P \mathrm{LUT}(\mathrm{texture}(uv)) \, \mathrm{d}uv. \tag{2.9}$$

As shown in Figure 2.10, this integral can be computed by sampling the values of the
texture over the footprint P, passing the values through the look-up table, and averaging
the results. Unfortunately, this process is too costly, so we would rather use
mipmapping, as for a conventional texture.

Figure 2.10. Filtering the procedural texture. The correct filtering averages the values after the
application of the look-up table. Filtering the texture before and applying the look-up table after
does not produce the same result.

Filtering the procedural texture (wrong). A simple approach consists in using a
mipmapped version of the input texture, fetching a single sample from it as for a
conventional texture, and then passing it through the look-up table, as shown in Figure 2.10.
However, this computes

$$\text{filtered procedural texture} \neq \mathrm{LUT}\left(\int_P \mathrm{texture}(uv) \, \mathrm{d}uv\right), \tag{2.10}$$

which is not the right result because the integral and the look-up table do not commute:

$$\mathrm{LUT}\left(\int_P \mathrm{texture}(uv) \, \mathrm{d}uv\right) \neq \int_P \mathrm{LUT}(\mathrm{texture}(uv)) \, \mathrm{d}uv. \tag{2.11}$$

This inequality explains the color difference between Figure 2.8(a) and (b).
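A two-texel example makes the inequality of Equation (2.11) concrete. The sketch below uses a hypothetical nonlinear look-up table, LUT(t) = t², chosen only for illustration, and an unweighted average as the filter over the footprint.

```cpp
#include <vector>

// Hypothetical nonlinear look-up table, for illustration only.
double Lut(double t) { return t * t; }

// Wrong order: average the texels first, then apply the LUT.
double FilterThenLut(const std::vector<double>& texels)
{
    double sum = 0.0;
    for (double t : texels) sum += t;
    return Lut(sum / texels.size());
}

// Reference order: apply the LUT to each texel, then average.
double LutThenFilter(const std::vector<double>& texels)
{
    double sum = 0.0;
    for (double t : texels) sum += Lut(t);
    return sum / texels.size();
}
```

For a footprint containing the two texel values 0 and 1, filtering first gives LUT(0.5) = 0.25, whereas the reference gives (0 + 1) / 2 = 0.5. The two orders only agree when the look-up table is affine over the range of values in the footprint.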

2.5.2 Alternative Filtering Formulation with a Look-up Table


We use the solution of Heitz et al. [2013] to the problem of filtering procedural textures
with look-up tables (also called “color maps”). Their solution is based on the
observation that the reference result of Equation (2.9) is a weighted average of values from the
look-up table. Hence, the equation can be rewritten

$$\text{filtered procedural texture} = \int \mathrm{LUT}(t) \, H_P(t) \, \mathrm{d}t, \tag{2.12}$$

where $H_P$ gives the weight of each entry of the look-up table. This weight depends on
the distribution of texture values $t$ inside the pixel footprint $P$. The more a value $t$ of
the texture is represented, the more the entry $\mathrm{LUT}(t)$ contributes to the weighted
average. Hence, $H_P$ is the histogram of the values of the texture inside $P$. This
equivalence is shown in Figure 2.11.

Implementation with a prefiltered look-up table. Applying the result of
Equation (2.12) in practice requires estimating $H_P$ for a given footprint $P$ and computing its
product integral with the look-up table. To do this in real time, we approximate $H_P$ by
a Gaussian distribution and use a look-up table prefiltered with a Gaussian filter for
each level of detail of the input texture. The motivation for this approximation is that
the texture effectively has a Gaussian histogram. Hence, the approximation becomes
exact at the highest level of detail and remains reasonable at intermediate levels.

2.5.3 Computing and Fetching the Prefiltered Look-up Table


Precomputation. In our implementation, we prefilter the look-up table in a function
PrefilterLUT. This function creates a 2D look-up table whose width is the same as

Figure 2.11. Alternative filtering formulation with a look-up table. Filtering the texture with
the look-up table is equivalent to convolving the look-up table by the histogram of the texture
values inside the pixel footprint.

the unfiltered look-up table and whose height is the number of levels of detail of the
input texture. For each level of detail $L$ we compute the average variance in all the
subwindows of width $2^L$. At the first level of detail the variance is 0 and at the highest
level of detail the variance is the variance of the full Gaussian texture, which is $(1/6)^2$
as explained in Section 2.3.1. For each level of detail, we filter the look-up table by a
Gaussian filter of the associated variance.
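The per-level filtering step can be sketched as a normalized Gaussian filter applied to one row of the look-up table. This is not the PrefilterLUT function of the demo: the name FilterLutRow is hypothetical, texel i is assumed to sit at position (i + 1/2)/N as in Section 2.3.3, and renormalizing the kernel at the borders of the table is our assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Filters one row of the look-up table with a normalized Gaussian kernel
// of the given variance, expressed in the [0, 1] parameterization.
std::vector<float> FilterLutRow(const std::vector<float>& lut, float variance)
{
    if (variance <= 0.0f)
        return lut; // first level of detail: nothing to filter

    const std::size_t n = lut.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        const float xi = (i + 0.5f) / n;
        float sum = 0.0f, weightSum = 0.0f;
        for (std::size_t j = 0; j < n; ++j)
        {
            const float d = (j + 0.5f) / n - xi;
            const float w = std::exp(-d * d / (2.0f * variance));
            sum += w * lut[j];
            weightSum += w;
        }
        out[i] = sum / weightSum; // renormalize (border handling assumption)
    }
    return out;
}
```

Calling this function once per level of detail, with the variance measured for that level, would fill the rows of the 2D table. A constant table is left unchanged by the filter, and a step-shaped table is smoothed toward its average as the variance grows.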

Fragment shader. We update the fragment shader in Listing 2.11 where we use the
function textureQueryLod to obtain the level of detail of the input texture and we
remap it to a value in $[0, 1]$ to obtain a y coordinate to fetch the look-up table.

// Compute LOD level to fetch the prefiltered look-up table invT
float LOD = textureQueryLod(Tinput, uv).y /
            float(textureSize(invT, 0).y);

// Fetch prefiltered LUT (T^{-1})
vec3 color;
color.r = texture(invT, vec2(G.r, LOD)).r;
color.g = texture(invT, vec2(G.g, LOD)).g;
color.b = texture(invT, vec2(G.b, LOD)).b;

Listing 2.11. Fetching the prefiltered look-up table in the fragment shader.

2.6 Improvement: Using Compressed Texture Formats


In Figure 2.12, we test our algorithm with the DXT1 compressed texture format applied
to the Gaussian version of the input Tinput and the look-up table invT. We notice
that the compression occasionally introduces visible artifacts when it is applied directly
to our textures (Figure 2.12(b)), so a modification is necessary to support a compressed
texture format. The problem is that our histogram transformation makes all the
channels have the same range of Gaussian values. This degrades the quality of the
compression because the compressor optimizes an error that has become equally
distributed among the channels, while the true error should weigh more for channels
with wide ranges. Fixing this issue is simple: instead of using the same Gaussian
distribution $\mathcal{N}(1/2, (1/6)^2)$ for all the channels, we scale the Gaussian
distribution such that its standard deviation around the average $1/2$ becomes
proportional to the actual range of the channel data. We apply this modification just
before sending the data to the DXT compressor and we revert it in the fragment shader.
With this minor modification we were able to fix the issue and safely use the DXT1
format for all our textures (Figure 2.12(c)). Our C++ OpenGL demo provides a binary flag
#define USE_DXT_COMPRESSION that enables these modifications.
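
One way to picture the scaling and its inverse is the following Python sketch. The text only states that the standard deviation becomes proportional to the channel range; the specific factor used here (the channel range divided by the widest channel range) and the function names are assumptions made for illustration, and may differ from what the demo does.

```python
def channel_scale(channel_min, channel_max, widest_range):
    """Hypothetical scale factor in (0, 1]: the range of this channel
    relative to the widest channel range of the input."""
    return (channel_max - channel_min) / widest_range


def scale_gaussian(g, scale):
    """Shrink a Gaussian value around its mean of 1/2 before compression."""
    return 0.5 + (g - 0.5) * scale


def unscale_gaussian(g_scaled, scale):
    """Inverse mapping, as it would be applied in the fragment shader."""
    return 0.5 + (g_scaled - 0.5) / scale
```

A channel with a narrow range then occupies a narrow band around 1/2, so the compressor spends less of its error budget on it. A constant channel would give a scale of zero and needs a guard before the inverse is applied.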

(a) RGB8 (b) DXT1 (c) DXT1 (fixed)


Figure 2.12. Using a compressed texture format. The DXT1 texture format fails with some
inputs if it is applied directly to the textures computed by our algorithm (b). We fix this problem
by scaling the range of the Gaussian texture (c).

2.7 Results
Performance and storage. In Table 2.1, we compare the performance and storage of
our method to those of a classic texture repeat, as in Figures 2.13 and 2.14. On average,
our method is 4–5 times costlier, which makes sense since we fetch the input 3 times, perform
one additional look-up table fetch, and use a few additional operations. The repeated
tiling only requires the storage of the input texture while our method requires the stor-
age of the Gaussian input Tinput and the look-up table invT. Since the Gaussian input
has the same size as the input, the memory overhead of our method is only the storage
of the look-up table, which is small in comparison.

Generative textural space. Our method is dedicated to stochastic textures, such as
the rock in Figure 2.15. It does not produce plausible results if the input presents a
strong pattern-like organization like in Figure 2.16.

Input                Memory                 Performance
Size     Format      Tinput       invT      Repeat       Procedural
64²      RGB8        16 KB        2 KB      0.035 ms     0.179 ms
128²     RGB8        65 KB        3 KB      0.035 ms     0.180 ms
256²     RGB8        262 KB       3 KB      0.036 ms     0.181 ms
512²     RGB8        1048 KB      3 KB      0.039 ms     0.186 ms
1024²    RGB8        4194 KB      4 KB      0.052 ms     0.200 ms
2048²    RGB8        16777 KB     5 KB      0.112 ms     0.341 ms
64²      DXT1        3 KB         1 KB      0.035 ms     0.180 ms
128²     DXT1        11 KB        1 KB      0.035 ms     0.180 ms
256²     DXT1        48 KB        1 KB      0.035 ms     0.180 ms
512²     DXT1        174 KB       1 KB      0.036 ms     0.180 ms
1024²    DXT1        699 KB       1 KB      0.039 ms     0.182 ms
2048²    DXT1        2796 KB      1 KB      0.046 ms     0.207 ms

Table 2.1. Performance and storage comparison. We compare our method to a single texture
fetch in a repeated texture for various sizes of the input texture and storage formats. The classic
repeat requires only the storage of the input texture. Our method requires the storage of the
Gaussian version of the input Tinput, which has the same size as the input, and the look-up
table invT. We measured the performance by rendering a full-screen quad at 1920 × 1080
resolution on a GeForce GTX 980.

2.8 Conclusion
We have presented an implementation of our procedural texturing algorithm that works
well for breaking the repetition of tiled textures. This algorithm is meant to be used
with stochastic textures (moss, granite, sand, etc.) and cannot be used with repetitive
or strongly correlated patterns. It has little memory overhead, works well with the
compressed DXT texture format, and costs about four times as much as classic repeated
texture tiling. Finally, it is straightforward to adapt it to inputs other than RGB color
data, such as the normal map in Figure 2.14.

Acknowledgments
This chapter is the result of Thomas Deliot’s master’s thesis, which was supervised by Eric Heitz.
Both authors conducted this work at Unity Technologies.

(a) Repeat (b) Procedural


Figure 2.13. Comparison of classic texture repeat and our procedural texturing algorithm ap-
plied on the ground texture of a video game scene.

(a) Repeat (b) Procedural


Figure 2.14. Our algorithm applied on non-RGB input. We compare classic texture repeat and
our procedural texturing algorithm on a small-scale skin pore normal map.

Figure 2.15. Our procedural texturing algorithm applied on a rock texture.

Figure 2.16. Failure case of our method. Our method does not produce plausible results if the
input presents a strong pattern-like organization.

Bibliography
HEEGER, D. AND BERGEN, J. 1995. Pyramid-based Texture Analysis/Synthesis. In Proceedings
of ACM SIGGRAPH ’95, pp. 229–238.
HEITZ, E. AND NEYRET, F. 2018. High-Performance By-Example Noise using a Histogram-Preserving
Blending Operator. In Proceedings of the ACM on Computer Graphics and Interactive
Techniques, 1:2, pp. 25.
HEITZ, E., NOWROUZEZAHRAI, D., POULIN, P., AND NEYRET, F. 2013. Filtering Color Mapped
Textures and Surfaces. In Proceedings of the Symposium on Interactive 3D Graphics
and Games 2013, pp. 129–136.
PERLIN, K. 2001. Noise hardware. Real-time shading languages, SIGGRAPH 2001 Course.
3
IV

A Ray Casting Technique for Baked Texture Generation
Alain Galvan and Jeff Russell

Baking is a process of transferring surface data from high-polygon geometry to a texture
meant to be used by low-polygon geometry. By baking, static data such as a
model’s normals, vertex colors, displacement, and even shading terms like ambient
occlusion and diffuse lighting can be efficiently represented on a low-polygon mesh. In
this way, greater detail can be provided with vastly reduced geometry. This process has
been crucial in asset preparation for real-time systems for many years, and will likely
continue to see heavy use.

Figure 3.1. An example scene from Marmoset Toolbag 3 showing a model baked using our
technique.


As baking is generally an offline process, early tools relied on CPU processing for
greater flexibility. Computation times were on the order of minutes and hours. More
recent implementations have used GPU processing to great effect. Increased parallel-
ism as well as improved support for general computing, in particular ray tracing, have
made GPU baking an appealing choice. Modern systems can bake results in seconds
or even in real-time. This chapter outlines a technique to bake geometry on the GPU
with user input, as well as a number of potential pitfalls inherent in the process.
The technology demo that this chapter follows is based on the baker used in Mar-
moset Toolbag. This example can be downloaded at:

https://github.com/alaingalvan/gpu-zen-2-baker/

3.1 Baking in Practice


The primary task of any baking system is finding the corresponding high-polygon sur-
face for a given point on the low-polygon mesh. Typically an artist will provide two
meshes: one high-polygon mesh at full detail (hereinafter referred to as the reference
model), and a low-polygon version of the same model meant for use in a real-time
renderer (hereinafter referred to as the working model) [Teixeira 2008]. The expected
result is an image of the surface of the reference model, laid out according to the texture
coordinates of the working model.
This projection is performed by tracing rays from the surface of the working model
into the reference model. To determine these intersections, it is best to store the refer-
ence model in an acceleration structure such as a k-d tree [Lantz 2013] for efficient
traversal. Once the nearest intersecting triangle has been found for a given ray, the de-
sired vertex properties can be interpolated with the barycentric coordinates of the in-
tersection and returned.
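
The interpolation step at the end of this process can be sketched as follows. This is a minimal Python illustration, not code from the demo; the function name interpolate_attribute and the tuple-based representation of vertex attributes are assumptions.

```python
def interpolate_attribute(bary, a0, a1, a2):
    """Blend a per-vertex attribute (normal, color, ...) of the hit triangle
    with the barycentric coordinates (b0, b1, b2) of the intersection."""
    return tuple(bary[0] * x + bary[1] * y + bary[2] * z
                 for x, y, z in zip(a0, a1, a2))
```

For example, a hit halfway along the edge between the first two vertices has coordinates (0.5, 0.5, 0.0) and returns the average of their attributes. Interpolated normals generally need to be renormalized afterwards.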

Figure 3.2. A model representation of baking, where rays are cast from the working model to
the reference model.

This conceptually straightforward process is complicated considerably by differences
between the reference and working models. While the shape and size of the two
meshes are generally very similar, the reference mesh is not always completely bounded
by the working mesh (or vice versa), and as a result there is some question about how
to find the correct corresponding points between the two. Rays cast from the surface of
the working model may start inside the reference model, and thereby miss their
intended targets on the reference mesh.
A common solution to this problem is to allow the user to specify, either proce-
durally or explicitly with a third mesh, a “cage” that completely surrounds the reference
model and serves as the origin for sample rays. This bounding of the reference mesh
greatly reduces the possibility of incorrect intersections, as it correctly encloses areas
of the reference mesh both “above” and “below” the working model.

Figure 3.3. A model of baking which includes a Cage projected from the working model.

Another issue with baking projection relates to ray directions. In general, sample
rays should point “inward” in the direction of the reference model. However, the spe-
cific direction chosen obviously has a significant effect on the sample’s ultimate result.
A simple approach to determine sample directions is to make sample rays run parallel
to the working model’s polygon normals. This works somewhat well, but has the major
drawback of creating discontinuities along polygon edges, as Figure 3.4 shows.

Figure 3.4. A comparison of baking projection techniques. The left shows ray directions based
off the average normal of that surface, and the right shows rays based off the working model’s
face normal. Note the discontinuities on the edge of different faces.

A more common alternative approach is to use the interpolated vertex normals of
the working mesh to specify sample ray directions. This removes any first-order
discontinuities in the projection, and is a fairly robust default. Most models bake fairly
well with this technique.
In practice, artists will often require a combination of these two techniques for
sample directions. This need arises due to “skewing”, a kind of distortion of the bake
result that occurs as a result of the gradient of change of the sample directions. Details
from the reference model may appear undesirably skewed or otherwise distorted in
these cases. Where this occurs, choosing sample directions closer to the face normals
(as in Figure 3.5) will mitigate the distortion. We introduce the idea of a Skew Map, a
grayscale map that allows users to specify which ray direction to use. The ray’s direc-
tion is thus determined by interpolating between the computed face normal and out-
ward smooth normal on a per-pixel basis.
We also introduce the idea of an Offset Map, which allows users to specify ray
origin offset magnitude for certain parts of the mesh. Users specify a minimum and
maximum offset, and ray origins are determined by interpolating between these min
and max distance values along the required ray direction. In short, the offset map allows
the user fine-grained control over how far the ray origin should “back up” from the
working mesh, creating a de-facto cage enclosure.
A baking system with this degree of user control has proven to be a powerful tool.
Users may paint skew and offset maps in texture coordinate space or in the 3D viewport,
and quickly address baking errors that have been difficult to fix in the past. The
process of baking models remains more art than science, but by exposing the technical
aspects of baking in an intuitive way artists are able to work more effectively.

Figure 3.5. A visualization of a cage determined by a positional offset map and skew map. The
left shows the offset cage colored with its skew map; the right shows the direction of rays
projected from that cage, with the mesh rotated 180 degrees for easier visualization. Red indicates
that rays should point in the direction of the working model’s normal, and green indicates that
rays should point in the direction of the calculated smooth normal.

In summary, the following data are needed to bake a mesh with our approach:

• Output Render Target. The render target that receives the bake, which can vary in
format and size.

• Working Model. The low-polygon geometry used to determine smooth normals
and used as input for any final transformations to output fragments.

• Reference Model. The high-polygon geometry used when searching for ray
collisions.

• Skew Map. A map that allows users to interpolate between raycasting in
the direction of the computed smooth normal and raycasting in the direction of the
working model’s normal.

• Offset Map. A map that defines the offset between the cage and the working model.

• Offset Bounds. A minimum and maximum offset value to offset the cage from
the working model.

3.1.1 Pitfalls
One pitfall we encountered was in the computation of smooth normals. We noticed
that for convex geometry such as holes or corners, the smooth direction would point
away from the reference model. To fix this for a given vertex, the starting ray direction
should be the average normal if its dot product with that vertex’s normal is greater
than 0; otherwise, it should be reflected by that vertex’s normal.
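
The fix can be sketched as below. The text does not spell out the reflection, so this Python sketch assumes one plausible reading: a mirror reflection about the plane perpendicular to the (unit-length) vertex normal, which flips the component of the average normal that points against the vertex normal. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fix_smooth_direction(avg_normal, vertex_normal):
    """Keep the averaged normal if it agrees with the vertex normal;
    otherwise mirror it so that it points to the same side.
    Assumes vertex_normal has unit length."""
    d = dot(avg_normal, vertex_normal)
    if d > 0.0:
        return avg_normal
    # Mirror reflection: r = a - 2 (a . n) n
    return tuple(a - 2.0 * d * n for a, n in zip(avg_normal, vertex_normal))
```

After the reflection, the dot product with the vertex normal is never negative, so the starting direction no longer points away from the side the vertex normal faces.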
In addition, when baking geometry, it is sometimes necessary for the user to split
a bake into different sections to avoid rays being cast from the inside of the reference
model. Splitting working models and reference models in this way can reduce bake
time, but the resulting sections can become completely isolated from the rest of the
reference geometry. One possible way to mitigate this issue is to expose a user option
to bake using all reference geometry. This can be useful when baking maps that
require scene information, such as ambient occlusion.
Finally, since every user has unique use cases for their baked textures, it’s best to
be aware of the tangent space the working model will be used in, as well as the hand-
edness of the target application to avoid issues such as baked normals facing the wrong
direction.

3.1.2 Implementation
Before we begin discussing implementation details, it’s important to introduce the base
functions that we’ll be depending on. We’ll be using the function findTraceDirection
to determine the direction in which our ray will be cast, findTraceOrigin to
determine the origin of our rays, and finally traceRay to perform the ray tracing
operation. For the sake of brevity, we’ll omit traceRay from any source listings in this
chapter, but a working implementation can be found in our included example.

// Interpolate between the current input smooth normal and
// face normal to determine the final ray trace direction.
vec3 findTraceDirection(vec3 position, vec3 smoothNormal,
                        vec2 uv, sampler2D dirMask)
{
    // Face normal from the screen-space derivatives of the position
    // (available in fragment shaders only).
    vec3 dx = dFdx(position);
    vec3 dy = dFdy(position);
    vec3 faceNormal = normalize(cross(dx, dy));
    float traceBlend = textureLod(dirMask, uv, 0.0).x;

    // Keep the face normal on the same side as the smooth normal.
    if (dot(faceNormal, smoothNormal) < 0.0)
    {
        faceNormal = -faceNormal;
    }

    vec3 diff = smoothNormal - faceNormal;
    float diffLen = length(diff);
    float maxLen = sqrt(2.0) * traceBlend;

    // Interpolate final direction
    if (diffLen > maxLen)
    {
        diff *= maxLen / diffLen;
    }

    vec3 dir = faceNormal + diff;
    // Rays are cast inward, against the blended normal.
    return -normalize(dir);
}

// Interpolate between a range of offsets to determine the
// origin of rays cast.
vec3 findTraceOrigin(vec3 position, vec3 direction, vec2 uv,
                     sampler2D offsetMask, vec2 offsetRange)
{
    float offset = textureLod(offsetMask, uv, 0.0).x;
    offset = mix(offsetRange.x, offsetRange.y, offset);
    // Back up from the surface, against the ray direction.
    vec3 traceOrigin = position - offset * direction;
    return traceOrigin;
}

Listing 3.1. Base function implementations for raycast calculations.
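
To make the blending in findTraceDirection easier to follow, here is a CPU-side Python port of its arithmetic, offered as a sketch rather than as part of the demo. The face normal and blend value are passed in directly, since dFdx, dFdy, and texture fetches are not available outside a shader. Once the face normal has been flipped to the side of the smooth normal, two unit vectors are at most sqrt(2) apart, which is why the clamp length is sqrt(2) times the blend value.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def find_trace_direction(face_normal, smooth_normal, trace_blend):
    """trace_blend = 0 gives the face normal, 1 the smooth normal."""
    if dot(face_normal, smooth_normal) < 0.0:
        face_normal = tuple(-c for c in face_normal)
    diff = tuple(s - f for s, f in zip(smooth_normal, face_normal))
    diff_len = math.sqrt(dot(diff, diff))
    max_len = math.sqrt(2.0) * trace_blend
    if diff_len > max_len:
        # Clamp the step from the face normal toward the smooth normal.
        diff = tuple(c * (max_len / diff_len) for c in diff)
    direction = tuple(f + d for f, d in zip(face_normal, diff))
    # Rays point inward, against the blended normal.
    return tuple(-c for c in normalize(direction))
```

With perpendicular face and smooth normals, a blend of 0.5 yields a direction halfway between the two, while blends of 0 and 1 return the negated face normal and the negated smooth normal, respectively.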



Normal Map. Normals are available directly from the reference model’s vertex data.
For each fragment, one simply interpolates the normals of the vertices of the intersected
reference triangle using the barycentric coordinates of the hit. Tangent-space normals
additionally require transforming the reference model’s normal into the working model’s
tangent frame:

Figure 3.6. A tangent space normal map texture applied to a model rendered with 16×
multisampling.

vec3 n = vec3(0.0, 0.0, 1.0);
vec3 traceDir = findTraceDirection(inPosition, normalize(inBakeDir),
                                   inTextureCoords, tSkewMap);
vec3 tracePos = findTraceOrigin(inPosition, traceDir,
                                inTextureCoords, tOffsetMap, uOffsetRange);

TriangleHit hit;
bool didHit = traceRay(tracePos, traceDir, hit);
if (didHit)
{
    // Interpolate the reference normals with the barycentric
    // coordinates of the hit.
    n = hit.coords.x * uNormals[hit.vertices.x]
      + hit.coords.y * uNormals[hit.vertices.y]
      + hit.coords.z * uNormals[hit.vertices.z];
    n = normalize(n);
}

// Object-space normal
outObjectNormals.rgb = n;
outObjectNormals.a = 1.0;
// Tangent-space normal: project onto the working model's tangent frame
outTangentNormals.rgb = vec3(dot(n, inTangents), dot(n, inBitangents),
                             dot(n, inNormals));
outTangentNormals.a = 1.0;

Listing 3.2. An example implementation of baking the reference model’s vertex normal.

Height Map. Height maps are used as inputs in tessellation to determine areas that
require more subdivisions, and to offset those areas by the input texture.

Figure 3.7. Height map computed from cage offset distance to reference model.

vec3 traceDir = findTraceDirection(inPosition, normalize(inBakeDir),
                                   inTextureCoords, tSkewMap);
vec3 tracePos = findTraceOrigin(inPosition, traceDir,
                                inTextureCoords, tOffsetMap, uOffsetRange);

TriangleHit hit;
bool didHit = traceRay(tracePos, traceDir, hit);

// Hit distance measured from the cage, minus the cage offset, gives the
// distance from the working surface to the reference surface.
float height = didHit ? hit.distance : 0.0;
height -= length(inPosition - tracePos);

// Interpolate between user specified minimum and
// maximum height values
height = height * uHeightScaleBias.x + uHeightScaleBias.y;

outHeight.xyz = vec3(height, height, height);
outHeight.w = didHit ? 1.0 : 0.0;

Listing 3.3. Height map computed from cage offset distance to reference model.
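
The arithmetic of Listing 3.3 can be traced on the CPU with the Python sketch below. How uHeightScaleBias is derived is not stated in the text; the sketch assumes a scale and bias that map a user-specified range [height_min, height_max] to [0, 1]. The function name and parameters are hypothetical.

```python
def bake_height(hit_distance, cage_offset, height_min, height_max):
    """Return (height, alpha) for one texel.

    hit_distance: distance from the ray origin on the cage to the hit,
                  or None if the ray missed the reference model.
    cage_offset:  distance from the ray origin to the working surface.
    """
    did_hit = hit_distance is not None
    # Signed distance from the working surface to the reference surface.
    height = (hit_distance if did_hit else 0.0) - cage_offset
    # Assumed scale/bias: map [height_min, height_max] to [0, 1].
    scale = 1.0 / (height_max - height_min)
    bias = -height_min * scale
    return height * scale + bias, (1.0 if did_hit else 0.0)
```

A hit exactly on the working surface (hit_distance equal to cage_offset) lands at the midpoint of a symmetric range, and a miss is flagged with an alpha of 0 so that it can be filled in later.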

Ambient Occlusion Map. Ambient occlusion describes the average amount of light from an
omnidirectional light source that would be expected to miss a region. This value
can be estimated through Monte Carlo sampling, with rays cast over the hemisphere
around the hit point of the initial raycast.

Figure 3.8. An example of a baked ambient occlusion map set to 4096 rays.

#define SAMPLES 16

// Default output for texels whose initial ray misses the reference model.
outAO = vec4(1.0, 1.0, 1.0, 0.0);
vec3 traceDir = findTraceDirection(inPosition, normalize(inBakeDir),
                                   inTextureCoords, tSkewMap);
vec3 tracePos = findTraceOrigin(inPosition, traceDir,
                                inTextureCoords, tOffsetMap, uOffsetRange);

TriangleHit hit;
if (traceRay(tracePos, traceDir, hit))
{
    // Back off slightly from the hit point to avoid self-intersection.
    vec3 pos = tracePos + traceDir
             * (hit.distance - uHemisphereOffset);
    // Basis around the interpolated reference normal at the hit.
    vec3 basisY = normalize(hit.coords.x * uNormals[hit.vertices.x]
                          + hit.coords.y * uNormals[hit.vertices.y]
                          + hit.coords.z * uNormals[hit.vertices.z]);
    vec3 basisX = normalize(cross(basisY, fTangent));
    vec3 basisZ = cross(basisX, basisY);

    float ao = 0.0;        // summed weight of occluded rays
    float weightSum = 0.0; // summed weight of all rays
    TriangleHit hit2;
    for (int i = 0; i < SAMPLES; ++i)
    {
        // Random direction in hemisphere of first hit
        vec3 d = normalize(rand3(fTexCoord + uRandSeed + float(i)));

        // Weight each ray by the cosine of its angle to the top of the
        // hemisphere, so rays near the normal count more when
        // averaging the final ambient occlusion.
        float omega = d.y;
        d = d.x * basisX + d.y * basisY + d.z * basisZ;

        weightSum += omega;
        if (traceRay(pos, d, hit2))
        {
            ao += omega;
        }
    }

    ao = weightSum > 0.0 ? 1.0 - ao / weightSum : 1.0;
    outAO.xyz = vec3(ao, ao, ao);
    outAO.w = 1.0;
}

Listing 3.4. An example implementation of Monte Carlo ambient occlusion baking.
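
The estimator behind this listing can be checked in isolation with a small Python sketch. It is an illustration under assumptions, not the demo's code: directions are drawn uniformly over the hemisphere by rejection sampling, the local +Y axis plays the role of the surface normal, and the scene is abstracted into a hypothetical is_occluded callback that stands in for traceRay.

```python
import math
import random


def sample_hemisphere(rng):
    """Uniform random unit direction with d.y >= 0 (local +Y is the normal)."""
    while True:
        d = (rng.uniform(-1.0, 1.0), rng.uniform(0.0, 1.0), rng.uniform(-1.0, 1.0))
        len2 = sum(c * c for c in d)
        if 1e-8 < len2 <= 1.0:  # keep points inside the unit half-ball
            length = math.sqrt(len2)
            return tuple(c / length for c in d)


def ambient_occlusion(is_occluded, samples, rng):
    """Cosine-weighted fraction of unoccluded rays: 1 = open, 0 = blocked."""
    occluded = 0.0
    weight_sum = 0.0
    for _ in range(samples):
        d = sample_hemisphere(rng)
        omega = d[1]  # cosine of the angle to the normal
        weight_sum += omega
        if is_occluded(d):
            occluded += omega
    return 1.0 - occluded / weight_sum if weight_sum > 0.0 else 1.0
```

An unoccluded point returns exactly 1, a fully enclosed point returns exactly 0, and a point with half of its hemisphere blocked converges toward 0.5 as the sample count grows.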

Material Atlas. Scene descriptions introduce the concept of a Mesh being composed
of several Primitives, each coupled with a Material. Different parts of a mesh can
correspond to different materials, which can lead to dense geometry made up of
many materials. This is great for authoring reference models; however, when designing
assets for real-time rendering, the need arises for a simpler working model that
encodes all these materials as textures used by a single material in a Physically
Based Rendering (PBR) workflow.
One solution to this problem is to process each material the working model is
composed of, masking out the geometry of what’s not being baked with an alpha of
0.0.

Figure 3.9. An example of an albedo map that uses the metalness workflow, baked from the
reference model’s Physically Based (PBR) materials using our technique. On the right is the
mesh textured with the generated albedo map rendered with a PBR metalness workflow.

3.2 GPU Considerations


Baking represents a potentially heavy workload, even when properly optimized for
GPU processing. Reference models are often composed of tens or hundreds of millions
of triangles, and output resolutions up to 8k or even 16k are not uncommon. While
simple bakes can be performed in just a few milliseconds, a high resolution ambient
occlusion bake can take several minutes to complete. Short preview and iteration times
are of high importance to artists, which makes the performance of a baking system a
primary consideration.

3.2.1 Acceleration Structure


In terms of processing time, the main task for a bake is the tracing of rays against the
reference mesh. These intersection tests must run quickly, and a properly built
acceleration structure is vital to improving ray tracing speed. Our system uses a k-d tree,
a special case of a binary space partitioning tree. Each node in the tree points either to
two children or, if the node is a leaf, to a list of triangles. In this way a mesh is
recursively subdivided into manageable chunks so that rays can test much smaller subsets
of triangles instead of the whole mesh. A full explanation of k-d tree construction and
traversal is outside the scope of this chapter; we encourage interested readers to read
more on the topic in Akenine-Möller et al. [2018].
Traversal of k-d trees can be thought of as a two-step process: traversing nodes,
and testing triangles once a leaf node is found. Traversal time tends to grow as a func-
tion of average tree depth, and triangle testing time tends to grow as a function of av-
erage leaf triangle count. A tree must be built to balance these quantities for minimal
execution time: a deep tree will spend too much time in traversal, and a shallow tree
will have too many triangles to test at leaf nodes. Our k-d tree is built with metrics for
preferred and maximal depth and triangle counts, and some careful exceptions to these
rules. A two-level heuristic is used: depth and triangle counts are kept within preferred
limits, except in cases where leaf triangle counts would be unacceptably high, in which
case a looser maximum depth is enforced. This produces trees well balanced for tra-
versal on the GPU in a variety of cases.
Tree traversal for each ray test runs as a loop of several iterations. We utilize a stackless
traversal algorithm [Popov et al. 2007] to minimize GPU register use. The complex flow of
branches in algorithms like this one is handled relatively well by modern GPU hardware,
but these algorithms have an inherently large variability in the time to trace rays. As a
result, the shared flow control between threads in a group causes traversal to stall as a
thread group waits on its slowest thread. Cases where neighboring rays are relatively
incoherent (that is, landing in different parts of the tree) perform relatively poorly for
this reason. Ray incoherence is also a source of cache misses, the k-d tree structure often
being quite large in memory. Optimizing coherence in flow control in GPU ray tracing is
an area of ongoing research.

preferredTrisPerLeaf := 12
maxTrisPerLeaf := 64
preferredNodeDepth := 23
maxNodeDepth := preferredNodeDepth + 10

buildNode(depth, triCount):
    if (depth < preferredNodeDepth and
        triCount > preferredTrisPerLeaf) or
       (depth < maxNodeDepth and
        triCount > maxTrisPerLeaf)
        Split triangles into two new nodes
        for each new node
            buildNode(depth + 1, triCount of new node)
    else
        Attach triangles to leaf node

Listing 3.5. Pseudocode for k-d tree construction. Triangle and depth limits have been chosen
empirically with a brute-force performance search.
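
The stopping rule of Listing 3.5 can be exercised with the runnable Python sketch below. Only the two-level depth and triangle-count test comes from the listing. The split itself is a deliberately simple stand-in (a median split of triangle centroids along the longest axis), chosen for brevity and not the split used by our baker; a real k-d tree splits space with a plane and may reference a straddling triangle from both children. The dictionary node layout and function names are assumptions.

```python
PREFERRED_TRIS_PER_LEAF = 12
MAX_TRIS_PER_LEAF = 64
PREFERRED_NODE_DEPTH = 23
MAX_NODE_DEPTH = PREFERRED_NODE_DEPTH + 10


def should_split(depth, tri_count):
    """Two-level heuristic: preferred limits first, looser limits as a fallback."""
    return ((depth < PREFERRED_NODE_DEPTH and tri_count > PREFERRED_TRIS_PER_LEAF) or
            (depth < MAX_NODE_DEPTH and tri_count > MAX_TRIS_PER_LEAF))


def build_node(centroids, depth=0):
    """Build a tree over triangle centroids given as (x, y, z) tuples."""
    if not should_split(depth, len(centroids)):
        return {"leaf": True, "tris": centroids}
    # Stand-in split: median along the axis with the largest extent.
    axis = max(range(3),
               key=lambda a: max(c[a] for c in centroids) - min(c[a] for c in centroids))
    ordered = sorted(centroids, key=lambda c: c[axis])
    mid = len(ordered) // 2
    return {"leaf": False, "axis": axis, "split": ordered[mid][axis],
            "children": (build_node(ordered[:mid], depth + 1),
                         build_node(ordered[mid:], depth + 1))}


def leaf_sizes(node):
    """Triangle counts of all leaves, left to right."""
    if node["leaf"]:
        return [len(node["tris"])]
    left, right = node["children"]
    return leaf_sizes(left) + leaf_sizes(right)
```

With 100 evenly spaced centroids, every leaf ends up at or below the preferred count of 12. Past the preferred depth of 23, a node is split only if it still holds more than 64 triangles, and never beyond depth 33.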

3.2.2 Dividing Work


Baking processes vary greatly in their time requirements, depending on reference mesh
size and shape, output resolution, sample count, and GPU capabilities. A low resolu-
tion preview bake may take only a millisecond or two, but a final resolution bake may
take several seconds or even minutes. At these longer time scales, we must contend
with limits imposed by many desktop operating systems on the length of time a set of
GPU render commands may take. Microsoft Windows in particular imposes a two-
second timeout for any GPU process, after which time the driver is reset. Even if this
timeout period were not an issue, users generally like to be able to do other things on
their machines while baking occurs. This motivates the subdivision of bakes into jobs
of manageable duration, which are executed in series. The GPU is then yielded to the
operating system between these jobs, to allow other programs to process and redraw,
as well as preventing driver resets.
A straightforward way of dividing work in baking processes is to divide the output
canvas into two dimensional sectors and process each sector as a separate job. The size
of these sectors must be carefully estimated to keep their execution time low, which
can be difficult. A good estimate, if an imperfect one, is found by multiplying the pixel
area of a sector with the number of rays each pixel will require. This produces fairly
large sectors for simple bakes, but small sectors for bakes that cast hundreds of rays
(such as ambient occlusion). Processing many small sectors comes with significant
performance cost, so there is a balance to be struck between bake speed and system
responsiveness. A user-specified setting for this priority has also proven helpful.
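
The estimate described above can be turned into a sector size with a few lines of Python. This sketch assumes a hypothetical per-job ray budget (in practice tuned per GPU and exposed through the user priority setting) and power-of-two square sectors; neither detail is prescribed by the text, and the function names are illustrative.

```python
def choose_sector_size(rays_per_pixel, ray_budget_per_job,
                       max_size=1024, min_size=16):
    """Largest power-of-two sector side whose estimated cost
    (pixel area x rays per pixel) fits within the budget."""
    size = max_size
    while size > min_size and size * size * rays_per_pixel > ray_budget_per_job:
        size //= 2
    return size


def sectors(width, height, size):
    """(x, y, w, h) rectangles covering the output canvas, one per job."""
    return [(x, y, min(size, width - x), min(size, height - y))
            for y in range(0, height, size)
            for x in range(0, width, size)]
```

With a budget of 2^20 rays per job, a one-ray-per-pixel bake gets 1024 × 1024 sectors, while a 256-ray ambient occlusion bake drops to 64 × 64 sectors, which means 4096 jobs for a 4096 × 4096 output.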

3.2.3 Memory Use


Finite video memory has proven to be a troublesome constraint, as data sizes vary with
user-specified inputs and settings, and are effectively unbounded. A large reference
mesh of 100M triangles and a corresponding k-d tree can easily occupy multiple gigabytes
of video memory on its own, let alone multiple 8k render targets and other intermediate
results. As a result, the maximum mesh size and output resolution are limited
somewhat by the user’s installed video memory.
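
A back-of-the-envelope calculation shows how quickly these sizes add up. The Python sketch below rests on assumptions that are not taken from our implementation: 32-bit indices, roughly half as many vertices as triangles (typical for closed triangle meshes), 32 bytes per vertex (position, normal, and texture coordinates as 32-bit floats), and "8k" read as 8192 × 8192 texels in a four-channel 32-bit float format. The k-d tree itself is not included.

```python
GIB = 1024 ** 3

triangles = 100_000_000
index_bytes = triangles * 3 * 4          # three 32-bit indices per triangle
vertex_bytes = (triangles // 2) * 32     # assumed vertex count and layout
mesh_bytes = index_bytes + vertex_bytes

target_bytes = 8192 * 8192 * 4 * 4       # RGBA, 32-bit float per channel

print(f"mesh:   {mesh_bytes / GIB:.2f} GiB")
print(f"target: {target_bytes / GIB:.2f} GiB")
```

Under these assumptions the mesh alone takes about 2.6 GiB before the acceleration structure is added, and each such render target takes another 1 GiB.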
Paging or piecemeal processing of meshes would at first seem an ideal solution to
this problem of memory limits, but this approach is in practice extremely difficult. It is
very nearly impossible to robustly predict where sample rays will land on the reference
mesh without first tracing them, which rules out piecemeal processing of meshes. This
requires, at minimum, the k-d tree structure and reference mesh to always fit in video
memory, alongside the render output surface and the working mesh.
Output render target sizes are often limited by the video driver to 16k or less. In
situations where users desire higher resolution, piecemeal processing is possible by
rendering subsections and compositing into a final buffer on the CPU. A similar ap-
proach can be used for efficient multisampling of bake outputs: subsections of the im-
age are rendered at an enlarged size, and then resolved to their final resolution in the
output image. This allows for high resolution images at high sample counts to be ren-
dered regardless of GPU limitations.

3.3 Future Work


Even incremental optimizations of processing time or memory use are likely to be well
received by users of baking software. Inconvenient processing times and hard memory
limits have long been the norm in this space, and fast GPU bakers are starting to change
the way 3D artwork is produced. Quicker turnaround times with larger meshes are the
ever-present goal. In addition to speed and capacity, new uses for baking algorithms
are always being devised and requested by users.
Further investigation into improved k-d tree construction and traversal is likely to
be productive. Other acceleration structures may be faster or use less memory; in par-
ticular a bounding volume hierarchy [Akenine-Möller et al. 2018, Chapter 26] may
offer a more compact representation of a reference mesh. Additionally, current GPU
traversal algorithms leave much to be desired, particularly when it comes to achieving
full occupancy in thread groups. The use of persistent worker threads that consume a
queue of rays, and as a result rarely go idle, has been proposed as a possible solution
to the thread occupancy problem [Akenine-Möller et al. 2018, Chapter 3].
Additionally, there may be some opportunity for caching ray hits between bake
outputs. When a user bakes multiple maps, some redundant work is performed between
these outputs. Retaining a map of ray hits may help amortize the cost of ray casting
between outputs, speeding up the crucial case of complex bakes.
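
As a rough illustration of this idea, the Python sketch below stores one primary hit per texel (triangle index plus barycentric coordinates) and reuses it to bake several per-vertex attributes without retracing. This is a speculative sketch of the proposed future work, not something our baker does; CachedHit and bake_from_cache are hypothetical names.

```python
from typing import NamedTuple, Tuple


class CachedHit(NamedTuple):
    tri: int                          # index of the reference triangle hit
    bary: Tuple[float, float, float]  # barycentric coordinates of the hit


def bake_from_cache(hit_cache, tri_vertices, attribute):
    """Bake one output from cached primary hits.

    hit_cache:    one CachedHit (or None for a miss) per texel.
    tri_vertices: (i0, i1, i2) vertex indices per reference triangle.
    attribute:    per-vertex tuples (normals, colors, ...).
    """
    out = []
    for hit in hit_cache:
        if hit is None:
            out.append(None)
            continue
        i0, i1, i2 = tri_vertices[hit.tri]
        a0, a1, a2 = attribute[i0], attribute[i1], attribute[i2]
        out.append(tuple(hit.bary[0] * x + hit.bary[1] * y + hit.bary[2] * z
                         for x, y, z in zip(a0, a1, a2)))
    return out
```

The same cache can feed a normal bake and a vertex color bake in turn. Outputs that need secondary rays, such as ambient occlusion, would still have to trace, but could start from the cached hit.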

Bibliography
TEIXEIRA, D. 2008. Baking Normal Maps on the GPU. In GPU Gems 3, Addison-Wesley.
POPOV, S., GÜNTHER, J., SEIDEL, H., AND SLUSALLEK, P. 2007. Stackless KD-Tree Traversal for
High Performance GPU Ray Tracing. In Computer Graphics Forum, 26:3, pp. 415–424. URL:
https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8659.2007.01064.x.
LANTZ, K. 2013. KD Tree Construction Using the Surface Area Heuristic, Stack-Based Tra-
versal, and The Hyperplane Separation Theorem. URL: https://www.keithlantz.net/2013/04.
AKENINE-MÖLLER, T., HAINES, E., HOFFMAN, N., PESCE, A., IWANICKI, M., AND HILLAIRE, S.
2018. Real-Time Rendering, 4th Ed. CRC Press.
4
IV

Writing an Efficient Vulkan Renderer
Arseny Kapoulkine

Vulkan is a new explicit cross-platform graphics API. It introduces many new concepts
that may be unfamiliar to even seasoned graphics programmers. The key goal of Vul-
kan is performance—however, attaining good performance requires in-depth
knowledge about these concepts and how to apply them efficiently, as well as how par-
ticular driver implementations implement these. This article will explore topics such
as memory allocation, descriptor set management, command buffer recording, pipeline
barriers, and render passes, and discuss ways to optimize CPU and GPU performance of
production desktop/mobile Vulkan renderers today, as well as look at what a
forward-looking Vulkan renderer could do differently.
Modern renderers are becoming increasingly complex and must support many dif-
ferent graphics APIs with varying levels of hardware abstraction and disjoint sets of
concepts. This sometimes makes it challenging to support all platforms at the same
level of efficiency. Fortunately, for most tasks Vulkan provides multiple options that
can be as simple as reimplementing concepts from other APIs with higher efficiency
due to targeting the code specifically towards the renderer needs, and as hard as rede-
signing large systems to make them optimal for Vulkan. We will try to cover both ex-
tremes when applicable—ultimately, this is a tradeoff between maximum efficiency on
Vulkan-capable systems and implementation and maintenance costs that every engine
needs to carefully pick. Additionally, efficiency is often application-dependent—the
guidance in this article is generic and ultimately best performance is achieved by pro-
filing the target application on a target platform and making an informed implementa-
tion decision based on the results.
This article assumes that the reader is familiar with the basics of Vulkan API, and
would like to understand them better and/or learn how to use the API efficiently.


4.1 Memory Management


Memory management remains an exceedingly complex topic, and in Vulkan it gets
even more so due to the diversity of heap configurations on different hardware. Earlier
APIs adopted a resource-centric concept—the programmer doesn’t have a concept of
graphics memory, only that of a graphics resource, and different drivers are free to
manage the resource memory based on API usage flags and a set of heuristics. Vulkan,
however, forces you to think about memory management up front, as you must manually
allocate memory to create resources.
A perfectly reasonable first step is to integrate VulkanMemoryAllocator (hence-
forth abbreviated as VMA), which is an open-source library developed by AMD that
solves some memory management details for you by providing a general purpose re-
source allocator on top of Vulkan functions. Even if you do use that library, there are
still multiple performance considerations that apply; the rest of this section will go over
memory caveats without assuming you use VMA; all of the guidance applies equally
to VMA.

4.1.1 Memory Heap Selection


When creating a resource in Vulkan, you have to choose a heap to allocate memory
from. A Vulkan device exposes a set of memory types, where each memory type has flags
that define the behavior of that memory and a heap index that defines the available
size. Most Vulkan implementations expose two or three of the following flag
combinations¹:

• VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT: this generally refers to GPU memory that is
not directly visible from the CPU. It is the fastest to access from the GPU, and it
is the memory you should use to store all render targets, GPU-only resources such
as buffers for compute, and all static resources such as textures and geometry
buffers.

• VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT: on AMD hardware, this memory type refers to up
to 256 MB of video memory that the CPU can write to directly. It is well suited to
reasonable amounts of data that the CPU writes every frame, such as uniform buffers
or dynamic vertex/index buffers.

1
We only cover memory allocation types that are writable from host and readable or writable
from GPU; for CPU readback of data that has been written by GPU, memory with the
VK_MEMORY_PROPERTY_HOST_CACHED_BIT flag is more appropriate.
• VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT²: this refers to CPU memory that is directly
visible from the GPU; reads from this memory go over the PCI Express bus. In the
absence of the previous memory type, this should generally be the choice for uniform
buffers or dynamic vertex/index buffers. It should also be used for staging buffers
that populate static resources allocated with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT.

• VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT: this refers to GPU memory that might never
need to be allocated for render targets on tiled architectures. It is recommended
to use lazily allocated memory to save physical memory for large render targets
that are never stored to, such as MSAA images or depth images.

On integrated GPUs, there is no distinction between GPU and CPU memory; these
devices generally expose VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT memory, which you can also use for all static
resources.
When dealing with dynamic resources, allocating in non-device-local host-visible
memory generally works well: it simplifies application management and is efficient
due to GPU-side caching of read-only data. For resources with a high degree of
random access, though, like dynamic textures, it’s better to allocate them in
VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT memory and upload data using staging buffers
allocated in VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT memory, similarly to how you
would handle static textures. In some cases you might need to do this for buffers as
well—while uniform buffers typically don’t suffer from this, in some applications us-
ing large storage buffers with highly random access patterns will generate too many
PCIe transactions unless you copy the buffers to GPU first; additionally, host memory
does have higher access latency from the GPU side that can impact performance for
many small draw calls.
When allocating resources from VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT memory, you can
run out of memory if VRAM is oversubscribed; in this case you should fall back to
allocating the resources in non-device-local VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
memory. Naturally, you should make sure that large, frequently used resources such
as render targets are allocated first. There are other things you can do in the
event of oversubscription, such as migrating less frequently used resources from
GPU memory to CPU memory; this is outside the scope of this article. Additionally,
on some operating systems like Windows 10, correct handling of oversubscription
requires APIs that are not currently available in Vulkan.
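The selection logic described in this section can be sketched as follows. This is a
minimal illustration rather than a definitive implementation: MemoryType and the
k-prefixed constants are hypothetical stand-ins for VkMemoryType and the
VK_MEMORY_PROPERTY_* flags (with the same numeric values) so that the code builds
without the Vulkan SDK, and findDynamicMemoryType is an assumed helper name. It only
covers choosing a memory type; falling back when an allocation itself fails is a
separate step.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for Vulkan types so the sketch builds without the Vulkan SDK.
// The numeric values match the VK_MEMORY_PROPERTY_* bits they are named after.
constexpr uint32_t kDeviceLocal  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t kHostVisible  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t kHostCoherent = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
constexpr uint32_t kNoMemoryType = ~0u;

struct MemoryType // hypothetical mirror of VkMemoryType
{
    uint32_t propertyFlags;
    uint32_t heapIndex;
};

// Returns the first memory type that is permitted by typeBits (the
// memoryTypeBits of the resource's VkMemoryRequirements) and that has every
// flag in `required`, or kNoMemoryType if there is none.
uint32_t findMemoryType(const std::vector<MemoryType>& types,
                        uint32_t typeBits, uint32_t required)
{
    assert(types.size() <= 32);
    for (uint32_t i = 0; i < types.size(); ++i)
    {
        bool permitted = (typeBits & (1u << i)) != 0;
        bool suitable = (types[i].propertyFlags & required) == required;
        if (permitted && suitable)
            return i;
    }
    return kNoMemoryType;
}

// For per-frame dynamic data: prefer device-local host-visible memory and
// fall back to plain host-visible coherent memory when it is not exposed.
uint32_t findDynamicMemoryType(const std::vector<MemoryType>& types,
                               uint32_t typeBits)
{
    uint32_t index = findMemoryType(types, typeBits,
                                    kDeviceLocal | kHostVisible);
    if (index == kNoMemoryType)
        index = findMemoryType(types, typeBits,
                               kHostVisible | kHostCoherent);
    return index;
}
```

In a real renderer the type list would come from vkGetPhysicalDeviceMemoryProperties,
and the same two-step pattern can be reused with other preferred and fallback flag
combinations.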

2
Note that VK_MEMORY_PROPERTY_HOST_COHERENT_BIT generally implies that the memory
will be write-combined; on some devices it’s possible to allocate non-coherent memory and
flush it manually with vkFlushMappedMemoryRanges.

4.1.2 Memory Suballocation


Unlike some other APIs that allow an option to perform one memory allocation per
resource, in Vulkan this is impractical for large applications—drivers are only required
to support up to 4096 individual allocations. In addition to the total number being lim-
ited, allocations can be slow to perform, may waste memory due to assuming worst
case possible alignment requirements, and also require extra overhead during com-
mand buffer submission to ensure memory residency. Because of this, suballocation is
necessary. A typical pattern of working with Vulkan involves performing large (e.g.,
16 MB to 256 MB depending on how dynamic the memory requirements are) alloca-
tions using vkAllocateMemory, and performing suballocation of objects within this
memory, effectively managing it yourself. Critically, the application needs to handle
alignment of memory requests correctly, as well as the bufferImageGranularity limit
that restricts valid configurations of buffers and images.
Briefly, bufferImageGranularity restricts the relative placement of buffer and
image resources in the same allocation, requiring additional padding between individ-
ual allocations. There are several ways to handle this:

• Always over-align image resources (as they typically have larger alignment to
begin with) by bufferImageGranularity, essentially using the maximum of the
required alignment and bufferImageGranularity for address and size alignment.

• Track the resource type for each allocation, and have the allocator add the
requisite padding only if the previous or following resource is of a different type.
This requires a somewhat more complex allocation algorithm.

• Allocate images and buffers in separate Vulkan allocations, thus sidestepping
the entire problem. This reduces internal fragmentation due to smaller alignment
padding but can waste more memory if the backing allocations are too big
(e.g., 256 MB).

On many GPUs the required alignment for image resources is substantially bigger
than it is for buffers which makes the last option attractive—in addition to reducing
waste due to lack of extra padding between buffers and images, it reduces internal
fragmentation due to image alignment when an image follows a buffer resource. VMA
provides implementations for option 2 (by default) and option 3 (see
VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT).
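As an illustration of the second option, the following is a minimal sketch of a
linear suballocator that adds granularity padding only when the resource type
changes. LinearBlock, ResourceKind, and kAllocationFailed are hypothetical names;
the sketch assumes that both the alignment and bufferImageGranularity are powers of
two, treats every buffer as linear and every image as optimally tiled, and never
frees individual allocations.

```cpp
#include <cassert>
#include <cstdint>

enum class ResourceKind { None, Buffer, Image };

constexpr uint64_t kAllocationFailed = ~0ull;

// Rounds value up to a multiple of alignment, which must be a power of two.
uint64_t alignUp(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hands out offsets within one large block obtained from vkAllocateMemory.
class LinearBlock
{
public:
    // granularity is VkPhysicalDeviceLimits::bufferImageGranularity.
    LinearBlock(uint64_t size, uint64_t granularity)
        : size_(size), granularity_(granularity) {}

    // Returns the offset of the new resource, or kAllocationFailed.
    uint64_t allocate(uint64_t bytes, uint64_t alignment, ResourceKind kind)
    {
        uint64_t start = alignUp(offset_, alignment);
        // Pad only when a buffer follows an image or vice versa.
        if (lastKind_ != ResourceKind::None && lastKind_ != kind)
            start = alignUp(start, granularity_);
        if (start > size_ || bytes > size_ - start)
            return kAllocationFailed;
        offset_ = start + bytes;
        lastKind_ = kind;
        return start;
    }

private:
    uint64_t size_;
    uint64_t granularity_;
    uint64_t offset_ = 0;
    ResourceKind lastKind_ = ResourceKind::None;
};
```

Because only the previous allocation is inspected, this works for append-only
blocks; an allocator that reuses freed ranges would also need to check the resource
that follows.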

4.1.3 Dedicated Allocations


While the memory management model that Vulkan provides implies that the applica-
tion performs large allocations and places many resources within one allocation using
suballocation, on some GPUs it’s more efficient to allocate certain resources as one
dedicated allocation. That way the driver can allocate the resources in faster memory
under special circumstances.
To that end, Vulkan provides an extension (core in 1.1) to perform dedicated
allocations: when allocating memory, you can specify that you are allocating this
memory for one individual resource instead of as an opaque blob. To know if this is
worthwhile, you can query the extended memory requirements via
vkGetImageMemoryRequirements2KHR or vkGetBufferMemoryRequirements2KHR; the
resulting struct, VkMemoryDedicatedRequirementsKHR, will contain the
requiresDedicatedAllocation flag (which might be set if the allocated resource needs
to be shared with other processes) and the prefersDedicatedAllocation flag.
In general, applications may see performance improvements from dedicated allo-
cations on large render targets that require a lot of read/write bandwidth depending on
the hardware and drivers.

4.1.4 Mapping Memory


Vulkan provides two options when mapping memory to get a CPU-visible pointer:

• Do this before the CPU needs to write data to the allocation, and unmap once the
write is complete.

• Do this right after the host-visible memory is allocated, and never unmap
memory.

The second option is otherwise known as persistent mapping and is generally a better
tradeoff—it minimizes the time it takes to obtain a writeable pointer (vkMapMemory is
not particularly cheap on some drivers), removes the need to handle the case where
multiple resources from the same memory object need to be written to simultaneously
(calling vkMapMemory on an allocation that’s already been mapped and not unmapped
is not valid) and simplifies the code in general.
The only downside is that this technique makes the 256 MB chunk of host-visible,
device-local VRAM on AMD GPUs described in Section 4.1.1 less useful: on systems
with Windows 7 and an AMD GPU, using persistent mapping on this memory may force
WDDM to migrate the allocations to system memory. If this combination is a critical
performance target for your users, then mapping and unmapping memory when needed
might be more appropriate.

4.2 Descriptor Sets


Unlike earlier APIs with a slot-based binding model, in Vulkan the application has
more freedom in how to pass resources to shaders. Resources are grouped into de-
scriptor sets that have an application-specified layout, and each shader can use several
descriptor sets that can be bound individually. It’s the responsibility of the application
to manage the descriptor sets to make sure that CPU doesn’t update a descriptor set
that’s in use by the GPU, and to provide the descriptor layout that has an optimal bal-
ance between CPU-side update cost and GPU-side access cost. In addition, since dif-
ferent rendering APIs use different models for resource binding and none of them
match Vulkan model exactly, using the API in an efficient and cross-platform way be-
comes a challenge. We will outline several possible approaches to working with Vulkan
descriptor sets that strike different points on the scale of usability and performance.

4.2.1 Mental Model


When working with Vulkan descriptor sets, it’s useful to have a mental model of how
they might map to hardware. One such possibility—and the expected design—is that
descriptor sets map to a chunk of GPU memory that contains descriptors—opaque
blobs of data, 16-64 bytes in size depending on the resource, that completely specify
all resource parameters necessary for shaders to access resource data. When dispatch-
ing shader work, CPU can specify a limited number of pointers to descriptor sets; these
pointers become available to shaders as the shader threads launch.
With that in mind, Vulkan APIs can map more or less directly to this model—
creating a descriptor set pool would allocate a chunk of GPU memory that’s large
enough to contain the maximum specified number of descriptors. Allocating a set out
of descriptor pool can be as simple as incrementing the pointer in the pool by the cu-
mulative size of allocated descriptors as determined by VkDescriptorSetLayout
(note that such an implementation would not support memory reclamation when
freeing individual descriptor sets from the pool; vkResetDescriptorPool would set the
pointer back to the start of pool memory and make the entire pool available for
allocation again). Finally, vkCmdBindDescriptorSets would emit command buffer
commands that set GPU registers corresponding to descriptor set pointers.
Note that this model ignores several complexities, such as dynamic buffer offsets,
limited number of hardware resources for descriptor sets, etc. Additionally, this is just
one possible implementation—some GPUs have a less generic descriptor model and
require the driver to perform additional processing when descriptor sets are bound to
the pipeline. However, it’s a useful model to plan for descriptor set allocation/usage.
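The mental model above can be expressed as a toy bump allocator. This is purely
illustrative and does not describe any particular driver: ToyDescriptorPool,
kDescriptorBytes (an assumed 32-byte descriptor), and kPoolExhausted are
hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed descriptor size for the toy model; real sizes are hardware specific.
constexpr uint32_t kDescriptorBytes = 32;
constexpr uint32_t kPoolExhausted = ~0u;

class ToyDescriptorPool
{
public:
    // Models pool creation: reserve memory for maxDescriptors descriptors.
    explicit ToyDescriptorPool(uint32_t maxDescriptors)
        : capacity_(maxDescriptors * kDescriptorBytes) {}

    // Models vkAllocateDescriptorSets for a layout with descriptorCount
    // descriptors: returns the byte offset of the set within pool memory.
    uint32_t allocateSet(uint32_t descriptorCount)
    {
        uint32_t bytes = descriptorCount * kDescriptorBytes;
        if (bytes > capacity_ - cursor_)
            return kPoolExhausted;
        uint32_t offset = cursor_;
        cursor_ += bytes; // allocation is just a pointer increment
        return offset;
    }

    // Models vkResetDescriptorPool: the whole pool becomes available again.
    void reset() { cursor_ = 0; }

private:
    uint32_t capacity_;
    uint32_t cursor_ = 0;
};
```

Note that the model has no way to release a single set, which is why resetting
whole pools fits it so naturally.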

4.2.2 Dynamic Descriptor Set Management


Given the mental model above, you can treat descriptor sets as GPU-visible memory—
it’s the responsibility of the application to group descriptor sets into pools and keep
them around until GPU is done reading them.
A scheme that works well is to use free lists of descriptor set pools; whenever
you need a descriptor set pool, you allocate one from the free list and use it for subse-
quent descriptor set allocations in the current frame on the current thread. Once you
run out of descriptor sets in the current pool, you allocate a new pool. Any pools that
were used in a given frame need to be kept around; once the frame has finished
rendering, as determined by the associated fence objects, the descriptor set pools can
be reset via vkResetDescriptorPool and returned to the free lists. While it’s possible
to free individual descriptor sets from a pool created with
VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT, this complicates memory
management on the driver side and is not recommended.
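A minimal single-threaded sketch of this recycling scheme is shown below.
PoolRecycler and PoolHandle are hypothetical names, with PoolHandle standing in for
VkDescriptorPool; the comments mark where the real Vulkan calls would go, and a
multithreaded renderer would keep one such object per recording thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

using PoolHandle = uint32_t; // hypothetical stand-in for VkDescriptorPool

class PoolRecycler
{
public:
    // Takes a pool from the free list, or creates a new one if it is empty.
    PoolHandle acquire()
    {
        PoolHandle pool;
        if (!free_.empty())
        {
            pool = free_.back();
            free_.pop_back();
        }
        else
        {
            pool = nextHandle_++; // vkCreateDescriptorPool would go here
        }
        current_.push_back(pool);
        return pool;
    }

    // Called after the work for frameIndex has been submitted.
    void endFrame(uint64_t frameIndex)
    {
        inFlight_.push_back({frameIndex, std::move(current_)});
        current_.clear();
    }

    // Called once the fence for completedFrame is known to have signaled.
    void retire(uint64_t completedFrame)
    {
        while (!inFlight_.empty() &&
               inFlight_.front().frameIndex <= completedFrame)
        {
            for (PoolHandle pool : inFlight_.front().pools)
                free_.push_back(pool); // vkResetDescriptorPool would go here
            inFlight_.pop_front();
        }
    }

    std::size_t freeCount() const { return free_.size(); }
    uint32_t createdCount() const { return nextHandle_; }

private:
    struct Frame
    {
        uint64_t frameIndex;
        std::vector<PoolHandle> pools;
    };

    std::vector<PoolHandle> free_;
    std::vector<PoolHandle> current_;
    std::deque<Frame> inFlight_;
    uint32_t nextHandle_ = 0;
};
```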
When a descriptor set pool is created, the application specifies the maximum number
of descriptor sets allocated from it, as well as the maximum number of descriptors of
each type that can be allocated from it. In Vulkan 1.1, the application doesn’t have
to account for these limits itself: it can just call vkAllocateDescriptorSets and
handle the error from that call by switching to a new descriptor set pool.
Unfortunately, in Vulkan 1.0 without any extensions, it’s an error to call
vkAllocateDescriptorSets if the pool does not have available space, so the
application must track the number of sets and descriptors of each type to know
beforehand when to switch to a different pool.
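For the Vulkan 1.0 case, the bookkeeping can be as simple as the sketch below.
PoolBudget is a hypothetical helper that tracks only two descriptor types for
brevity; a real implementation would keep one counter per descriptor type it uses.

```cpp
#include <cassert>
#include <cstdint>

// Remaining capacity of one descriptor pool, mirrored on the CPU.
struct PoolBudget
{
    uint32_t sets;     // remaining descriptor sets (from maxSets)
    uint32_t textures; // remaining texture descriptors
    uint32_t buffers;  // remaining buffer descriptors

    // Returns true and consumes capacity if a set with the given descriptor
    // counts fits; returns false if the caller must switch to another pool.
    bool tryReserve(uint32_t textureCount, uint32_t bufferCount)
    {
        if (sets == 0 || textureCount > textures || bufferCount > buffers)
            return false;
        sets -= 1;
        textures -= textureCount;
        buffers -= bufferCount;
        return true;
    }
};
```

The application would call tryReserve before vkAllocateDescriptorSets and acquire a
fresh pool whenever it returns false.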
Different pipeline objects may use different numbers of descriptors, which raises
the question of pool configuration. A straightforward approach is to create all pools
with the same configuration that uses the worst-case number of descriptors for each
type—for example, if each set can use at most 16 texture and 8 buffer descriptors, one
can allocate all pools with maxSets = 1024, and pool sizes 16 × 1024 for texture de-
scriptors and 8 × 1024 for buffer descriptors. This approach can work but in practice it
can result in very significant memory waste for shaders with different descriptor
count—you can’t allocate more than 1024 descriptor sets out of a pool with the afore-
mentioned configuration, so if most of your pipeline objects use 4 textures, you’ll be
wasting 75% of texture descriptor memory.
Two alternatives that provide a better balance with respect to memory use are:

• Measure the average number of descriptors of each type used in a shader pipeline
for a characteristic scene and allocate pool sizes accordingly. For example, if in a
given scene we need 3000 descriptor sets, 13400 texture descriptors, and 1700
buffer descriptors, then the average number of descriptors per set is 4.47 textures
(rounded up to 5) and 0.57 buffers (rounded up to 1), so a reasonable configuration
of a pool is maxSets = 1024, 5 × 1024 texture descriptors, 1024 buffer
descriptors. When a pool is out of descriptors of a given type, we allocate a new
one, so this scheme is guaranteed to work and should be reasonably efficient
on average.

• Group shader pipeline objects into size classes, approximating common patterns
of descriptor use, and pick descriptor set pools using the appropriate size class.
This is an extension of the scheme described above to more than one size class.
For example, it’s typical to have large numbers of shadow/depth prepass draw
calls, and large numbers of regular draw calls in a scene, but these two groups
have different numbers of required descriptors, with shadow draw calls typically
requiring 0 to 1 textures per set and 0 to 1 buffers when dynamic buffer offsets
are used. To optimize memory use, it’s more appropriate to allocate descriptor
set pools separately for shadow/depth and other draw calls. Similarly to
general-purpose allocators that can have size classes that are optimal for a given
application, this can still be managed in a lower-level descriptor set management
layer as long as it’s configured with application-specific descriptor set usages
beforehand.
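The arithmetic behind the worst-case waste and the average-based configuration can
be checked with a short sketch. PoolSizes, averagePoolSizes, and wastedFraction are
hypothetical helper names introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>

struct PoolSizes
{
    uint32_t maxSets;
    uint32_t textures;
    uint32_t buffers;
};

// Integer division that rounds up.
uint32_t ceilDiv(uint32_t a, uint32_t b)
{
    assert(b != 0);
    return (a + b - 1) / b;
}

// Sizes a pool from totals measured in a characteristic scene: the average
// descriptor count per set is rounded up and multiplied by maxSets.
PoolSizes averagePoolSizes(uint32_t measuredSets, uint32_t measuredTextures,
                           uint32_t measuredBuffers, uint32_t maxSets)
{
    return PoolSizes{maxSets,
                     ceilDiv(measuredTextures, measuredSets) * maxSets,
                     ceilDiv(measuredBuffers, measuredSets) * maxSets};
}

// Fraction of descriptors of one type left unused when a pool sized for
// worstCase descriptors per set only hands out sets that use `used` each.
double wastedFraction(uint32_t worstCase, uint32_t used)
{
    assert(worstCase != 0 && used <= worstCase);
    return 1.0 - double(used) / double(worstCase);
}
```

With the numbers from the text, averagePoolSizes(3000, 13400, 1700, 1024) yields
5 × 1024 texture descriptors and 1024 buffer descriptors, and wastedFraction(16, 4)
gives the 75% waste of the worst-case configuration.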

4.2.3 Choosing Appropriate Descriptor Types


For each resource type, Vulkan provides several ways to access it in a shader; the
application is responsible for choosing an optimal descriptor type.
For buffers, the application must choose between uniform and storage buffers, and
whether to use dynamic offsets. Uniform buffers have a limit on the maximum
addressable size: on desktop hardware, you get up to 64 KB of data, but on mobile
hardware some GPUs only provide 16 KB (which is also the minimum guaranteed by the
specification). The buffer resource can be larger than that, but a shader can only
access this much data through one descriptor.
On some hardware, there is no difference in access speed between uniform and
storage buffers, however for other hardware depending on the access pattern uniform
buffers can be significantly faster. Prefer uniform buffers for small to medium sized
data especially if the access pattern is fixed (e.g., for a buffer with material or scene
constants). Storage buffers are more appropriate when you need large arrays of data
that need to be larger than the uniform buffer limit and are indexed dynamically in the
shader.
For textures, if filtering is required, there is a choice of combined image/sampler
descriptor (where, like in OpenGL, descriptor specifies both the source of the texture
data, and the filtering/addressing properties), separate image and sampler descriptors
(which maps better to Direct3D 11 model), and image descriptor with an immutable
sampler descriptor, where the sampler properties must be specified when pipeline ob-
ject is created.
The relative performance of these methods is highly dependent on the usage pat-
tern; however, in general immutable descriptors map better to the recommended usage
model in other newer APIs like Direct3D 12, and give driver more freedom to optimize
the shader. This does alter renderer design to a certain extent, making it necessary to
implement certain dynamic portions of the sampler state, like per-texture LOD bias for
texture fade-in during streaming, using shader ALU instructions.

4.2.4 Slot-based Binding


A simpler alternative to the Vulkan binding model is the Metal/Direct3D 11 model,
where an application binds resources to slots and the runtime/driver manages
descriptor memory and descriptor set parameters. This model can be implemented on
top of Vulkan descriptor sets; while it does not give optimal results, it is
generally a good model to start with when porting an existing renderer, and with
careful implementation it can be surprisingly efficient.
To make this model work, the application needs to decide how many resource
namespaces there are and how they map to Vulkan set/slot indices. For example, in
Metal each stage (VS, FS, CS) has three resource namespaces—textures, buffers, sam-
plers—with no differentiation between, e.g., uniform buffers and storage buffers. In
Direct3D 11 the namespaces are more complicated since read-only structured buffers
belong to the same namespace as textures, but textures and buffers used with unordered
access reside in a separate one.
Vulkan specification only guarantees a minimum of 4 descriptor sets accessible to
the entire pipeline (across all stages); because of this, the most convenient mapping
option is to have resource bindings match across all stages—for example, a texture slot
3 would contain the same texture resource no matter what stage it’s accessed from—
and use different descriptor sets for different types, e.g., set 0 for buffers, set 1 for
textures, set 2 for samplers. Alternatively, an application can use one descriptor set per
stage³ and perform static index remapping (e.g., slots 0–16 would be used for textures,
slots 17–24 for uniform buffers, etc.); this, however, can use much more descriptor
set memory and isn’t recommended. Finally, one could implement optimally compact
dynamic slot remapping for each shader stage (e.g., if a vertex shader uses texture slots
0, 4, 5, then they map to Vulkan descriptor indices 0, 1, 2 in set 0), with the
application extracting the relevant texture information at runtime using this
remapping table.
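The compact remapping table from the last option can be built as in the following
sketch. buildSlotRemap is a hypothetical helper; it assumes the list of slots used
by a shader stage is available from shader reflection and is sorted in ascending
order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds a table that maps an application-facing slot to a compact Vulkan
// binding index, or -1 if the shader stage does not use that slot.
std::vector<int> buildSlotRemap(const std::vector<uint32_t>& usedSlots,
                                uint32_t slotCount)
{
    std::vector<int> remap(slotCount, -1);
    int nextBinding = 0;
    for (uint32_t slot : usedSlots)
    {
        assert(slot < slotCount);
        if (remap[slot] < 0) // ignore duplicates
            remap[slot] = nextBinding++;
    }
    return remap;
}
```

For a vertex shader that uses texture slots 0, 4, and 5, the table maps them to
bindings 0, 1, and 2 and leaves every other slot unmapped.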
In all these cases, the implementation of setting a texture to a given slot wouldn’t
generally run any Vulkan commands and would just update shadow state; just before
the draw call or dispatch you’d need to allocate a descriptor set from the appropriate
pool, update it with new descriptors, and bind all descriptor sets using
vkCmdBindDescriptorSets. Note that if a descriptor set has 5 resources, and only one of them
changed since the last draw call, you still need to allocate a new descriptor set with 5
resources and update all of them.
To reach good performance with this approach, you need to follow several guide-
lines:

• Don’t allocate or update descriptor sets if nothing in the set changed. In the
model with slots that are shared between different stages, this can mean that if
no textures are set between two draw calls, you don’t need to allocate/update
the descriptor set with texture descriptors.

• Batch calls to vkAllocateDescriptorSets if possible. On some drivers, each call
has measurable overhead, so if you need to update multiple sets, allocating them
all in one call can be faster.

• To update descriptor sets, either use vkUpdateDescriptorSets with a descriptor
write array, or use vkUpdateDescriptorSetWithTemplate from Vulkan 1.1. Using the
descriptor copy functionality of vkUpdateDescriptorSets is tempting with dynamic
descriptor management for copying most descriptors out of a previously allocated
array, but this can be slow on drivers that allocate descriptors out of
write-combined memory. Descriptor templates can reduce the amount of work the
application needs to do to perform updates: since in this scheme you need to read
descriptor information out of shadow state maintained by the application,
descriptor templates allow you to tell the driver the layout of your shadow state,
making updates substantially faster on some drivers.

• Finally, prefer dynamic uniform buffers to updating uniform buffer descriptors.
Dynamic uniform buffers allow you to specify offsets into buffer objects using the
pDynamicOffsets argument of vkCmdBindDescriptorSets without allocating and
updating new descriptors. This works well with dynamic constant management where
constants for draw calls are allocated out of large uniform buffers, substantially
reduces CPU overhead, and can be more efficient on the GPU. While on some GPUs the
number of dynamic buffers must be kept small to avoid extra overhead in the driver,
one or two dynamic uniform buffers should work well in this scheme on all
architectures.

3
Note that with 4 descriptor sets per pipeline, this approach can’t handle full
pipeline setup for VS, GS, FS, TCS and TES, which is only a problem if you use
tessellation on drivers that only expose 4 descriptor sets.

In general, the approach outlined above can be very efficient; it’s not as efficient
as the approaches with more static descriptor sets that are described below, but it
can still run circles around older APIs if implemented carefully. Unfortunately, on
some drivers the allocate-and-update path is not well optimized; on some mobile
hardware, it may make sense to cache descriptor sets based on the descriptors they
contain if they can be reused later in the frame.
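The first guideline, skipping allocation when nothing changed, amounts to dirty
tracking on the shadow state. The sketch below is a simplified illustration:
TextureSlots is a hypothetical type, texture bindings are plain integer identifiers,
and the counter stands in for the actual vkAllocateDescriptorSets and
vkUpdateDescriptorSets calls.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Shadow state for one namespace (textures) shared by all shader stages.
struct TextureSlots
{
    std::array<uint32_t, 8> bound{}; // hypothetical texture identifiers
    bool dirty = true;               // nothing has been flushed yet
    uint32_t setsAllocated = 0;

    // Setting a slot only updates shadow state; no Vulkan calls are made.
    void set(uint32_t slot, uint32_t texture)
    {
        assert(slot < bound.size());
        if (bound[slot] != texture)
        {
            bound[slot] = texture;
            dirty = true;
        }
    }

    // Called before each draw or dispatch. Returns true if a new descriptor
    // set had to be allocated and written with all slots of this namespace.
    bool flush()
    {
        if (!dirty)
            return false; // reuse the descriptor set that is already bound
        ++setsAllocated;  // allocate + update + bind would go here
        dirty = false;
        return true;
    }
};
```

Redundant state sets (binding the same texture again) do not mark the state dirty,
so they cost no descriptor set allocations.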

4.2.5 Frequency-based Descriptor Sets


While slot-based resource binding model is simple and familiar, it doesn’t result in
optimal performance. Some mobile hardware may not support multiple descriptor sets;
however, in general Vulkan API and driver expect an application to manage descriptor
sets based on frequency of change.
A more Vulkan centric renderer would organize data that the shaders need to ac-
cess into groups by frequency of change, and use individual sets for individual frequen-
cies, with set = 0 representing least frequent change, and set = 3 representing most
frequent. For example, a typical setup would involve:

• Set = 0 descriptor set containing uniform buffer with global, per-frame or
per-view data, as well as globally available textures such as shadow map texture
array/atlas.

• Set = 1 descriptor set containing uniform buffer and texture descriptors for
per-material data, such as albedo map, Fresnel coefficients, etc.

• Set = 2 descriptor set containing dynamic uniform buffer with per-draw data,
such as world transform array.

For set = 0, the expectation is that it only changes a handful of times per frame;
it’s sufficient to use a dynamic allocation scheme similar to the previous section.
For set = 1, the expectation is that for most objects, the material data persists be-
tween frames, and as such could be allocated and updated only when the gameplay
code changes material data.
For set = 2, the data would be completely dynamic; due to the use of a dynamic
uniform buffer, we’d rarely need to allocate and update this descriptor set—assuming
dynamic constants are uploaded to a series of large per-frame buffers, for most draws
we’d only need to write the constant data into the buffer and call
vkCmdBindDescriptorSets with new offsets.
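A minimal sketch of the per-frame constant allocation follows. ConstantArena and
kArenaFull are hypothetical names; the alignment would be
VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment, and the returned offset is
the value passed through pDynamicOffsets. Copying the constants into the mapped
buffer is omitted.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kArenaFull = ~0u;

// Linear allocator over one large per-frame uniform buffer.
class ConstantArena
{
public:
    ConstantArena(uint32_t capacity, uint32_t alignment)
        : capacity_(capacity), alignment_(alignment)
    {
        // The offset alignment limit is a power of two.
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    }

    // Reserves space for one draw's constants and returns the dynamic
    // offset to bind, or kArenaFull if another buffer is needed.
    uint32_t push(uint32_t bytes)
    {
        uint32_t start = (cursor_ + alignment_ - 1) & ~(alignment_ - 1);
        if (start > capacity_ || bytes > capacity_ - start)
            return kArenaFull;
        cursor_ = start + bytes;
        return start;
    }

    // Called when the GPU has finished the frame that used this buffer.
    void reset() { cursor_ = 0; }

private:
    uint32_t capacity_;
    uint32_t alignment_;
    uint32_t cursor_ = 0;
};
```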
Note that due to compatibility rules between pipeline objects, in most cases it’s
enough to bind sets 1 and 2 whenever a material changes, and only set 2 when material
is the same as that for the previous draw call. This results in just one call to
vkCmdBindDescriptorSets per draw call.
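The rule above reduces to choosing the firstSet and descriptorSetCount arguments of
vkCmdBindDescriptorSets. A minimal sketch, assuming materials are identified by an
integer and using the hypothetical names BindRange and setsToBind:

```cpp
#include <cassert>
#include <cstdint>

// Range of descriptor sets to pass to a single vkCmdBindDescriptorSets call.
struct BindRange
{
    uint32_t firstSet;
    uint32_t setCount;
};

// Set 1 holds per-material data and set 2 holds per-draw data. When the
// material changes, both are rebound; otherwise only set 2 is rebound.
BindRange setsToBind(uint32_t previousMaterial, uint32_t material)
{
    if (previousMaterial != material)
        return BindRange{1, 2};
    return BindRange{2, 1};
}
```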
For a complex renderer, different shaders might need to use different layouts—for
example, not all shaders need to agree on the same layout for material data. In rare
cases it might also make sense to use more than 3 sets depending on the frame struc-
ture. Additionally, given the flexibility of Vulkan it’s not strictly required to use the
same resource binding system for all draw calls in the scene. For example, post-pro-
cessing draw call chains tend to be highly dynamic, with texture/constant data changing
completely between individual draw calls. Some renderers initially implement the dy-
namic slot-based binding model from the previous section and proceed to additionally
implement the frequency-based sets for world rendering to minimize the performance
penalty for set management, while still keeping the simplicity of slot-based model for
more dynamic parts of the rendering pipeline.
The scheme described above assumes that in most cases, per-draw data is larger
than the size that can be efficiently set via push constants. Push constants can be set
without updating or rebinding descriptor sets; with a guaranteed limit of 128 bytes per
draw call, it’s tempting to use them for per-draw data such as a 4x3 transform matrix
for an object. However, on some architectures the actual number of constants available
to push quickly depends on the descriptor setup the shaders use, and is closer to 12
bytes or so. Exceeding this limit can force the driver to spill the push constants into
driver-managed ring buffer, which can end up being more expensive than moving this
data to a dynamic uniform buffer on the application side. While limited use of push
constants may still be a good idea for some designs, it’s more appropriate to use them
in a fully bindless scheme described in the next section.

4.2.6 Bindless Descriptor Designs


Frequency-based descriptor sets reduce the descriptor set binding overhead; however,
you still need to bind one or two descriptor sets per draw call. Maintaining material
descriptor sets requires a management layer that needs to update GPU-visible de-
scriptor sets whenever material parameters change; additionally, since texture de-
scriptors are cached in material data, this makes global texture streaming systems hard
to deal with—whenever some mipmap levels in a texture get streamed in or out, all
materials that refer to this texture need to be updated. This requires complex interaction
between material system and texture streaming system and introduces extra overhead
whenever a texture is adjusted—which partially offsets the benefits of the frequency-
based scheme. Finally, due to the need to set up descriptor sets per draw call it’s hard
to adapt any of the aforementioned schemes to GPU-based culling or command
submission.
It is possible to design a bindless scheme where the number of required set binding
calls is constant for the world rendering, which decouples texture descriptors from ma-
terials, making texture streaming systems easier to implement, and facilitates GPU-
based submission. As with the previous scheme, this can be combined with dynamic
ad-hoc descriptor updates for parts of the scene where the number of draw calls is
small, and flexibility is important, such as post-processing.
To fully leverage bindless, core Vulkan may or may not be sufficient; some bind-
less implementations require updating descriptor sets without rebinding them after the
update, which is not available in core Vulkan 1.0 or 1.1 but is possible to achieve with
VK_EXT_descriptor_indexing extension. However, basic design described below
can work without extensions, given high enough descriptor set limits. This requires
double buffering for the texture descriptor array described below to update individual
descriptors since the array would be constantly accessed by GPU.
Similarly to the frequency-based design, we’ll split the shader data into global
uniforms and textures (set 0), material data and per-draw data. Global uniforms and
textures can be specified via a descriptor set the same way as described in the previous
section.
For per-material data, we will move the texture descriptors into a large texture
descriptor array (note: this is a different concept from a texture array; a texture array
uses one descriptor and forces all textures to have the same size and format, whereas
a descriptor array doesn’t have this limitation and can contain arbitrary texture
descriptors as array elements, including texture array descriptors). Each material in
the material data will have indices into this array instead of texture descriptors; the
indices will be part of the material data, which will also contain other material
constants.
All material constants for all materials in the scene will reside in one large storage
buffer; while it’s possible to support multiple material types with this scheme, for sim-
plicity we’ll assume that all materials can be specified using the same data. An example
of material data structure is below:

struct MaterialData
{
    vec4 albedoTint;

    float tilingX;
    float tilingY;
    float reflectance;
    float unused0; // pad to vec4

    uint albedoTexture;
    uint normalTexture;
    uint roughnessTexture;
    uint unused1; // pad to vec4
};
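
On the CPU side, the same structure has to be written into the material storage buffer with a matching memory layout. The following is a minimal C++ sketch, assuming the buffer is declared with std430 layout in the shader; the name MaterialDataCPU is hypothetical and not part of the original design.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical CPU-side mirror of the GLSL MaterialData struct.
// Assumes std430 layout: vec4 is 16-byte aligned, float and uint are
// 4-byte scalars, and the struct is padded to a multiple of 16 bytes.
struct alignas(16) MaterialDataCPU
{
    float albedoTint[4];

    float tilingX;
    float tilingY;
    float reflectance;
    float unused0; // pad to vec4

    std::uint32_t albedoTexture;    // index into the texture descriptor array
    std::uint32_t normalTexture;    // index into the texture descriptor array
    std::uint32_t roughnessTexture; // index into the texture descriptor array
    std::uint32_t unused1;          // pad to vec4
};

// Catch accidental layout drift at compile time.
static_assert(sizeof(MaterialDataCPU) == 48, "three vec4-sized rows");
static_assert(offsetof(MaterialDataCPU, tilingX) == 16, "second row");
static_assert(offsetof(MaterialDataCPU, albedoTexture) == 32, "third row");
```

The explicit padding members keep the CPU and GPU views of the array stride identical, so material index i can be addressed as element i on both sides.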

Similarly, all per-draw constants for all objects in the scene can reside in another
large storage buffer; for simplicity, we’ll assume that all per-draw constants have iden-
tical structure. To support skinned objects in a scheme like this, we’ll extract transform
data into a separate, third storage buffer:

struct TransformData
{
    vec4 transform[3];
};

Something that we’ve ignored so far is the vertex data specification. While Vulkan
provides a first-class way to specify vertex data by calling vkCmdBindVertexBuffers,
having to bind vertex buffers per draw would not work for a fully bindless design.
Additionally, some hardware doesn’t support vertex buffers as a first-class entity,
and the driver has to emulate vertex buffer binding, which causes some CPU-side
slowdowns when using vkCmdBindVertexBuffers. In a fully bindless design, we
need to assume that all vertex buffers are suballocated in one large buffer and either
use per-draw vertex offsets (the vertexOffset argument to vkCmdDrawIndexed) to have
hardware fetch data from it, or pass an offset in this buffer to the shader with each draw
call and fetch data from the buffer in the shader. Both approaches can work well, and
might be more or less efficient depending on the GPU; here we will assume that the
vertex shader will perform manual vertex fetching.
Thus, for each draw call we need to specify three integers to the shader:

• Material index; used to look up material data from the material storage buffer. The
textures can then be accessed using the indices from the material data and the
descriptor array.
• Transform data index; used to look up transform data from the transform storage
buffer.
• Vertex data offset; used to look up vertex attributes from the vertex storage buffer.

We can specify these indices and additional data, if necessary, via draw data:

struct DrawData
{
    uint materialIndex;
    uint transformOffset;
    uint vertexOffset;
    uint unused0; // vec4 padding

    // ... extra gameplay data goes here
};

The shader will need to access storage buffers containing MaterialData,
TransformData, and DrawData, as well as a storage buffer containing vertex data. These
can be bound to the shader via the global descriptor set; the only remaining piece of
information is the draw data index, which can be passed via a push constant.
With this scheme, we’d need to update the storage buffers used by materials and
draw calls each frame and bind them once using our global descriptor set; additionally,
we need to bind index data. Assuming that, like vertex data, index data is allocated in
one large index buffer, we only need to bind it once using vkCmdBindIndexBuffer.
With the global setup complete, for each draw call we need to call
vkCmdBindPipeline if the shader changes, followed by vkCmdPushConstants to specify
an index into the draw data buffer⁴, followed by vkCmdDrawIndexed.
In a GPU-centric design, we can use vkCmdDrawIndirect or
vkCmdDrawIndirectCountKHR (provided by the VK_KHR_draw_indirect_count extension)
and fetch per-draw constants using gl_DrawIDARB (provided by the
VK_KHR_shader_draw_parameters extension) as an index instead of push constants. The
only caveat is that for GPU-based submission, we’d need to bucket draw calls by
pipeline object on the CPU, since there’s no support for switching pipeline objects
otherwise.
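
The bucketing step can be sketched on the CPU as a sort followed by a linear scan. This is a minimal C++ illustration rather than the author's implementation; Draw, PipelineBucket and bucketByPipeline are hypothetical names, and pipeline objects are represented by plain integer ids.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side draw record: which pipeline it needs and which
// DrawData entry it refers to.
struct Draw
{
    std::uint32_t pipelineId;
    std::uint32_t drawDataIndex;
};

// A contiguous range of draws that share one pipeline object.
struct PipelineBucket
{
    std::uint32_t pipelineId;
    std::uint32_t firstDraw;
    std::uint32_t drawCount;
};

// Sorts draws by pipeline (keeping relative order within a pipeline) and
// returns one bucket per distinct pipeline.
std::vector<PipelineBucket> bucketByPipeline(std::vector<Draw>& draws)
{
    std::stable_sort(draws.begin(), draws.end(),
                     [](const Draw& a, const Draw& b) { return a.pipelineId < b.pipelineId; });

    std::vector<PipelineBucket> buckets;
    for (std::uint32_t i = 0; i < draws.size(); ++i)
    {
        if (buckets.empty() || buckets.back().pipelineId != draws[i].pipelineId)
            buckets.push_back({draws[i].pipelineId, i, 0});
        ++buckets.back().drawCount;
    }
    return buckets;
}
```

Each bucket would then map to one pipeline bind followed by one indirect draw covering drawCount entries. Since the draw ID restarts at zero for every indirect draw command, firstDraw would presumably have to reach the shader separately, for example through a push constant.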
With this, vertex shader code to transform the vertex could look like this:

DrawData dd = drawData[gl_DrawIDARB];
TransformData td = transformData[dd.transformOffset];
vec4 positionLocal = vec4(positionData[gl_VertexIndex + dd.vertexOffset], 1.0);
vec3 positionWorld = mat4x3(td.transform[0], td.transform[1], td.transform[2])
    * positionLocal;

⁴ Depending on the GPU architecture, it might also be beneficial to pass some of the indices, like
the material index or vertex data offset, via push constants to reduce the number of memory
indirections in vertex/fragment shaders.

Fragment shader code to sample material textures could look like this:

DrawData dd = drawData[drawId];
MaterialData md = materialData[dd.materialIndex];
vec4 albedo = texture(sampler2D(materialTextures[md.albedoTexture], albedoSampler),
    uv * vec2(md.tilingX, md.tilingY));

This scheme minimizes the CPU-side overhead. Of course, fundamentally it’s a
balance between multiple factors:
• While the scheme can be extended to multiple formats of material, draw, and
vertex data, it gets harder to manage.
• Using storage buffers exclusively instead of uniform buffers can increase GPU
time on some architectures.
• Fetching texture descriptors from an array indexed by material data, which is itself
indexed by material index, can add an extra indirection on the GPU compared to
some alternative designs.
• On some hardware, various descriptor set limits may make this technique
impractical to implement; to be able to index an arbitrary texture dynamically from
the shader, maxPerStageDescriptorSampledImages should be large enough to
accommodate all material textures. While many desktop drivers expose a large limit
here, the specification only guarantees a limit of 16, so bindless remains out of
reach on some hardware that otherwise supports Vulkan.
As renderers get more and more complex, bindless designs will become more
involved and eventually allow moving even larger parts of the rendering pipeline to the
GPU; due to hardware constraints this design is not practical on every single
Vulkan-compatible device, but it’s definitely worth considering when designing new
rendering paths for future hardware.

4.3 Command Buffer Recording and Submission


In older APIs, there is a single timeline for GPU commands; commands issued on the
CPU execute on the GPU in the same order, as there is generally only one thread
recording them; there is no precise control over when the CPU submits commands to
the GPU, and the driver is expected to manage memory used by the command stream as
well as submission points optimally.
In contrast, in Vulkan the application is responsible for managing command buffer
memory, recording commands in multiple threads into multiple command buffers, and
submitting them for execution with appropriate granularity. While with carefully
written code a single-core Vulkan renderer can be significantly faster than older APIs,
peak efficiency and minimal latency are obtained by utilizing many cores in the system
for command recording, which requires careful memory management.

4.3.1 Mental Model


Similarly to descriptor sets, command buffers are allocated out of command pools; it’s
valuable to understand how a driver might implement this to be able to reason about
the costs and usage implications.
A command pool has to manage memory that will be filled with commands by the CPU
and subsequently read by the GPU command processor. The amount of memory used by
the commands can’t be statically determined; a typical implementation of a pool would
thus involve a free list of fixed-size pages. A command buffer would contain a list of
pages with actual commands, with special jump commands that transfer control from
each page to the next one so that the GPU can execute all of them in sequence. Whenever
a command needs to be allocated from a command buffer, it will be encoded into the
current page; if the current page doesn’t have space, the driver would allocate the next
page using the free list from the associated pool, encode a jump to that page into the
current page, and switch to the next page for subsequent command recording.
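
The page-based model above can be made concrete with a toy implementation. This is a sketch of how a driver might behave, not a description of any real driver; ToyCommandPool, ToyCommandBuffer, and the page and jump sizes are all hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// All sizes are illustrative; real drivers choose their own.
constexpr std::size_t kPageSize = 64; // bytes per page
constexpr std::size_t kJumpSize = 8;  // bytes reserved for the jump command

struct Page
{
    std::size_t used = 0; // bytes of encoded commands in this page
};

class ToyCommandPool
{
public:
    // Take a page from the free list, or grow the pool if the list is empty.
    Page* acquirePage()
    {
        if (!freeList_.empty())
        {
            Page* page = freeList_.back();
            freeList_.pop_back();
            page->used = 0;
            return page;
        }
        pages_.push_back(std::make_unique<Page>());
        return pages_.back().get();
    }

    void releasePage(Page* page) { freeList_.push_back(page); }

    std::size_t pagesAllocated() const { return pages_.size(); }
    std::size_t freePages() const { return freeList_.size(); }

private:
    std::vector<std::unique_ptr<Page>> pages_;
    std::vector<Page*> freeList_;
};

class ToyCommandBuffer
{
public:
    explicit ToyCommandBuffer(ToyCommandPool& pool) : pool_(pool) {}

    // Encode a command of `size` bytes (assumed to fit in an empty page).
    void encode(std::size_t size)
    {
        bool full = pages_.empty() || pages_.back()->used + size + kJumpSize > kPageSize;
        if (full)
        {
            Page* next = pool_.acquirePage();
            if (!pages_.empty())
            {
                pages_.back()->used += kJumpSize; // jump from the old page to the new one
                ++jumps_;
            }
            pages_.push_back(next);
        }
        pages_.back()->used += size;
    }

    // Models a free operation that returns pages to the pool's free list.
    void release()
    {
        for (Page* page : pages_) pool_.releasePage(page);
        pages_.clear();
        jumps_ = 0;
    }

    std::size_t pageCount() const { return pages_.size(); }
    std::size_t jumpCount() const { return jumps_; }

private:
    ToyCommandPool& pool_;
    std::vector<Page*> pages_;
    std::size_t jumps_ = 0;
};
```

Nothing here is synchronized, which mirrors the reason a pool can be used from only one thread at a time: both the free list and the current page are plain unsynchronized state.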
Each command pool can only be used from one thread at a time, so the operations
above don’t need to be thread-safe⁵. Freeing the command buffer using
vkFreeCommandBuffers may return the pages used by the command buffer to the
pool by adding them to the free list. Resetting the command pool may put all pages
used by all command buffers into the pool free list; when
VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT is used, the pages can be returned to
the system so that other pools can reuse them.
Note that there is no guarantee that vkFreeCommandBuffers actually returns
memory to the pool; alternative designs may involve multiple command buffers
allocating chunks within larger pages, which would make it hard for
vkFreeCommandBuffers to recycle memory. Indeed, on one mobile vendor’s driver,
vkResetCommandPool is necessary to reuse memory for future command recording in a
default setup, when pools are created without
VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT.

4.3.2 Multithreaded Command Recording


Two crucial restrictions in Vulkan for command pool usage are:

• Command buffers allocated from one pool may not be recorded concurrently by
multiple threads.

• Command buffers and pools cannot be freed or reset while the GPU is still
executing the associated commands.

⁵ Regrettably, Vulkan doesn’t provide a way for the driver to implement thread-safe command
buffer recording so that one command pool can be reused between threads; in the scheme
described, cross-thread synchronization is only required for switching pages, which is relatively
rare and can be lock-free for the most part.

Because of these, a typical threading setup requires a set of command buffer pools.
The set has to contain F * T pools, where F is the frame queue length—F is usually 2
(one frame is recorded by the CPU while another frame is being executed by the GPU)
or 3; T is the number of threads that can concurrently record commands, which can be
as high as the core count on the system. When recording commands from a thread, the
thread needs to allocate a command buffer using the pool associated with the current
frame and thread and record commands into it. Assuming that command buffers aren’t
recorded across a frame boundary, and that at a frame boundary the frame queue length
is enforced by waiting for the last frame in the queue to finish executing, we can then
free all command buffers allocated for that frame and reset all associated command
pools.
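
The F * T arrangement amounts to a two-dimensional lookup keyed by frame slot and thread. Below is a minimal C++ sketch; PoolSet and ToyPool are hypothetical stand-ins (a real ToyPool would wrap a VkCommandPool), and the GPU wait that must precede beginFrame is assumed to happen elsewhere.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a command pool; only tracks what the sketch needs.
struct ToyPool
{
    int resets = 0;
    int buffersAllocated = 0;
};

class PoolSet
{
public:
    PoolSet(std::size_t frameQueueLength, std::size_t threadCount)
        : frames_(frameQueueLength)
        , threads_(threadCount)
        , pools_(frameQueueLength * threadCount)
    {
    }

    // Pool for a given frame number and recording thread.
    ToyPool& poolFor(std::size_t frameNumber, std::size_t threadIndex)
    {
        return pools_[(frameNumber % frames_) * threads_ + threadIndex];
    }

    // Call only after the GPU has finished the frame that last used this
    // slot; resets every pool that belongs to the slot.
    void beginFrame(std::size_t frameNumber)
    {
        for (std::size_t t = 0; t < threads_; ++t)
        {
            ToyPool& pool = poolFor(frameNumber, t);
            ++pool.resets;
            pool.buffersAllocated = 0;
        }
    }

    std::size_t poolCount() const { return pools_.size(); }

private:
    std::size_t frames_;
    std::size_t threads_;
    std::vector<ToyPool> pools_;
};
```

With F = 2 and T = 4 this yields eight pools; frames 0, 2, 4, ... share one slot and frames 1, 3, 5, ... share the other, so a thread never touches a pool the GPU may still be reading.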
Additionally, instead of freeing command buffers, it’s possible to reuse them after
calling vkResetCommandPool, which means that command buffers don’t have
to be allocated again. While in theory allocating command buffers could be cheap,
some driver implementations have a measurable overhead associated with command
buffer allocation. This also makes sure that the driver never needs to return command
memory to the system, which can make submitting commands into these buffers
cheaper.
Note that depending on the frame structure, the setup above may result in unbal-
anced memory consumption across threads; for example, shadow draw calls typically
require less setup and less command memory. When combined with effectively random
workload distribution across threads that many job schedulers produce, this can result
in all command pools getting sized for the worst-case consumption. If an application
is memory constrained and this becomes a problem, it’s possible to limit the parallel-
ism for each individual pass and select the command buffer/pool based on the recorded
pass to limit the waste.
This requires introducing the concept of size classes to the command buffer
manager. With a command pool per thread and manual reuse of allocated command
buffers as suggested above, it’s possible to keep a free list per size class, with size classes
defined based on the number of draw calls (e.g., “<100”, “100–400”, etc.) and/or the
complexity of individual draw calls (depth-only, gbuffer). Picking the buffer based on
the expected usage leads to more stable memory consumption. Additionally, for
passes that are too small it is worthwhile to reduce the parallelism when recording
them; for example, if a pass has <100 draw calls, instead of splitting it into 4 recording
jobs on a 4-core system, it can be more efficient to record it in one job since that can
reduce the overhead of command memory management and command buffer
submission.
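
A per-thread cache with one free list per size class could look roughly like the following C++ sketch. The class boundaries reuse the example thresholds from the text; RecycledBuffer, PerThreadBufferCache and recordingJobs are hypothetical names, and the buffers are placeholders rather than real command buffers.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

enum class SizeClass { Small, Medium, Large };
constexpr std::size_t kSizeClassCount = 3;

// Size classes by expected draw call count: <100, 100-400, >400.
inline SizeClass classify(std::size_t expectedDrawCalls)
{
    if (expectedDrawCalls < 100) return SizeClass::Small;
    if (expectedDrawCalls <= 400) return SizeClass::Medium;
    return SizeClass::Large;
}

// Placeholder for a command buffer that remembers its size class.
struct RecycledBuffer
{
    SizeClass sizeClass;
};

class PerThreadBufferCache
{
public:
    // Reuse a buffer of the matching class if one is free, else create one.
    RecycledBuffer* acquire(std::size_t expectedDrawCalls)
    {
        SizeClass sizeClass = classify(expectedDrawCalls);
        auto& freeList = freeLists_[static_cast<std::size_t>(sizeClass)];
        if (!freeList.empty())
        {
            RecycledBuffer* buffer = freeList.back();
            freeList.pop_back();
            return buffer;
        }
        owned_.push_back(std::make_unique<RecycledBuffer>(RecycledBuffer{sizeClass}));
        return owned_.back().get();
    }

    // Return a buffer to the free list of its own class (after a pool reset).
    void recycle(RecycledBuffer* buffer)
    {
        freeLists_[static_cast<std::size_t>(buffer->sizeClass)].push_back(buffer);
    }

private:
    std::array<std::vector<RecycledBuffer*>, kSizeClassCount> freeLists_;
    std::vector<std::unique_ptr<RecycledBuffer>> owned_;
};

// Small passes are recorded in a single job; larger ones use every core.
inline std::size_t recordingJobs(std::size_t drawCalls, std::size_t cores)
{
    return drawCalls < 100 ? 1 : cores;
}
```

Because a buffer that previously held a large pass is only handed out to another large pass, the memory each buffer has grown to stays roughly matched to its next use.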

4.3.3 Command Buffer Submission


While it’s important to record multiple command buffers on multiple threads for effi-
ciency, since state isn’t reused across command buffers and there are other scheduling
limitations, command buffers need to be reasonably large to make sure GPU is not idle
during command processing. Additionally, each submission has some overhead both
on the CPU side and on the GPU side. In general a Vulkan application should target
<10 submits per frame (with each submit accounting for 0.5 ms or more of GPU work-
load), and <100 command buffers per frame (with each command buffer accounting
for 0.1 ms or more of GPU workload). This might require adjusting the concurrency
limits for command recording for individual passes, e.g., if a shadow pass for a specific
light has <100 draw calls, it might be necessary to limit the concurrency on the record-
ing for this pass to just one thread; additionally, for even shorter passes combining them
with neighboring passes into one command buffer becomes beneficial. Finally, the
fewer submissions a frame has the better, although this needs to be balanced against
submitting enough GPU work earlier in the frame to increase CPU and GPU
parallelism; for example, it might make sense to submit all command buffers for
shadow rendering before recording commands for other parts of the frame.
Crucially, the number of submissions refers to the total number of VkSubmitInfo
structures submitted in all vkQueueSubmit calls in a frame, not to the number of
vkQueueSubmit calls per se. For example, when submitting 10 command buffers, it’s
much more efficient to use one VkSubmitInfo that submits 10 command buffers than
10 VkSubmitInfo structures with one command buffer each, even if in
both cases only one vkQueueSubmit call is performed. Essentially, VkSubmitInfo
is a unit of synchronization/scheduling on the GPU since it has its own set of fences/
semaphores.

4.3.4 Secondary Command Buffers


When one of the render passes in the application contains a lot of draw calls, such as
the gbuffer pass, for CPU submission efficiency it’s important to split the draw calls
into multiple groups and record them on multiple threads. There are two ways to do
this:

• Record primary command buffers that render chunks of draw calls into the same
framebuffer, using vkCmdBeginRenderPass and vkCmdEndRenderPass; execute
the resulting command buffers using vkQueueSubmit (batching submits
for efficiency).

• Record secondary command buffers that render chunks of draw calls, passing
the render pass to vkBeginCommandBuffer along with
VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT; use vkCmdBeginRenderPass
with VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS in the
primary command buffer, followed by vkCmdExecuteCommands to execute all
recorded secondary command buffers.

While on immediate mode GPUs the first approach can be viable, and it can be a bit
easier to manage with respect to synchronization points on the CPU, it’s vital to use
the second approach on GPUs that use tiled rendering instead. Using the first approach
on tilers would require that the contents of the tiles are flushed to memory and loaded
back from memory between each command buffer, which is catastrophic for perfor-
mance.

4.3.5 Command Buffer Reuse


With the guidance on command buffer submission above, in most cases submitting
a single command buffer multiple times after recording becomes impractical. In
general, approaches that pre-record command buffers for parts of the scene are
counterproductive: they can result in excessive GPU load due to the inefficient culling
required to keep the command buffer workload large, and they can trigger inefficient
code paths on some tiled renderers. Instead, applications should focus on improving
the threading and draw call submission cost on the CPU. As such, applications should
use VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT to make sure the driver has
the freedom to generate commands that don’t need to be replayed more than once.
There are occasional exceptions to this rule. For example, for VR rendering, an
application might want to record the command buffer for the combined frustum of
the left and right eyes once. If the per-eye data is read out of a single uniform buffer,
this buffer can then be updated between the command buffers using
vkCmdUpdateBuffer, followed by vkCmdExecuteCommands if secondary command buffers
are used, or vkQueueSubmit. Having said that, for VR it might be worthwhile to explore
the VK_KHR_multiview extension if available, since it should allow the driver to
perform a similar optimization.

4.4 Pipeline Barriers


Pipeline barriers remain one of the most challenging parts of Vulkan code. In older
APIs, the runtime and driver were responsible for making sure appropriate
hardware-specific synchronization was performed in case of hazards such as a fragment
shader reading from a texture that was previously rendered to. This required meticulous
tracking of every single resource binding and resulted in an unfortunate mix of
excessive CPU overhead and a sometimes excessive amount of GPU synchronization
(for example, a Direct3D 11 driver typically inserts a barrier between any two
consecutive compute dispatches that use the same UAV, even though depending on the
application logic the hazards may be absent). Because inserting barriers quickly and
optimally can require knowledge about the application’s use of resources, Vulkan
requires the application to do this.
For optimal rendering, the pipeline barrier setup must be perfect. A missing barrier
risks the application encountering a timing-dependent bug on an untested—or, worse,
not-yet-existing—architecture, that in the worst case could cause a GPU crash. An un-
necessary barrier can reduce the GPU utilization by reducing potential opportunity for
parallel execution—or, worse, trigger very expensive decompression operations or the
like. To make matters worse, while the cost of excessive barriers can be now visualized
by tools like Radeon GPU Profiler, missing barriers are generally not detected by
validation tools.
Because of this, it’s vital to understand the behavior of barriers, the consequences
of overspecifying them, as well as how to work with them.

4.4.1 Mental Model


The specification describes barriers in terms of execution dependencies and memory
visibility between pipeline stages (e.g., a resource was previously written to by a com-
pute shader stage, and will be read by the transfer stage), as well as layout changes for
images (e.g., a resource was previously in the format that is optimal to write via the
color attachment output and should be transitioned to a format that is optimal to read
from the shader). However, it might be easier to think about barriers in terms of their
consequences—as in, what can happen on a GPU when a barrier is used. Note that the
GPU behavior is of course dependent on the specific vendor and architecture, but it
helps to map barriers that are specified in an abstract fashion to more concrete con-
structs to understand their performance implications.
A barrier can cause three different things to happen:

1. Stalling execution of a specific stage until another stage is drained of all current
work. For example, if a render pass renders data to a texture, and a subsequent
render pass uses a vertex shader to read from this texture, the GPU must wait for all
pending fragment shader and ROP work to complete before launching shader
threads for the vertex work in the subsequent pass. Most barrier operations will
lead to execution stalling for some stages.⁶

2. Flushing or invalidating an internal GPU-side cache and waiting for the
memory transactions to finish to make sure another stage can read the resulting
work. For example, on some architectures ROP writes might go through the L2
texture cache, but the transfer stage might operate directly on memory. If a texture
has been rendered to in a render pass, then the following transfer operation
might read stale data unless the cache is flushed before the copy. Similarly, if a
texture stage needs to read an image that was copied using the transfer stage, the L2
texture cache may need to get invalidated to make sure it doesn’t contain stale
data. Not all barrier operations will need to do this.

3. Converting the format the resource is stored in, most commonly to decompress
the resource storage. For example, MSAA textures on some architectures are
stored in a compressed form where each pixel has a sample mask indicating
how many unique colors this pixel contains, and a separate storage for sample
data. The transfer stage or a shader stage might be unable to read directly from a
compressed texture, so a barrier that transitions from
VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL to
VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL or
VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL might need to decompress the texture,
writing all samples for all pixels to memory. Most barrier operations won’t need to
do this, but the ones that do can be incredibly expensive.

⁶ It’s crucial to note that the commonly held belief that individual draw calls execute in isolation
without overlap with other work is wrong; GPUs commonly run subsequent draw calls in
parallel across render state, shader, and even render target switches.

With this in mind, let’s try to understand the guidance for using barriers.

4.4.2 Performance Guidelines


When generating commands for each individual barrier, the driver only has a local
view of the barrier and is unaware of past or future barriers. Because of this, the first
important rule is that barriers need to be batched as aggressively as possible. Given a
barrier that implies a wait-for-idle for fragment stage and an L2 texture cache flush, the
driver will dutifully generate that every time you call vkCmdPipelineBarrier. If you
specify multiple resources in a single vkCmdPipelineBarrier call, the driver will
only generate one L2 texture cache flush command if it’s necessary for any transitions,
reducing the cost.
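
One way to make batching the default is to queue transitions and emit them together. The C++ sketch below is a toy model with no Vulkan dependency: BarrierBatcher, ImageTransition and the stage constants are hypothetical, and the emit callback stands in for a single vkCmdPipelineBarrier call carrying several image barriers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for Vulkan pipeline stage flags; the values are arbitrary.
using StageMask = std::uint32_t;
constexpr StageMask kStageColorOutput = 1u << 0;
constexpr StageMask kStageFragment    = 1u << 1;
constexpr StageMask kStageCompute     = 1u << 2;

struct ImageTransition
{
    std::uint32_t imageId;
    int oldLayout;
    int newLayout;
};

// Everything that would go into one barrier call.
struct BatchedBarrier
{
    StageMask srcStages = 0;
    StageMask dstStages = 0;
    std::vector<ImageTransition> images;
};

class BarrierBatcher
{
public:
    // Queue a transition instead of emitting it immediately.
    void add(StageMask src, StageMask dst, const ImageTransition& transition)
    {
        pending_.srcStages |= src;
        pending_.dstStages |= dst;
        pending_.images.push_back(transition);
    }

    // Emit all queued transitions through one call to `emit`.
    // Returns false if there was nothing to emit.
    template <typename Emit>
    bool flush(Emit&& emit)
    {
        if (pending_.images.empty()) return false;
        emit(pending_);
        pending_ = BatchedBarrier{};
        return true;
    }

private:
    BatchedBarrier pending_;
};
```

Merging stage masks this way can make an individual dependency wider than strictly necessary; whether that is an acceptable price for fewer barrier commands is an assumption of this sketch and would need profiling.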
To make sure the cost of the barriers isn’t higher than it needs to be, only relevant
stages need to be included. For example, one of the most common barrier types is one
that transitions a resource from VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL. When specifying this barrier,
you should specify the shader stages that will actually read this resource via
dstStageMask. It’s tempting to specify the stage mask as
VK_PIPELINE_STAGE_ALL_COMMANDS_BIT to support compute shader or vertex shader
reads. Doing so, however, would mean that vertex shader workload from the subsequent
draw commands cannot start, which is problematic:

• On immediate mode renderers, this slightly reduces the parallelism between
draw calls, requiring all fragment threads to finish before vertex threads can
start, which leads to GPU utilization dropping to 0 at the end of the pass and
gradually rising from 0 to, hopefully, 100% as the next render pass begins;

• On tiled mode renderers, for some designs the expectation is that all vertex work
from the subsequent pass executes to completion before fragment work can
start; waiting for fragment work to end before any vertex work can begin thus
completely eliminates the parallelism between vertex and fragment stages and is one
of the largest potential performance problems that a naively ported Vulkan title
can encounter.

Note that even if the barriers are specified correctly (in this case, assuming the
texture is read from the fragment stage, dstStageMask should be
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT), the execution dependency is still present,
and it can still lead to reduced GPU utilization. This can come up in multiple situations
including compute, where to read data in a compute shader that was generated by
another compute shader you need to express an execution dependency between CS and
CS, but specifying a pipeline barrier is guaranteed to drain the GPU of compute work
entirely, followed by slowly filling it with compute work again. Instead, it can be
worthwhile to specify the dependency via what’s called a split barrier: instead of using
vkCmdPipelineBarrier, use vkCmdSetEvent after the write operation completes, and
vkCmdWaitEvents before the read operation starts. Of course, using vkCmdWaitEvents
immediately after vkCmdSetEvent is counter-productive and can be slower
than vkCmdPipelineBarrier; instead you should try to restructure your algorithm to
make sure there’s enough work submitted between Set and Wait, so that by the time
the GPU needs to process Wait, the event is most likely already signaled and there is no
efficiency loss.
Alternatively, in some cases the algorithm can be restructured to reduce the num-
ber of synchronization points while still using pipeline barriers, making the overhead
less significant. For example, a GPU-based particle simulation might need to run two
compute dispatches for each particle effect: one to emit new particles, and another one
to simulate particles. These dispatches require a pipeline barrier between them to syn-
chronize execution, which requires a pipeline barrier per particle system if particle sys-
tems are simulated sequentially. A more optimal implementation would first submit all
dispatches to emit particles (that would not depend on each other), then submit a bar-
rier to synchronize emission and simulation dispatches, then submit all dispatches to
simulate particles—which would keep GPU well utilized for longer. From there on
using split barriers could help completely hide the synchronization cost.
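
The difference between the two orderings is easy to see by recording them as abstract command lists. This C++ sketch only models the command stream; Cmd, recordSequential and recordBatched are hypothetical names, and no GPU work is involved.

```cpp
#include <cstddef>
#include <vector>

enum class Cmd { Emit, Simulate, Barrier };

// Each particle system is emitted and simulated back to back, which needs
// one barrier per system.
std::vector<Cmd> recordSequential(std::size_t systems)
{
    std::vector<Cmd> commands;
    for (std::size_t i = 0; i < systems; ++i)
    {
        commands.push_back(Cmd::Emit);
        commands.push_back(Cmd::Barrier);
        commands.push_back(Cmd::Simulate);
    }
    return commands;
}

// All emission dispatches first, one barrier, then all simulation dispatches.
std::vector<Cmd> recordBatched(std::size_t systems)
{
    std::vector<Cmd> commands;
    for (std::size_t i = 0; i < systems; ++i) commands.push_back(Cmd::Emit);
    if (systems > 0) commands.push_back(Cmd::Barrier);
    for (std::size_t i = 0; i < systems; ++i) commands.push_back(Cmd::Simulate);
    return commands;
}

std::size_t countBarriers(const std::vector<Cmd>& commands)
{
    std::size_t count = 0;
    for (Cmd c : commands)
        if (c == Cmd::Barrier) ++count;
    return count;
}
```

For eight particle systems the sequential recording contains eight barriers while the batched one contains a single barrier, with the same sixteen dispatches in both cases.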
As far as resource decompression goes, it’s hard to give general advice: on some
architectures this never happens, and on some it does but, depending on the algorithm,
it might not be avoidable. Using vendor-specific tools such as Radeon GPU Profiler
is critical to understanding the performance impact decompression has on your frame;
in some cases, it may be possible to adjust the algorithm to not require the
decompression in the first place, for example by moving the work to a different stage.
Of course, it should be noted that resource decompression may happen in cases where
it’s completely unnecessary and is a result of overspecifying barriers. For example, if
you render to a framebuffer that contains a depth buffer and never read depth contents
in the future, you should leave the depth buffer in the
VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL layout instead of needlessly
transitioning it into VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL, which might
trigger a decompression (remember, the driver doesn’t know if you are going to read
the resource in the future!).

4.4.3 Simplifying Barrier Specification


With all the complexity involved in specifying barriers, it helps to have examples of
commonly required barriers. Fortunately, Khronos Group provides many examples of
valid and optimal barriers for various types of synchronization as part of Vulkan-Docs
repository on GitHub:

https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples

These can serve to improve the understanding of general barrier behavior, and can also
be used directly in a shipping application.
Additionally, for cases not covered by these examples and, in general, to simplify
the specification code and make it more correct, it is possible to switch to a simpler
model where, instead of fully specifying access masks, stages, and image layouts, the
only concept that needs to be known about a resource is the resource state that
encapsulates the stages that can use the resource and the usage mode for the most
common types of access. Then all transitions involve transitioning a resource from
state A to state B, which is much easier to understand. To that end, Tobias Hector, a
member of Khronos Group and a co-author of the Vulkan specification, wrote an
open-source library, simple_vulkan_synchronization, that translates resource state
(otherwise known as access type in the library) transitions into Vulkan barrier
specification. The library is small and simple and provides support for split barriers
as well as full pipeline barriers.
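
The state-based model can be illustrated with a small lookup table. The C++ sketch below is not the API of simple_vulkan_synchronization; ResourceState, describe and transition are hypothetical names, only four states are covered, and the Vulkan enums are kept as strings so that the example compiles without the Vulkan headers.

```cpp
#include <string>

// A deliberately small set of resource states for illustration.
enum class ResourceState
{
    ColorAttachmentWrite,
    FragmentShaderSampledRead,
    ComputeShaderWrite,
    TransferRead,
};

// Names of the Vulkan stage, access, and layout enums a state expands to.
struct StateInfo
{
    const char* stage;
    const char* access;
    const char* layout;
};

inline StateInfo describe(ResourceState state)
{
    switch (state)
    {
    case ResourceState::ColorAttachmentWrite:
        return {"VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT",
                "VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT",
                "VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL"};
    case ResourceState::FragmentShaderSampledRead:
        return {"VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT",
                "VK_ACCESS_SHADER_READ_BIT",
                "VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL"};
    case ResourceState::ComputeShaderWrite:
        return {"VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT",
                "VK_ACCESS_SHADER_WRITE_BIT",
                "VK_IMAGE_LAYOUT_GENERAL"};
    case ResourceState::TransferRead:
        return {"VK_PIPELINE_STAGE_TRANSFER_BIT",
                "VK_ACCESS_TRANSFER_READ_BIT",
                "VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL"};
    }
    return {"", "", ""};
}

// Source and destination halves of a barrier derived from two states.
struct BarrierDesc
{
    StateInfo src;
    StateInfo dst;
    bool layoutChange;
};

inline BarrierDesc transition(ResourceState from, ResourceState to)
{
    StateInfo src = describe(from);
    StateInfo dst = describe(to);
    return {src, dst, std::string(src.layout) != dst.layout};
}
```

The caller only names the previous and next state; the stage masks, access masks and layouts follow from the table, which keeps the three in agreement by construction.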

4.4.4 Predicting the Future with Render Graphs


The performance guidelines outlined in the previous section are hard to follow in prac-
tice, especially given conventional immediate mode rendering architectures.
To make sure that the stages and image layout transitions are not overspecified, it’s
important to know how the resource is going to be used in the future. If you want to
emit a pipeline barrier after a render pass ends, without this information you’re
generally forced to emit a barrier with all stages in the destination stage mask, and an
inefficient target layout.
To solve this problem, it’s tempting to instead emit the barriers before the resource
is read, since at that point it’s possible to know how the resource was written to;
however, this makes it hard to batch barriers. For example, in a frame with 3 render
passes, A, B, and C, where C reads A’s output and B’s output in two separate draw calls,
to minimize the number of texture cache flushes and other barrier work it’s generally
beneficial to specify a barrier before C that correctly transitions outputs of both A and
B; instead, what would happen is that there’s a barrier before each of C’s draw calls.
Split barriers in some cases can reduce the associated costs, but in general just-in-time
barriers will be overly expensive.
Additionally, using just-in-time barriers requires tracking the resource state to
know the previous layout; this is very hard to do correctly in a multithreaded system
since the final execution order on GPU can only be known once all commands are
recorded and linearized.
Due to the aforementioned problems, many modern renderers are starting to ex-
periment with render graphs as a way to declaratively specify all dependencies between
frame resources. Based on the resulting DAG structure, it’s possible to establish correct
barriers, including barriers required for synchronizing work across multiple queues,
and allocate transient resources with minimal use of physical memory.
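
To show how declared dependencies turn into batched barriers, here is a deliberately small C++ sketch. It assumes passes are listed in final execution order, identifies resources by integer id, and handles only write-then-read hazards; PassDecl, PlannedBarrier and planBarriers are hypothetical names, and layouts, queues and memory aliasing are ignored.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A pass declares up front which resources it reads and writes.
struct PassDecl
{
    std::string name;
    std::vector<int> reads;
    std::vector<int> writes;
};

// One batched barrier, placed before the pass with the given index.
struct PlannedBarrier
{
    std::size_t beforePass;
    std::vector<int> resources;
};

// Walks passes in execution order and, for every pass, collects all
// resources that were written earlier and are now read into one barrier.
std::vector<PlannedBarrier> planBarriers(const std::vector<PassDecl>& passes)
{
    std::map<int, bool> writtenSinceBarrier;
    std::vector<PlannedBarrier> plan;

    for (std::size_t i = 0; i < passes.size(); ++i)
    {
        PlannedBarrier barrier{i, {}};
        for (int resource : passes[i].reads)
        {
            auto it = writtenSinceBarrier.find(resource);
            if (it != writtenSinceBarrier.end() && it->second)
            {
                barrier.resources.push_back(resource);
                it->second = false; // now visible to readers
            }
        }
        if (!barrier.resources.empty()) plan.push_back(barrier);

        for (int resource : passes[i].writes) writtenSinceBarrier[resource] = true;
    }
    return plan;
}
```

For the A, B, C example from this section, where C reads the outputs of both A and B, the planner produces a single barrier before C that covers both resources instead of one barrier per draw call.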
A full description of a render graph system is out of scope of this article, but
interested readers are encouraged to refer to the following talks and articles:

• FrameGraph: Extensible Rendering Architecture in Frostbite, Yuriy O’Donnell,
GDC 2017.

• Advanced Graphics Tech: Moving to DirectX 12: Lessons Learned, Tiago Rodrigues,
GDC 2017.

• Render graphs and Vulkan—a deep dive, Hans-Kristian Arntzen.

Different engines make different design choices. For example, the Frostbite render
graph is specified by the application using the final execution order (which the
author of this article finds more predictable and preferable), whereas the other two
presentations linearize the graph based on certain heuristics to try to find a better
execution order. Regardless, the important part is that dependencies between passes
must be declared ahead of time for the entire frame to make sure that barriers can be
emitted appropriately. Importantly, the frame graph systems work well for transient
resources that are limited in number and represent the bulk of required barriers; while
it’s possible to specify barriers required for resource uploads and similar streaming
work as part of the same system, this can make the graphs too complex and the pro-
cessing time too large, so these are generally best handled outside of a frame graph
system.
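
The batching argument can be sketched in a few lines of C++. This is only an illustration, assuming that passes are declared in their final execution order with explicit read and write lists; the names Pass, Barrier and buildBarriers are hypothetical, and a real system would also track image layouts, stage and access masks, write-after-read hazards and queue ownership.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical declaration of one pass: which frame resources it reads and writes.
struct Pass {
    std::string name;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// One batched barrier: every transition that must complete before pass `beforePass`.
struct Barrier {
    std::size_t beforePass;
    std::vector<std::string> resources;
};

// Walks the passes in execution order and emits at most one barrier per pass,
// covering all resources written by earlier passes that this pass reads.
std::vector<Barrier> buildBarriers(const std::vector<Pass>& passes) {
    // true while a resource has been written but not yet made visible to readers
    std::unordered_map<std::string, bool> pendingWrite;
    std::vector<Barrier> barriers;
    for (std::size_t i = 0; i < passes.size(); ++i) {
        Barrier barrier{i, {}};
        for (const std::string& r : passes[i].reads) {
            auto it = pendingWrite.find(r);
            if (it != pendingWrite.end() && it->second) {
                barrier.resources.push_back(r);
                it->second = false; // later readers need no additional barrier
            }
        }
        if (!barrier.resources.empty()) barriers.push_back(barrier);
        for (const std::string& w : passes[i].writes) pendingWrite[w] = true;
    }
    return barriers;
}
```

For the frame with passes A, B and C discussed earlier, this produces a single barrier before C that covers the outputs of both A and B, instead of one barrier per draw call.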

4.5 Render Passes


One concept that is relatively unique to Vulkan compared to both older APIs and new
explicit APIs is render passes. Render passes allow an application to specify a large
part of its render frame as a first-class object, splitting the workload into individual
subpasses and explicitly enumerating dependencies between subpasses to allow the
driver to schedule the work and place appropriate synchronization commands. In that
sense, render passes are similar to the render graphs described above and can be used to
implement them with some limitations (for example, render passes can currently only
express rasterization workloads, which means that multiple render passes must be
used if compute workloads need to be supported). This section, however, will focus
on simpler uses of render passes that are more practical to integrate into existing ren-
derers, and still provide performance benefits.
4.5.1 Load and Store Operations


One of the most important features of render passes is the ability to specify load and
store operations. Using these, the application can choose whether the initial contents
of each framebuffer attachment needs to be cleared, loaded from memory, or remain
unspecified and unused by the application, and whether after the render pass is done
the attachment needs to be stored to memory.
These operations are important to get right. On tiled architectures, redundant
load or store operations waste bandwidth, which reduces performance
and increases power consumption. On non-tiled architectures, the driver can still use these
operations to perform certain optimizations for subsequent rendering. For example, if the
previous contents of an attachment are irrelevant but the attachment has associated
compression metadata, the driver may clear this metadata to make subsequent rendering more
efficient.
To allow maximum freedom for the driver, it’s important to specify the weakest
load/store operations necessary. For example, when rendering a full-screen quad that
writes all pixels of the attachment, VK_ATTACHMENT_LOAD_OP_CLEAR is likely to be faster
than VK_ATTACHMENT_LOAD_OP_LOAD on tiled GPUs, whereas LOAD is likely to be faster on
immediate mode GPUs; specifying VK_ATTACHMENT_LOAD_OP_DONT_CARE lets the driver make
the optimal choice. In some cases VK_ATTACHMENT_LOAD_OP_DONT_CARE can be better than
either LOAD or CLEAR since it allows the driver to avoid an expensive clear operation
for the image contents, but still clear image metadata to accelerate subsequent rendering.
Similarly, VK_ATTACHMENT_STORE_OP_DONT_CARE should be used when the
application does not expect to read the data rendered to the attachment; this is
commonly the case for depth buffers and MSAA targets.
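
The rule of specifying the weakest operations can be written down as a small decision helper. The sketch below is an assumption-laden illustration: LoadOp and StoreOp are stand-ins for the VK_ATTACHMENT_LOAD_OP_* and VK_ATTACHMENT_STORE_OP_* values so that the code compiles without the Vulkan headers, and AttachmentUsage with its fields is a hypothetical description of how a pass uses an attachment.

```cpp
#include <cassert>

// Stand-ins for VK_ATTACHMENT_LOAD_OP_* and VK_ATTACHMENT_STORE_OP_*.
enum class LoadOp { Load, Clear, DontCare };
enum class StoreOp { Store, DontCare };

// Hypothetical per-pass description of how an attachment is used.
struct AttachmentUsage {
    bool readsPreviousContents; // e.g. blending over earlier results
    bool needsClearValue;       // the pass relies on a known initial value
    bool readAfterPass;         // the result is sampled, copied or presented later
};

LoadOp chooseLoadOp(const AttachmentUsage& u) {
    // Relying on old contents and on a clear value at the same time is contradictory.
    assert(!(u.readsPreviousContents && u.needsClearValue));
    if (u.readsPreviousContents) return LoadOp::Load;
    if (u.needsClearValue) return LoadOp::Clear;
    return LoadOp::DontCare; // every pixel is overwritten: let the driver decide
}

StoreOp chooseStoreOp(const AttachmentUsage& u) {
    return u.readAfterPass ? StoreOp::Store : StoreOp::DontCare;
}
```

A full-screen pass that overwrites every pixel maps to DontCare/Store, while a depth buffer that is cleared and never read afterwards maps to Clear/DontCare.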

4.5.2 Fast MSAA Resolve


After rendering data to an MSAA texture, it’s common to resolve it into a non-MSAA
texture for further processing. If fixed-function resolve functionality is sufficient, there
are two ways to implement this in Vulkan:

• Using VK_ATTACHMENT_STORE_OP_STORE for the MSAA texture and
  vkCmdResolveImage after the render pass ends.

• Using VK_ATTACHMENT_STORE_OP_DONT_CARE for the MSAA texture and
  specifying the resolve target via the pResolveAttachments member of
  VkSubpassDescription.

In the latter case, the driver will perform the necessary work to resolve MSAA contents
as part of the work done when the subpass or render pass ends.
The second approach can be significantly more efficient. On tiled architectures,
using the first approach requires storing the entire MSAA texture to main memory,
followed by reading it from memory and resolving to the destination; the second approach
can perform in-tile resolve in the most efficient manner. On immediate mode
architectures, some implementations may not support reading compressed MSAA textures
using the transfer stage. The API requires a transition into the
VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL layout before calling vkCmdResolveImage, which may
lead to decompression of the MSAA texture, wasting bandwidth and performance.
With pResolveAttachments, the driver can perform the resolve operation at maximum
performance regardless of the architecture.
In some cases, fixed-function MSAA resolve is insufficient, and it’s necessary
to transition the texture to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
and do the resolve in a separate render pass. On tiled architectures, this has the same
efficiency issues as the fixed-function vkCmdResolveImage method; on immediate mode
architectures the efficiency depends on the GPU and driver. One possible alternative is to
use an extra subpass that reads the MSAA texture via an input attachment.
For this to work, the first subpass that renders to the MSAA texture has to specify the
MSAA texture via pColorAttachments, with VK_ATTACHMENT_STORE_OP_DONT_CARE
as the store op. The second subpass that performs the resolve needs to specify the
MSAA texture via pInputAttachments and the resolve target via pColorAttachments;
the subpass then needs to render a full-screen quad or triangle with a
shader that uses a subpassInputMS resource to read MSAA data. Additionally, the
application needs to specify a dependency between the two subpasses that indicates the
stage/access masks, similarly to pipeline barriers, and the dependency flag
VK_DEPENDENCY_BY_REGION_BIT. With this, the driver should have enough information
to arrange the execution such that on tiled GPUs, the MSAA contents never leave the
tile memory and instead are resolved in-tile, with the resolve result being written to main
memory⁷. Note that whether this happens depends on the driver, and that the approach is
unlikely to result in significant savings on immediate mode GPUs.

4.6 Pipeline Objects


Older APIs typically split the GPU state into blocks based on functional units.
For example, in Direct3D 11 the full state of the GPU, modulo resource bindings, can be
described using the set of shader objects for various stages (VS, PS, GS, HS, DS) as
well as a set of state objects (rasterizer, blend, depth stencil), input assembly
configuration (input layout, primitive topology) and a few other implicit bits like output render
target formats. The API user could then set individual bits of the state separately, without
regard to the design or complexity of the underlying hardware.
Unfortunately, this model doesn’t match the model that hardware typically uses,
which leads to several performance pitfalls:

⁷ Of course, it’s not guaranteed that the driver will perform this optimization; it depends on the
hardware architecture and driver implementation.
• While an individual state object is supposed to model a part of the GPU state and
  could be directly translated to commands that set up GPU state, on some GPUs
  the configuration of the GPU state requires data from multiple different state
  blocks. Because of this, drivers typically must keep a shadow copy of all state
  and convert the state to the actual GPU commands at the time of Draw/DrawIndexed.
• With the rasterization pipeline getting more complex and gaining more programmable
  stages, some GPUs don’t map them directly to hardware stages,
  which means that the shader microcode can depend on whether other shader
  stages are active and, in some cases, on the specific microcode for other stages;
  this means that the driver might have to compile new shader microcode from
  state that can only be discovered at the time of Draw/DrawIndexed.
• Similarly, on some GPUs, fixed-function units from the API description are
  implemented as part of one of the shader stages, so changing the vertex input format,
  blending setup, or render target format can affect the shader microcode.
  Since the state is only known at the time of Draw/DrawIndexed, this, again, is
  where the final microcode has to be compiled.
While the first problem is more benign, the second and third problems can lead to
significant stalls during rendering since, due to the complexity of modern shaders and shader
compilation pipelines, shader compilation can take tens to hundreds of milliseconds
depending on hardware. To solve this, Vulkan and other new APIs introduce the concept
of a pipeline object, which encapsulates most GPU state, including vertex input format,
render target format, state for all stages and shader modules for all stages. The
expectation is that on every supported GPU, this state is sufficient to build the final shader
microcode and the GPU commands required to set the state up, so the driver never has to
compile microcode at draw time and can optimize pipeline object setup to the extent
possible.
This model, however, presents challenges when implementing renderers on top of
Vulkan. There are multiple ways to solve this problem, with different tradeoffs with
respect to complexity, efficiency, and renderer design.

4.6.1 Just-In-Time Compilation


The most straightforward way to support Vulkan is to use just-in-time compilation for
pipeline objects. In many engines, due to the lack of first-class concepts that match
Vulkan, the rendering backend must gather information about various parts of the pipe-
line state as a result of various state setup calls, similarly to what a Direct3D 11 driver
might do. Then, just before the draw/dispatch where the full state is known, all indi-
vidual bits of state would be grouped together and looked up in a hash table; if there’s
already a pipeline state object in the cache, it can be used directly, otherwise a new
object can be created. This scheme works to get the application running but suffers
from two performance pitfalls.
A minor concern is that the state that needs to be hashed together is potentially
large; doing this for every draw call can be time consuming when the cache already
contains all relevant objects. This can be mitigated by grouping state into objects and
hashing pointers to these objects, and in general simplifying the state specification from
the high-level API point of view.
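
One way to picture the mitigation is a key that stores only pointers to shared state blocks. In this sketch the block types and the PipelineKey layout are hypothetical, and it assumes that the renderer deduplicates state blocks so that equal pointers imply equal state.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical immutable state blocks, created once and shared by many draws.
struct ShaderSet {};
struct RasterState {};
struct BlendState {};

// The per-draw key holds only pointers to the shared blocks, so hashing and
// comparing it touches three machine words instead of the full state.
struct PipelineKey {
    const ShaderSet* shaders;
    const RasterState* raster;
    const BlendState* blend;

    bool operator==(const PipelineKey& o) const {
        return shaders == o.shaders && raster == o.raster && blend == o.blend;
    }
};

struct PipelineKeyHash {
    std::size_t operator()(const PipelineKey& k) const {
        std::size_t h = std::hash<const void*>()(k.shaders);
        // Boost-style hash_combine; any reasonable mixing function would do.
        auto mix = [&h](const void* p) {
            h ^= std::hash<const void*>()(p) + 0x9e3779b9u + (h << 6) + (h >> 2);
        };
        mix(k.raster);
        mix(k.blend);
        return h;
    }
};
```

Such a key can be used directly with std::unordered_map<PipelineKey, T, PipelineKeyHash>.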
A major concern, however, is that for any pipeline state object that must be created,
the driver might need to compile multiple shaders to the final GPU microcode. This
process is time consuming; additionally, it cannot be optimally threaded with a
just-in-time compilation model. If the application only uses one thread for command
submission, this thread would typically also compile pipeline state objects. Even with multiple
threads, several threads would often request the same pipeline object, serializing
compilation, or one thread would need several new pipeline objects, which increases
the overall latency of submission since other threads would finish first and have no
work to do.
For multithreaded submission, accessing the cache can result in contention between
cores even when the cache already contains every required object. Fortunately, this can be
solved by a two-level cache scheme as follows:
The cache has two parts: the immutable part, which never changes during the
frame, and the mutable part. To perform a pipeline cache lookup, we first check if the
immutable cache has the object; this is done without any synchronization. In the event
of a cache miss, we lock a critical section and check if the mutable cache has the
object; if it doesn’t, we unlock the critical section, create the pipeline object, and then
lock it again and insert the object into the cache, potentially displacing another object
(additional synchronization might be required to ensure that, when two threads request the
same object, only one compilation request is issued to the driver). At the end of the frame,
all objects from the mutable cache are added to the immutable cache and the mutable
cache is cleared, so that on the next frame access to these objects can be free-threaded.
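
A minimal C++ sketch of this two-level scheme follows. The Pipeline struct is a stand-in for a real pipeline object, the 64-bit key is assumed to be a hash of the full state, and the class and method names are hypothetical. Unlike the description above, this version keeps the first object inserted for a key rather than displacing it, and it does not prevent two threads from compiling the same key concurrently.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a compiled pipeline object (VkPipeline in a real renderer).
struct Pipeline { std::uint64_t key; };

class TwoLevelPipelineCache {
public:
    using CreateFn = std::function<std::shared_ptr<Pipeline>(std::uint64_t)>;

    explicit TwoLevelPipelineCache(CreateFn create) : create_(std::move(create)) {}

    // May be called from several recording threads during a frame.
    std::shared_ptr<Pipeline> get(std::uint64_t key) {
        // 1. Immutable part: never modified during the frame, so no lock is taken.
        auto hit = immutable_.find(key);
        if (hit != immutable_.end()) return hit->second;

        // 2. Mutable part, checked under the lock.
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = mutable_.find(key);
            if (it != mutable_.end()) return it->second;
        }

        // 3. Miss: create outside the lock so other threads are not blocked.
        //    Two threads may both create the same key; the first insert wins.
        std::shared_ptr<Pipeline> created = create_(key);

        std::lock_guard<std::mutex> lock(mutex_);
        return mutable_.emplace(key, std::move(created)).first->second;
    }

    // Must be called once per frame while no thread is inside get().
    void endFrame() {
        immutable_.insert(mutable_.begin(), mutable_.end());
        mutable_.clear();
    }

private:
    CreateFn create_;
    std::mutex mutex_;
    std::unordered_map<std::uint64_t, std::shared_ptr<Pipeline>> immutable_;
    std::unordered_map<std::uint64_t, std::shared_ptr<Pipeline>> mutable_;
};
```

The lock-free read of the immutable map is only safe because endFrame is the sole writer and runs when no lookups are in flight; that frame boundary is the synchronization point the scheme relies on.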

4.6.2 Pipeline Cache and Cache Pre-warming


While just-in-time compilation can work, it results in a significant amount of stuttering
during gameplay. Whenever an object with a new set of shaders/state enters the frame,
we end up having to compile a pipeline object for it, which can be slow. Direct3D 11
titles face a similar problem; however, in Direct3D 11 the
drivers did a lot of work behind the scenes to try to hide the compilation latency,
precompiling some shaders earlier and implementing custom schemes for patching
bytecode on the fly that didn’t require a full recompilation. In Vulkan, the expectation
is that the application handles pipeline object creation manually and intelligently, so a
naive approach doesn’t work very well.
To make just-in-time compilation more practical, it’s important to use the Vulkan
pipeline cache, serialize it between runs, and pre-warm the in-memory cache described
in the previous section at application startup from multiple threads.
Vulkan provides a pipeline cache object, VkPipelineCache, that can store driver-
specific bits of state and shader microcode to improve compilation time for pipeline
objects. For example, if an application creates two pipeline objects with identical setup
except for culling mode, the shader microcode would typically be the same. To make
sure the driver only compiles the object once, the application should pass the same
instance of VkPipelineCache to vkCreateGraphicsPipelines in both calls, in
which case the first call would compile the shader microcode and the second call would
be able to reuse it. If these calls happen concurrently in different threads, the driver
might still compile the shaders twice since the data would only be added to the cache
when one of the calls finishes.
It’s vital to use the same VkPipelineCache object when creating all pipeline
objects and serialize it to disk between runs using vkGetPipelineCacheData and
the pInitialData member of VkPipelineCacheCreateInfo. This makes sure that the
compiled objects are reused between runs and minimizes the frame spikes during sub-
sequent application runs.
Unfortunately, during the first playthrough the shader compilation spikes will still
occur since the pipeline cache will not contain all used combinations. Additionally,
even when the pipeline cache contains the necessary microcode,
vkCreateGraphicsPipelines isn’t free, and as such the creation of new pipeline objects can
still increase the frame time variance. To solve that, it’s possible to pre-warm the
in-memory cache (and/or VkPipelineCache) during load time.
One possible solution here is that at the end of the gameplay session, the renderer
could save the in-memory pipeline cache data (which shaders were used with which
state⁸) to a database. Then, during QA playthroughs, this database could be populated
with data from multiple playthroughs at different graphics settings etc., effectively
gathering the set of states that are likely to be used during the actual gameplay.
This database can then be shipped with the game; at game startup, the in-memory
cache could be prepopulated with all states created using the data from that database
(or, depending on the amount of pipeline states, this pre-warming phase could be lim-
ited to just the states for the current graphics settings). This should happen on multiple
threads to reduce the load time impact; the first run would still have a longer load time
(which can be further reduced with features like Steam pre-caching), but frame spikes
due to just-in-time pipeline object creation can be mostly avoided.
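
The multithreaded pre-warming step can be sketched as a simple work-claiming loop. This is an illustration under assumptions: recordedKeys stands for the contents of the shipped database, create is a hypothetical thread-safe callback that builds one pipeline state (for example, a lookup in the in-memory cache), and the calling thread takes part in the work alongside the extra worker threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Creates every recorded pipeline state, spreading the work over the calling
// thread plus `extraThreads` workers. `create` must be safe to call concurrently.
void prewarm(const std::vector<std::uint64_t>& recordedKeys,
             unsigned extraThreads,
             const std::function<void(std::uint64_t)>& create) {
    std::atomic<std::size_t> next{0};
    auto work = [&] {
        // Each thread claims the next unprocessed entry until none remain.
        for (std::size_t i = next.fetch_add(1); i < recordedKeys.size(); i = next.fetch_add(1))
            create(recordedKeys[i]);
    };
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < extraThreads; ++t) workers.emplace_back(work);
    work(); // the calling thread helps as well
    for (std::thread& w : workers) w.join();
}
```

Claiming entries one at a time keeps the threads balanced even when some pipeline states take much longer to compile than others.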
If a particular set of state combinations wasn’t discovered during QA play-
throughs, the system can still function correctly—at the expense of some amount of
stuttering. The resulting scheme is more or less universal and practical—but requires a
potentially large effort to play through enough levels with enough different graphics
settings to capture most realistic workloads, making it somewhat hard to manage.

⁸ This can use an application-specific format, or a library like Fossilize
(https://github.com/Themaister/Fossilize).
4.6.3 Ahead of Time Compilation


The “perfect” solution—one that Vulkan was designed for—is to remove just-in-time
compilation caches and pre-warming, and instead just have every single possible pipe-
line object available ahead of time.
This typically requires changing the renderer design and integrating the concept
of the pipeline state into the material system, allowing a material to specify the state
completely. There are different possible designs; this section will outline just one, but
the important thing is the general principle.
An object is typically associated with the material that specifies the graphics state
and resource bindings required to render the object. In this case, it’s important to sep-
arate resource bindings from the graphics state as the goal is to be able to enumerate
all combinations of graphics state in advance. Let’s call the collection of the graphics
state a “technique” (this terminology is intentionally similar to terminology from Di-
rect3D Effect Framework, although there the state was stored in the pass). Techniques
can then be grouped into effects, and a material would be referring to the effect, and to
some sort of key to specify the technique from the effect.
The set of techniques in an effect would be static; the set of
effects would also be static. Effects are not as vital to being able to precompile pipeline
objects as techniques, but can serve as a useful semantic grouping of techniques. For
example, a material is often assigned an effect at material creation time, but the technique
can vary based on where the object is rendered (e.g., shadow pass, gbuffer pass,
reflection pass) or on the gameplay effects active (e.g., highlight).
Crucially, the technique must specify all state required to create a pipeline object,
statically, ahead of time—typically as part of the definition in some text file, whether
in a D3DFX-like DSL, or in a JSON/XML file. It must include all shaders, blend states,
culling states, vertex format, render target formats, depth state. Here’s an example of
how this might look:

technique gbuffer
{
    vertex_shader gbuffer_vs
    fragment_shader gbuffer_fs

#ifdef DECAL
    depth_state less_equal false
    blend_state src_alpha one_minus_src_alpha
#else
    depth_state less_equal true
    blend_state disabled
#endif

    render_target 0 rgba16f
    render_target 1 rgba8_unorm
    render_target 2 rgba8_unorm

    vertex_layout gbuffer_vertex_struct
}

Assuming all draw calls, including ones used for post-effects etc., use the effect
system to specify render state, and assuming the set of effects and techniques is static,
it’s trivial to precreate all pipeline objects—each technique needs just one—at load
time using multiple threads, and at runtime use very efficient code with no need for in-
memory caches or possibility of frame spikes.
In practice, implementing this system in a modern renderer is an exercise in com-
plexity management. It’s common to use complex shader or state permutations—for
example, for two-sided rendering you typically need to change culling state and perhaps
change the shaders to implement two-sided lighting. For skinned rendering, you need
to change vertex format and add some code to the vertex shader to transform the attributes
using skinning matrices. On some graphics settings, you might decide that the render
target format needs to be floating-point R11G11B10 instead of RGBA16F, to
conserve bandwidth. All these combinations multiply and require you to be able to
represent them concisely and efficiently when specifying technique data (for example,
by allowing #ifdef sections inside technique declarations as shown above), and,
importantly, to stay aware of the steadily growing number of combinations, refactoring
and simplifying them as appropriate. Some effects are rare enough that they could
be rendered in a separate pass without increasing the number of permutations. Some
computations are simple enough that always running them in all shaders can be a better
tradeoff than increasing the number of permutations. And some rendering techniques
offer better decoupling and separation of concerns, which can also reduce the number
of permutations.
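
To see how quickly combinations multiply, the following sketch expands a list of optional defines into every on/off combination, each of which would correspond to one pipeline object. The function name and the define names used with it are hypothetical; a real system would typically also prune combinations that can never occur together.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Expands optional defines (e.g. DECAL, SKINNED, TWO_SIDED) into every on/off
// combination; n independent defines yield 2^n technique permutations.
std::vector<std::vector<std::string>> enumeratePermutations(const std::vector<std::string>& defines) {
    assert(defines.size() < 32); // keeps the bit mask in range; real counts should be far smaller
    std::vector<std::vector<std::string>> result;
    const std::size_t count = std::size_t(1) << defines.size();
    for (std::size_t mask = 0; mask < count; ++mask) {
        std::vector<std::string> enabled;
        for (std::size_t bit = 0; bit < defines.size(); ++bit)
            if (mask & (std::size_t(1) << bit)) enabled.push_back(defines[bit]);
        result.push_back(enabled);
    }
    return result;
}
```

Three independent defines already produce eight permutations of a single technique, which is why removing even one define halves the number of pipeline objects to create.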
Importantly though, adding state permutations to the mix makes the problem
harder but doesn’t make it different—many renderers have to solve the problem of a
large number of shader permutations anyway, and once you incorporate all render state
into shader/technique specification and focus on reducing the number of technique per-
mutations, the same complexity management solutions apply equally to both problems.
The benefit of implementing a system like this is perfect knowledge of all required
combinations (as opposed to having to rely on fragile permutation discovery systems),
great performance with minimal frame-to-frame variance including the first load, and
a forcing function to keep the complexity of rendering code at bay.

4.7 Conclusion
The Vulkan API shifts a large amount of responsibility from driver developers onto
application developers. Navigating the landscape of various rendering features becomes
more challenging when many implementation options are available; it’s challenging
enough to write a correct Vulkan renderer, but performance and memory consumption
are paramount. This article tried to discuss various important considerations when dealing
with specific problems in Vulkan, present multiple implementation approaches that
provide different tradeoffs between complexity, ease of use and performance, and span
the range between porting existing renderers to redesigning renderers around Vulkan.
Ultimately, it’s hard to give general advice that works across all vendors and is
applicable to all renderers. For this reason, it’s vital to profile the resulting code on the
target platform/vendor—for Vulkan, it’s important to monitor the performance across
all vendors that the game is planning to ship on as the choices the application makes
are even more important, and in some cases a specific feature, like fixed-function vertex
buffer bindings, is the fast path on one vendor but a slow path on another.
Beyond using validation layers to ensure code correctness and vendor-specific profiling
tools, such as AMD Radeon GPU Profiler or NVIDIA Nsight Graphics, many
open-source libraries that can help optimize your renderer for Vulkan are available:
• VulkanMemoryAllocator. https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.
  Provides convenient and performant memory allocators for Vulkan as well as other
  memory-related algorithms such as defragmentation.
• volk. https://github.com/zeux/volk. Provides an easy way to use Vulkan entry points
  from the driver directly, which can reduce function call overhead.
• simple_vulkan_synchronization. https://github.com/Tobski/simple_vulkan_synchronization.
  Provides a way to specify Vulkan barriers using a simplified
  access type model, which helps balance correctness and performance.
• Fossilize. https://github.com/Themaister/Fossilize. Provides serialization support
  for various Vulkan objects, most notably for pipeline state creation info,
  which can be used to implement pre-warming for a pipeline cache.
• perfdoc. https://github.com/ARM-software/perfdoc. Provides layers, similar to
  validation layers, that analyze the stream of rendering commands and identify
  potential performance problems on ARM GPUs.
Finally, some vendors develop open-source Vulkan drivers for Linux; studying their
sources can help gain more insight into performance of certain Vulkan constructs:
• AMD. https://github.com/GPUOpen-Drivers/. Contains xgl, which has the Vulkan
  driver source, and PAL, which is a library used by xgl; many Vulkan function
  calls end up going through both xgl and PAL.
• AMD. https://github.com/mesa3d/mesa/tree/master/src/amd/vulkan. Contains the
  community-developed open-source radv driver.
• Intel. https://github.com/mesa3d/mesa/tree/master/src/intel/vulkan. Contains the
  Anvil driver.
Acknowledgments
The author wishes to thank Alex Smith (Feral Interactive), Daniel Rákos (AMD), Hans-Kristian
Arntzen (ARM), Matthäus Chajdas (AMD), Wessam Bahnassi (INFramez Technology Corp)
and Wolfgang Engel (CONFETTI) for reviewing the drafts of this article and helping make it
better.
5

glTF—Runtime 3D
Asset Delivery
Marco Hutter

The widespread availability of client devices and web-browsers that are capable of
high-quality 3D rendering offers new application areas that involve 3D assets. These
applications range from online product presentation and configuration websites, virtual
museums, geo-information systems to general virtual environments or 3D video games
on mobile devices. Despite the variety of these applications, they pose many similar
requirements for the 3D assets. In order to meet these requirements, the Khronos Group
has designed glTF, a transmission and delivery format for 3D assets. Its development
is based on the collaborative efforts of 3D content creators and providers, application
and engine developers, and end users. Since the initial release of glTF in 2015, software
companies from different application domains have adopted the format, and it is be-
coming the standard for the delivery of 3D assets for efficient real-time rendering.
This chapter explains the goals and features that are achieved with glTF, and their
technical implementation. The role of glTF in the 3D content creation workflow is laid
out, showing the tools and libraries that are available to support each step of the content
creation process, and how glTF may open up new application areas that rely on the
efficient transfer and rendering of high-quality 3D content.

5.1 The Goals of glTF


The overarching goal of glTF is to be a transmission format for 3D assets. In contrast
to previous 3D file formats, glTF is specifically intended for the efficient delivery of
3D assets to a large variety of clients. This means that the assets have to be stored in a
very compact form, so that they can efficiently be transferred over networks.
The assets should not only be single, geometric objects, but also multiple objects
that are arranged as parts of more complex scenes, including material definitions, ani-
mations and camera configurations. At the client side, it must still be possible to effi-
ciently render the assets, with as little preprocessing as possible.

Another important goal is consistency of the visual representation among different
target platforms. Although a large number of different client platforms should be able
to consume these assets, the final appearance of the rendered objects should be the
same, regardless of the platform, application, and the underlying graphics API.
Referring to these goals, glTF is sometimes called the “JPEG of 3D”: An open
standard format that can efficiently be transferred via networks and displayed on a large
variety of different clients in a device-independent manner: Create once, view every-
where.
The specification of glTF is open and extensible. It is maintained by the Khronos
Group members, on public repositories, and everybody is free to implement or provide
input to the specification. An extension mechanism allows for a careful evolution of
the standard and an adaptation to the needs of end users.

5.2 Design Choices


Figure 5.1 shows the basic structure of a glTF asset. The scene structure and properties
of the asset are defined in JSON format. JSON is supported in all major programming
languages, either natively or via libraries. It is compact and easy to process, but still
can be read or edited by developers and has a rich toolchain support. The JSON con-
tains URIs that refer to files that contain the geometry data and the textures.
The geometry data, like vertex positions and triangle indices, are stored in binary
form. A key aspect here is that the binary data are stored in a form that is directly
understood by modern graphics APIs like OpenGL, WebGL, DirectX and Vulkan. This
means that the data can directly be uploaded to the GPU without further preprocessing.
This minimizes the computational load at the client side and allows a high rendering
performance.

Figure 5.1. The basic structure of a glTF asset. The JSON file contains information about the
asset structure. The geometry data and textures are stored in binary files. Optionally, all files may
be combined into a single glTF binary (GLB) file.
Textures are stored directly as JPG or PNG images. These formats allow for an
efficient network transfer, and can easily and efficiently be decompressed using stand-
ard image processing libraries.
In the default glTF format, the JSON structure, binary data and images are stored
as individual files. In order to simplify the delivery of assets to the end users, there is
the option to combine all data into a single, binary file. Such a file is called a glTF
binary (GLB) file.
The goal of consistency of the visual representation of the rendered objects is
achieved by making Physically Based Rendering (PBR) part of the standard. PBR
relies on a model that describes the surface properties of objects,
like reflectivity and color, in a way that precisely defines the final appearance of
the rendered objects.
There are two common models for PBR: The metallic-roughness model and the
specular-glossiness model. Content creators, researchers and rendering engine devel-
opers analyzed the advantages of both models in view of the goals of glTF. Both mod-
els have similar capabilities regarding the range of materials that they can represent.
But the metallic-roughness model has a lower memory consumption, is easier to im-
plement in the rendering client, and is supported by default in most modern rendering
engines. Therefore, the metallic-roughness model is the model that is used in the core
glTF specification. The specular-glossiness model is supported in glTF via an exten-
sion. For both models, the glTF specification contains a definition of the possible ma-
terial properties, and how to interpret these properties for the light and rendering
computations.
glTF establishes a set of conventions for the assets: The assets are given in a right-
handed coordinate system, the positive y axis is pointing upwards, and the front of a
glTF asset points along the positive z axis. The basic unit for all coordinates and distances
is the meter. Rotations are defined with quaternions, encoding the rotation axis
and angle unambiguously. These design decisions aim at minimizing fragmentation
and ensure that assets can be displayed by all glTF viewers without the need
for additional conversions.

5.3 Feature Summary


The features that are supported by glTF allow the representation of a whole scene,
including geometric objects and their materials, viewport definitions, and animations.
The following sections summarize how these scene elements and their relationships
are represented in a glTF asset.

5.3.1 Scene Structure


The JSON part of a glTF asset contains a description of the overall structure of the
scene and the spatial arrangement of objects. This scene structure is represented using
a hierarchy of nodes. Each node can define a transformation, which can be given as
translation, rotation, and scaling, or as a transformation matrix. Thus, each node de-
fines a local transform, and the concatenation of the transformations of nodes along
one path of the hierarchy defines a global transform. The transformation of the nodes
allows for a free arrangement of the scene elements in the virtual environment.
This arrangement of the scene elements is achieved by attaching the objects to
nodes. In the core glTF specification, the objects that may be attached to nodes are
meshes, skins and cameras. Meshes represent the geometric objects that appear in the
scene, and skins are attached to nodes that define the skeleton of a virtual character.
Both concepts are explained in more detail in the following sections. Orthographic and
perspective cameras can be attached to the nodes in order to define the view configu-
ration for the scene.
Additional types of elements can be attached to nodes by defining an extension.
For example, it is possible to define an extension for point or spot lights that have a
certain position and direction. As a part of the extension, these objects may be attached
to nodes, in order to define their spatial arrangement in the scene.

5.3.2 Meshes
The rendered objects that appear in the scene consist of meshes. Different kinds of
meshes are supported by glTF: The primitive type of the meshes can be points, lines
or triangles, and the meshes can either be indexed or non-indexed. Additionally, each
mesh defines a set of vertex attributes.
The core specification of glTF supports a set of vertex attributes that already cover
most application cases: The positions and normals of the vertices are encoded as 3D
floating point vectors. Tangents that may be required for lighting computations based
on normal maps are given as 4D floating point vectors.
Texture coordinates can be given as 2D vectors. Additionally, vertex colors can be
defined. The colors can be 3D or 4D vectors, consisting of the red, green, blue (and
optional alpha) components of the color. The texture coordinates and vertex colors can
either be floating point values, or 8- or 16-bit integer values that are normalized by the
renderer to the range [0, 1].
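The normalization of integer values can be illustrated with a short sketch. It assumes unsigned integer components, for which dividing by the largest representable value maps the full integer range onto [0, 1]. The function name `normalized_to_float` is a hypothetical helper.

```python
def normalized_to_float(value, bits):
    """Map an unsigned normalized integer of the given bit width to [0, 1]."""
    max_value = (1 << bits) - 1  # 255 for 8 bits, 65535 for 16 bits
    return value / max_value

# A hypothetical 8-bit RGBA vertex color, converted component by component.
color = tuple(normalized_to_float(c, 8) for c in (255, 0, 51, 255))
```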
Further vertex attributes that are supported by the standard are the joint- and
weight information of the vertices that are required for vertex skinning, as explained in
the section about Morphing and Skinning.
Client implementations must at least support the vertex attributes that are defined
in the core specification, which are positions, normals, tangents, vertex colors, two sets
of texture coordinates and joint and weight information. But the specification of glTF
also allows for the definition of additional vertex attributes. This may be additional
texture coordinate sets or vertex colors, but also custom attributes with application-
specific semantics. For example, an application may choose to assign physical proper-
ties like a temperature or pressure to each vertex. A custom renderer implementation
may then use these vertex attributes to offer a color-coded visualization of the physical
property, extending the range of possible application areas of glTF to scientific visual-
izations or engineering applications.
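As an illustration, the JSON description of such a mesh might look like the following sketch, written here as a Python dictionary and serialized with the standard `json` module. The accessor indices are placeholders, and the `_TEMPERATURE` attribute is a hypothetical application-specific attribute. The sketch assumes the glTF 2.0 convention that custom attribute names start with an underscore.

```python
import json

# Hypothetical mesh definition; the numbers refer to placeholder accessors.
mesh = {
    "primitives": [
        {
            "mode": 4,  # 4 = triangles (0 = points, 1 = lines)
            "indices": 0,  # accessor that holds the triangle indices
            "attributes": {
                "POSITION": 1,
                "NORMAL": 2,
                "TANGENT": 3,
                "TEXCOORD_0": 4,
                "COLOR_0": 5,
                "_TEMPERATURE": 6,  # application-specific attribute
            },
        }
    ]
}

mesh_json = json.dumps(mesh, indent=2)
```

A renderer that does not know the custom attribute can simply ignore it and still display the mesh with the standard attributes.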

5.3.3 Morphing and Skinning


Deformations of the meshes may also be stored in the glTF asset. This is achieved by
either defining morph targets, or skinning information.
Morph targets represent different deformation states of a mesh. At runtime, it is
possible to blend between different morph targets, by assigning different weights to
them. This allows a single mesh to be deformed into many different shapes using a
simple combination of morph target weights. A very compact representation of the
different mesh states is achieved in glTF by using sparse accessors, which only store
the deformation itself, referring to the base mesh. The sparse accessor therefore stores
the indices of the vertices that differ from the base mesh, as well as the actual vertex
attribute values of these vertices.
The morph targets in glTF may refer to different vertex attributes: The morph tar-
gets may contain the positions, normals and tangents of the deformed mesh state. The
number of morph targets is not limited, but client implementations must support at
least eight morphed attributes. For assets with a larger number of morphed attributes,
renderers may choose to only use the morph attributes with the highest weights, or to
perform the morphing computation in software.
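The blending of morph targets can be sketched as a weighted sum. The following minimal example assumes that each morph target stores per-vertex displacements relative to the base mesh, and that only positions are morphed. The function name `blend_morph_targets` is hypothetical.

```python
def blend_morph_targets(base_positions, target_deltas, weights):
    """Return base + sum(weight_i * delta_i) for every vertex position."""
    result = []
    for vertex_index, base in enumerate(base_positions):
        blended = list(base)
        for deltas, weight in zip(target_deltas, weights):
            for c in range(3):
                blended[c] += weight * deltas[vertex_index][c]
        result.append(tuple(blended))
    return result

# Two vertices and two hypothetical morph targets.
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
targets = [
    [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)],  # target 0 lifts both vertices
    [(0.0, 0.0, 2.0), (0.0, 0.0, 0.0)],  # target 1 moves only the first vertex
]
morphed = blend_morph_targets(base, targets, [0.5, 0.25])
```

The same weighted sum would apply to morphed normals and tangents, which are omitted here for brevity.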
In order to realistically model virtual characters, there is another way of defining
the mesh deformation, namely vertex skinning. In this case, a mesh is associated with
a skeleton. This skeleton is created from a hierarchy of nodes, which are treated as the
joints between bones of a virtual character. Each of these bones may affect the relative
position of nearby mesh vertices. When the skeleton is animated, the bones will deform
the mesh vertices to achieve a realistic appearance of the skin surface of the animated
character.
The information about how the bones between the joints influence each vertex is
stored as vertex attributes in the mesh: One vertex attribute stores the indices of the
joints that affect the vertex, and one attribute stores the weights that determine how
strongly the respective joint influences the vertex. The joint information is encoded as
a 4D vector containing the indices of these joints inside the skeleton as 8- or 16-bit
unsigned integer values. The weights are stored as 4D vectors that define an affine
combination of the joint influences on the vertex. The weights can either contain floating
point values, or 8- or 16-bit integer values that are normalized to the range [0, 1] by the
implementation. When a vertex is influenced by more than four joints, additional
vertex attributes may be added to encode the additional joint indices and weights.
In addition to the joint indices and weights, the skinning information consists of
inverse bind matrices that are stored as binary data and referred to by the skin definition.
Together, this information allows the joint matrix of each joint to be computed: The
inverse bind matrix transforms vertices into the local coordinate system of each joint,
in its undeformed state. The global transform of the joint applies the actual transfor-
mation. The inverse of the transform of the node that the mesh is attached to transforms
the skinned mesh into its original coordinate system. Combining multiple of these joint
matrices therefore allows combining only the deformations that are caused by the
joints, regardless of the current transformation of the mesh itself.
The actual skinning computation then usually takes place in a vertex shader. The
shader receives the vectors of joint indices and weights as 4D attribute variables, as
well as the joint matrices that describe the current deformations caused by each joint.
The joint matrices for the respective indices are selected and the weights are used to
compute a linear combination of these joint matrices, to obtain the skinning matrix.
The skinning matrix eventually transforms each vertex based on the current pose of the
skeleton.
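The computation described above can be sketched on the CPU as follows. This is a simplified illustration of the linear combination of joint matrices, not actual shader code. It assumes that the joint matrices have already been computed and are given as row-major 4x4 nested lists. The names `skinning_matrix` and `transform_point` are hypothetical.

```python
def skinning_matrix(joints, weights, joint_matrices):
    """Blend the joint matrices selected by the joint indices with the weights."""
    skin = [[0.0] * 4 for _ in range(4)]
    for joint, weight in zip(joints, weights):
        m = joint_matrices[joint]
        for r in range(4):
            for c in range(4):
                skin[r][c] += weight * m[r][c]
    return skin

def transform_point(m, p):
    """Apply a 4x4 matrix to a 3D point."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(3))

# Two hypothetical joints: one at rest, one translated by 2 along x.
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
shifted = [row[:] for row in identity]
shifted[0][3] = 2.0

# The vertex is influenced equally by both joints; unused slots have weight 0.
skin = skinning_matrix((0, 1, 0, 0), (0.5, 0.5, 0.0, 0.0), [identity, shifted])
skinned = transform_point(skin, (1.0, 0.0, 0.0))
```

With equal weights, the vertex ends up halfway between the positions that the two joints would assign to it individually.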

5.3.4 Animations
Animations may be one part of the description of a scene in a glTF asset. So in addition
to the overall structure and contents of a scene, the asset may also contain information
about the behavior of the objects in the scene over time.
These animations may affect the transformation properties of nodes, or the weights
of morph targets. This is achieved by storing key frames that associate time stamps
with a set of values for certain properties. At runtime, the property values that corre-
spond to the current time stamp are read, and passed to the target of the animation—
for example, to the translation property of a node. The key frames may be iterated
through stepwise, interpolated linearly or using cubic splines, which allows for a
smooth, realistic movement of the objects that are attached to the animated nodes.
There may be multiple animations combined in a single glTF asset. The general
structure of a single animation is that it consists of channels and samplers. The sam-
plers define the interpolation behavior as well as the key frame data, which consists of
the time stamps and the values of the animated properties. The channel establishes the
connection between a sampler and the animated target property. The time stamps for a
sampler are given in seconds of animation time. The client application can either ad-
vance the animation time in real time, or offer user controls, for example, for playing
back animations at different speeds.
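The sampling of key frames at a given animation time can be sketched as follows. The example covers only the stepwise and linear modes and assumes vector-valued properties like a translation. Rotations given as quaternions would normally require a spherical interpolation instead, and cubic splines are omitted. The function name `sample` is hypothetical.

```python
from bisect import bisect_right

def sample(times, values, t, interpolation="LINEAR"):
    """Return the animated value at time t (in seconds) from key frame data."""
    # Clamp to the first and last key frame outside the animated range.
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t) - 1  # index of the key frame at or before t
    if interpolation == "STEP":
        return values[i]
    u = (t - times[i]) / (times[i + 1] - times[i])
    return tuple(a + u * (b - a) for a, b in zip(values[i], values[i + 1]))

# Hypothetical key frames for the translation property of a node.
times = [0.0, 1.0, 2.0]
translations = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 4.0, 0.0)]
at_half = sample(times, translations, 0.5)
at_one_and_half_step = sample(times, translations, 1.5, "STEP")
```

A client that plays back the animation at a different speed only has to scale the time value that is passed to such a function.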

5.3.5 Materials
The core of the material definition consists of the properties that are used in the me-
tallic-roughness model of Physically Based Rendering (PBR): The base color, the
metalness and the roughness. The base color defines the main color of the object
surface, and is sometimes also referred to as albedo. The metalness determines the
reflectivity characteristics. The roughness of the object surface affects how matte or glossy
the object appears. Each of these properties may be set uniformly for the whole mate-
rial, or defined using a texture.
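A material with these core properties might be described as in the following sketch, written as Python dictionaries and serialized with the standard `json` module. The material names, the factor values, and the texture index are placeholders chosen for illustration.

```python
import json

# Hypothetical material with uniform values for the whole surface.
uniform_material = {
    "name": "red_plastic",
    "pbrMetallicRoughness": {
        "baseColorFactor": [0.8, 0.2, 0.2, 1.0],  # red, green, blue, alpha
        "metallicFactor": 0.0,  # non-metal
        "roughnessFactor": 0.6,  # fairly matte
    },
}

# Hypothetical material that takes its base color from a texture instead.
textured_material = {
    "name": "painted_metal",
    "pbrMetallicRoughness": {
        "baseColorTexture": {"index": 0},  # refers to the first texture
        "metallicFactor": 1.0,
        "roughnessFactor": 0.3,
    },
}

materials_json = json.dumps({"materials": [uniform_material, textured_material]}, indent=2)
```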
The materials in glTF may have a set of additional properties: A normal texture
allows adding the appearance of fine geometric details to the object surface, without
the need for a high resolution mesh. An occlusion texture can be defined, to emulate
self-shadowing effects of the object in concave areas. An emissive color or emissive
texture may be added to the material, which allows the object surface to appear
illuminated regardless of the lighting conditions. Transparency in materials can be
defined via the base color texture or the base color alpha factor, together with the
blend mode and an alpha cutoff value.
Figure 5.2 shows how several material properties, given as textures, are combined
to create the final, rendered image.

Figure 5.2. Illustration of material properties supported in glTF: The base color and metalness
and roughness textures are part of the core material definition for physically based rendering.
The emissive, occlusion and normal maps allow for additional details and thus an increased
realism of the rendered result.

5.3.6 Textures
Details about the textures that are the basis for the definition of materials are encoded
in the JSON part of a glTF asset. A texture definition consists of an image and a sam-
pler. The image data can be in PNG or JPG format, and can either be stored as an
external file that is referred to using a URI, or as part of the binary data of an asset.
The sampler further defines wrapping- and filtering options for the texture. Wrapping
modes allow textures to be clamped or repeated at the border. Filtering modes define
the magnification- and minification filters, as well as mipmapping behavior of the sam-
pled texture.
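The relationship between images, samplers, and textures might look like the following sketch. The file name is a placeholder. The numeric filter and wrapping values are the corresponding OpenGL enumeration constants, which glTF uses for these sampler properties.

```python
import json

LINEAR = 9729
LINEAR_MIPMAP_LINEAR = 9987
REPEAT = 10497
CLAMP_TO_EDGE = 33071

# Hypothetical texture setup: one external PNG image with one sampler.
texture_setup = {
    "images": [{"uri": "base_color.png"}],
    "samplers": [
        {
            "magFilter": LINEAR,
            "minFilter": LINEAR_MIPMAP_LINEAR,  # trilinear filtering with mipmaps
            "wrapS": REPEAT,  # repeat horizontally
            "wrapT": CLAMP_TO_EDGE,  # clamp vertically
        }
    ],
    "textures": [{"source": 0, "sampler": 0}],  # indices into the arrays above
}

texture_json = json.dumps(texture_setup, indent=2)
```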

5.3.7 Binary Data


The binary data of a glTF asset, like mesh geometry and vertex attributes, animation
keyframes and skinning information, are stored in external, little-endian binary files.
Different options exist for structuring the binary data: It is possible to combine multiple
binary data sets into a single binary file, or split them into multiple binary files. For
example, there may be one binary file that stores the vertex positions and triangle in-
dices of one mesh, and multiple other binary files storing the data for different anima-
tions.
The binary files are referred to by the JSON part of a glTF asset, using URIs. At
runtime, the contents of such a binary data file are loaded into a buffer. The JSON part
of the glTF asset contains further information about the structure of this binary data:
So-called accessors provide information about how the buffer data has to be inter-
preted. For example, the binary data of a glTF asset may consist of vertex positions
and triangle indices for a mesh. In this case, two accessors will exist: The first one
defines the range of the buffer that contains the positions as 3D floating point vectors.
The second accessor defines that a range of the buffer contains the indices, for example,
as 32-bit scalar integer values.
In general, an accessor defines the range of a buffer that contains the relevant data,
as well as the number of elements in this range. The elements can be defined to be
scalar values, 2D, 3D, or 4D vectors or matrices. The accessor also defines the com-
ponent type of the elements, which may be single-precision floating point values or 8-,
16-, or 32-bit integer values. An optional stride for the data elements even allows for
an interleaved storage of multiple vertex attributes.
Although a large variety of different data layouts can be represented with the accessor
concept of glTF, most of them are tailored for efficient rendering. This
means that in most cases, the parts of the buffers that correspond to one accessor, and
may, for example, represent a single vertex attribute of a mesh, can directly be uploaded
to the GPU using standard graphics API calls, without further preprocessing.
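How an accessor gives meaning to the raw bytes of a buffer can be sketched with the standard `struct` module. The example is a simplification: it passes the byte offset and stride directly as parameters, ignores the alignment padding that applies to some matrix types, and uses the hypothetical function name `read_accessor`. The component type numbers are the OpenGL constants for the respective data types.

```python
import struct

# Component type constants mapped to struct format characters.
COMPONENT_FORMATS = {5120: "b", 5121: "B", 5122: "h", 5123: "H", 5125: "I", 5126: "f"}
COMPONENT_COUNTS = {"SCALAR": 1, "VEC2": 2, "VEC3": 3, "VEC4": 4,
                    "MAT2": 4, "MAT3": 9, "MAT4": 16}

def read_accessor(buffer, byte_offset, count, component_type, element_type,
                  byte_stride=None):
    """Decode `count` elements from a little-endian buffer."""
    fmt = "<" + COMPONENT_FORMATS[component_type] * COMPONENT_COUNTS[element_type]
    element_size = struct.calcsize(fmt)
    stride = byte_stride or element_size  # tightly packed unless a stride is given
    return [struct.unpack_from(fmt, buffer, byte_offset + i * stride)
            for i in range(count)]

# Hypothetical buffer: three 32-bit indices followed by three 3D positions.
buffer = struct.pack("<3I", 0, 1, 2) + struct.pack("<9f", 0, 0, 0, 1, 0, 0, 0, 1, 0)
indices = read_accessor(buffer, 0, 3, 5125, "SCALAR")
positions = read_accessor(buffer, 12, 3, 5126, "VEC3")
```

A renderer would usually not decode the data like this, but pass the same offset, stride, and type information to the graphics API together with the unmodified buffer.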

5.4 Ecosystem
The core of the glTF ecosystem is maintained by the Khronos Group, and the main
entry point and an overview of the available resources is the Khronos glTF landing
page at https://www.khronos.org/gltf/. The most important resources are summarized
in the following sections.

5.4.1 Main Repository


A public repository with technical documentation and further resources for content
creators and developers is available at https://github.com/KhronosGroup/glTF. This
repository also hosts the specification, which consists of a textual description of the
concepts of glTF and a definition of the relevant terms and structure, as well as a JSON
schema that defines the JSON structure of a glTF asset. Additionally, it is used for issue
tracking and change management, and contains lists of converters and exporters, opti-
mizers, validators, editors and modeling tools, viewers and debugging tools, engines in
various programming languages, loaders and viewers, and applications that support
glTF.

5.4.2 Sample Models


An important part of the glTF ecosystem is the set of sample models that are collected at
https://github.com/KhronosGroup/glTF-Sample-Models. There are different catego-
ries of sample models, with different levels of complexity: A set of simple test models
allows the developers to gradually implement features, using each of the models as a
milestone indicator for the implementation progress. More complex models show the
combination of different features. These are models that are generally created with 3D
authoring tools and either exported as glTF, or converted from an interchange format
into glTF using one of the existing conversion tools. In these cases, the source files are
made available, and a screenshot shows the expected rendered result. This allows these
models to serve as additional integration tests for glTF loaders and viewers.

5.4.3 Asset Generator


In order to test the robustness of loaders and viewers and increase the test coverage for
glTF libraries, a tool for the automatic generation of glTF assets has been developed:
https://github.com/KhronosGroup/glTF-Asset-Generator. The generated assets cover
the ranges of values and combinations of properties that are allowed by the specifica-
tion, so that it is possible to detect whether a loader is feature-complete and complies
with the specification.

5.4.4 Tutorials
Tutorials for developers who want to implement loaders or viewers for glTF are avail-
able at https://github.com/KhronosGroup/glTF-Tutorials. These tutorials describe the
technical part of the specification and the underlying concepts. They are associated
with a set of sample models that show and explain each feature individually.

5.5 Tools and Workflows


Despite the variety of possible toolchains that aim at producing and delivering 3D as-
sets, common steps can be identified. After the creation or acquisition of the 3D con-
tent, the generated 3D asset has to be prepared for the delivery to the rendering client.
This involves optimization- and preprocessing steps. It has to be made sure that the
resulting asset is valid and can properly be displayed by all clients. At the client side,
the asset has to be imported, and finally rendered using the underlying graphics API.
These steps and the associated tools and workflows are illustrated in Figure 5.3.
The primary goal of glTF is to be a format for the final step of this pipeline—
namely, the efficient delivery of 3D assets to rendering clients. But the toolchain that
has developed around glTF further supports the content creators and the authoring
workflow. The following sections will give examples of the tasks that may be involved
in each of the pipeline steps, and present the corresponding tools from the glTF eco-
system and how they interact.

Figure 5.3. A schematic content creation pipeline. The role of glTF is that of a format that is
used for the final delivery to the client application that will import and render the 3D asset.
Preprocessing and optimization steps are supported via the glTF toolchain.

5.5.1 Creation
Authoring tools are one of the main sources of 3D content. Examples are Blender, 3D
Studio Max, Maya, Cinema 4D, or specialized applications, for example, for CAD or
character modeling. These tools allow artists, designers and engineers to define the geometry of
objects, their appearance in terms of the surface structure and material properties, their
behavior over time in the form of articulated animation, vertex skinning or morphing, and
how several objects are arranged in a virtual environment.
An increasing number of 3D authoring tools offer the option to directly export
assets as glTF. This is usually accomplished by exporter plugins for the authoring tools.
When no direct export from the authoring tool is possible, there are converters
from different input file formats to glTF. Most importantly, there are converters that
can convert the standard authoring exchange formats COLLADA and FBX into glTF.
Another important source of 3D content is digitization. The digitization
of products or cultural heritage artifacts is usually accomplished with 3D scanners.
The results are often stored as plain geometry formats like OBJ, and there is a variety
of converters for these file formats into glTF.
When existing assets are to be converted to glTF, possibly in a batch
process, these converters can often be run as standalone, command-line applications.
Many of the converters are also available as online services that allow converting the
assets simply via drag-and-drop.
The result of an export or conversion will usually be an asset in glTF format where
the internal structures still resemble the data structure of the creation tool or the source
file. Further preprocessing may then be applied to this asset, in order to optimize it for
the delivery to the client.

5.5.2 Optimization
In order to prepare a complex 3D asset for the delivery to the client, it may have to be
preprocessed or optimized. In some cases, the focus may be on optimization or
simplification of the geometry. This may include the use of extensions to integrate spe-
cial compression methods for the geometry data, or just simplifications of the scene
and node structure. In other cases, there may be the need to combine different geome-
tries or materials in one asset. For the final delivery to the client, it may be desirable to
convert the asset into a single file that contains all the elements of the asset, including
geometry data and textures. The main goal here is to have the complete asset, self-
contained, so that it may be downloaded without issuing requests to external resources.
The official tool for most of these optimization and conversion tasks is the glTF
pipeline tool from Analytical Graphics Inc., which is available at https://github.com/
AnalyticalGraphicsInc/gltf-pipeline. It supports the optimization of the mesh structure
using a special mesh compression extension, as well as the conversion between default
glTF assets and binary glTF files.
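To illustrate what a conversion into a single self-contained file involves, the following sketch packs a JSON part and one binary buffer into the binary glTF container layout: a 12-byte header followed by a JSON chunk and a binary chunk, each aligned to four bytes. This is not the implementation of the pipeline tool. The function name `pack_glb` is hypothetical, and the sketch assumes that the JSON already refers to the embedded buffer correctly.

```python
import json
import struct

GLB_MAGIC = 0x46546C67   # the ASCII string "glTF"
CHUNK_JSON = 0x4E4F534A  # the ASCII string "JSON"
CHUNK_BIN = 0x004E4942   # the ASCII string "BIN" followed by a zero byte

def _pad(data, pad_byte):
    """Pad the data to a multiple of four bytes."""
    return data + pad_byte * (-len(data) % 4)

def pack_glb(gltf, binary):
    """Combine the JSON part and one binary buffer into a single byte string."""
    json_chunk = _pad(json.dumps(gltf, separators=(",", ":")).encode("utf-8"), b" ")
    bin_chunk = _pad(binary, b"\x00")
    total_length = 12 + 8 + len(json_chunk) + 8 + len(bin_chunk)
    return b"".join([
        struct.pack("<III", GLB_MAGIC, 2, total_length),  # magic, version, length
        struct.pack("<II", len(json_chunk), CHUNK_JSON), json_chunk,
        struct.pack("<II", len(bin_chunk), CHUNK_BIN), bin_chunk,
    ])

# Hypothetical minimal asset with three bytes of binary data.
glb = pack_glb({"asset": {"version": "2.0"}}, b"\x01\x02\x03")
```

Because the whole asset is contained in one byte string, a client can fetch it with a single request.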

5.5.3 Validation
An important criterion for the broad acceptance of glTF is the robustness and reliability
of the tools that are used to process the 3D assets. In order to make sure that
exporters and converters are generating files that are syntactically and structurally
valid, the files may be validated using the Khronos glTF validator that is available at
https://github.com/KhronosGroup/glTF-Validator.
The validator can be used as an online drag-and-drop tool, as a standalone application,
or as a library. The online tool allows a given glTF asset to be validated quickly
and easily, without the need to install any software. The standalone validator application
may be used to perform individual or batch validation of generated assets,
for example, before they are uploaded to a content distribution network to be delivered
to the clients. When used as a library, the validator can be integrated by developers into
their own glTF tools and libraries. Input files can then be validated before passing them to
downstream processing. The validation allows developers to reduce the internal error
handling to a minimum, because they can rely on all the assertions that are made by the
specification for valid glTF assets.
In all cases, the validator performs a validation of the JSON part of the asset, and
of the binary data, to make sure that the files containing the binary data match the
intended usage in terms of size and data layout. The validation of the JSON part is
based on a JSON schema that is part of the specification, and defines the JSON struc-
ture as well as ranges of possible values for the JSON properties. Additional constraints
that are established by the specification are checked and violations of these constraints
are reported.
The validator generates a report of the validation process, in JSON format. This
report lists warnings and errors that have been detected, and contains detailed information
about the source of the problem, to help the developer locate and correct the
issue. Additionally, it can be processed automatically. This makes it possible to use the
validator as part of automated unit tests for exporters.
A tool for the validation and inspection of individual glTF assets is available in
the form of a plugin for the Microsoft Visual Studio Code editor (https://code.visualstudio.com/).
The plugin is available at https://github.com/AnalyticalGraphicsInc/gltf-vscode
and uses the official Khronos glTF validator for validating glTF assets. Additionally, it
allows developers to preview the asset, inspect the textures and binary data, offers auto-
completion for manual editing of the JSON part of an asset, and can convert between
glTF and binary glTF assets.

5.5.4 Rendering
One of the main goals of glTF is to support a large variety of rendering clients. Therefore,
many options for importing and rendering glTF assets exist. The tables at
https://www.khronos.org/gltf/ list various loaders, engines and viewers that support glTF.
Pure loader libraries represent a glTF asset as an in-memory data structure, and
allow further manipulation of the asset structure, to translate the asset into the internal
data structures of a renderer, or write a manipulated asset to a file. The libraries cover
all major programming languages, including JavaScript, C/C++, Java, Objective-C, Go,
Rust, Haxe, Ada, Swift and TypeScript.
An increasing number of rendering engines have built-in support for glTF. When
loading the asset, they build the internal rendering data structures directly. Rendering
engines that support glTF are available for different programming languages, and based
on different graphics APIs, including WebGL, OpenGL, Vulkan, Metal and DirectX.
Standalone viewers offer the option to quickly inspect a given glTF asset. These
viewers are often intended for end-users, but some of them also offer functions for
developers to analyze details of the model—for example, to visualize the node structure
or animation timeline of a glTF asset.

PBR Reference Implementation


In order to achieve a consistent appearance of the rendered assets for all clients, it is
crucial to precisely define the rendering equations that govern the computations in the
vertex and fragment shaders of the PBR implementation. In order to help engine de-
velopers to implement the rendering equations properly, an open-source WebGL-based
reference PBR implementation has been published at https://github.com/KhronosGroup/glTF-WebGL-PBR.
The repository contains the shader source code as well as
definitions of the lighting equations and references to documents that explain the equa-
tions in more detail. Figure 5.4 shows a collection of images created from the online
preview of the PBR implementation.
The preview allows selecting different example models and lighting conditions.
Each element of the lighting equation may be selected with the mouse, to show the
contribution of the respective element to the final image.

Figure 5.4. Screenshots from the web-based open-source PBR reference implementation.

5.6 Extensions
Extensions are an important concept for the evolution of the glTF standard, in order to
adapt to future requirements. The core of the specification allows for extensions to be
defined in the JSON part of an asset and to refer to custom binary data like vertex
attributes. Vendors can implement their own extensions in order to support custom features
that add new properties to the JSON objects. The extensions can then be proposed to
the Khronos Group. The group members can discuss the new features and the exten-
sion specification, and eventually promote the extension to become an official Khronos
extension. Depending on the demand and adoption of the extension, it may eventually
be integrated into the core specification as part of a new major release.
The extensions that are used in a certain glTF asset can be queried by the viewer
or loader. Extensions can be optional, in which case the viewer may omit the features
that are offered by the extension or, depending on the functionality of the extension,
provide a fallback behavior. Other extensions can be declared to be required for
properly rendering the assets, and viewers that do not support such a required extension
can report this to the user. In general, viewers are encouraged to support the official
Khronos extensions, to foster the adoption of the extensions for future versions of
the standard.
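The way a loader might query the extensions of an asset can be sketched as follows. The example relies on the top-level `extensionsUsed` and `extensionsRequired` lists of the JSON part. The function name `check_extensions` and the set of supported extensions are hypothetical.

```python
def check_extensions(gltf, supported):
    """Return (missing required extensions, unsupported optional extensions)."""
    used = set(gltf.get("extensionsUsed", []))
    required = set(gltf.get("extensionsRequired", []))
    missing_required = sorted(required - supported)
    unsupported_optional = sorted((used - required) - supported)
    return missing_required, unsupported_optional

# Hypothetical asset that requires Draco and optionally uses unlit materials.
asset = {
    "extensionsUsed": ["KHR_draco_mesh_compression", "KHR_materials_unlit"],
    "extensionsRequired": ["KHR_draco_mesh_compression"],
}
missing, optional = check_extensions(asset, supported={"KHR_texture_transform"})
```

A viewer could report the first list to the user as an error, and treat the second list as a hint that some features are omitted or replaced with a fallback.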
Several extensions have already been proposed and become official Khronos extensions.
Their goals and functionality are summarized in the following sections.
5.6.1 Draco Mesh Compression


The mesh representation in the core glTF standard is already very compact. But with
the Khronos KHR_draco_mesh_compression extension, the mesh sizes can be
reduced even further. Draco is a library for compressing and decompressing 3D meshes. It
is developed by Google, and available at https://github.com/google/draco. It allows for
a compression of the topology data (i.e., the vertex indices) as well as a quantization
of the vertex attributes like positions, texture coordinates and colors.
The compression ratios that can be achieved depend on the structure of the model
and the quantization of the vertex attributes. A representative set of 19 of the official
sample models has been compressed using the gltf-pipeline tool. The binary data
(BIN files) of these models could on average be compressed down to 24% of their
original size. Models that mainly consist of complex geometry could even be reduced
to 10% of their original size. Animations are not yet compressed with Draco by default,
so the models that involved animations achieved a lower compression ratio on average,
but an extension of the Draco compression for animation data is already being
developed.
The benefits of the Draco compression extension have already been shown in real-
world applications: Cesium from Analytical Graphics Inc. is a geo-information system
that is used for streaming large amounts of geometry data. The size of a complex data
set with 738 MB of geometry data could be reduced to 149 MB using the Draco com-
pression. At the same time, the time for loading the data set was reduced from 18.92
seconds to 10.55 seconds, by performing the decompression of the compressed geometry
in parallel using Web Workers and the WebAssembly-based Draco library.

5.6.2 Unlit Materials


The material model of glTF offers support for a wide range of different materials, based
on a common standard for physically based rendering. But there may be use cases
where physically based rendering is not desired. This may be due to resource con-
straints on the client device or for aesthetic reasons, or when the asset was acquired
through photogrammetry, where the textures of 3D models are based on photographs
that already contain lighting and reflections. The KHR_materials_unlit extension
therefore allows the definition of materials that are unaffected by the standard physi-
cally based rendering process, and are always rendered with their base color.

5.6.3 Technique-Based Rendering


The standard material model of glTF and the extension for unlit materials cover most
use cases that appear in practical applications. But when an even finer control over the
rendering process is required, or existing, customized rendering methods should be
used for an asset, the KHR_techniques_webgl extension may be used. This extension
allows defining the actual vertex- and fragment shaders that are used for rendering as
GLSL programs.
The connection between the geometry and texture data of the asset and the shaders
is established by defining a rendering technique. Such a technique associates object
properties like vertex positions or texture coordinates with the attributes of the shaders.
A material in this extension is therefore an instance of a technique, with a certain map-
ping for these properties, or with specific values for the uniform variables of the
shaders.

5.6.4 Specular-Glossiness PBR Materials


The default model for Physically Based Rendering (PBR) in glTF is the metallic-roughness
model that defines the properties of the material surface in terms of metalness and
roughness. An alternative representation of a PBR material is the specular-glossiness
model. This model allows a slightly broader range of materials to be represented, but
may be more resource consuming than the standard model.
The KHR_materials_pbrSpecularGlossiness extension adds support for the
specular-glossiness model to glTF. The specification of the extension details use-cases
and best practices for deciding which model to use, as well as conversion rules between
both models.

5.6.5 Texture Transforms


An important optimization for the resource usage in complex 3D scenes is to pack
multiple textures into a single texture atlas. To enable this, it must be possible to exactly
tell the rendering engine which part of the single texture is to be mapped on one par-
ticular object. This can be achieved by mapping the texture coordinates of the object
into the appropriate window of the texture atlas.
Depending on the creation process of the asset, the texture coordinates may al-
ready refer to a texture atlas, which can then directly be represented in glTF, without
any extension. But in other cases, a texture atlas may be generated in a post-processing
step. In this case, it can be desirable to retain the original binary geometry data, and
just transform the texture coordinates on the fly to point to the right part of the texture
atlas. The KHR_texture_transform extension therefore allows adding offset, scale,
and rotation information to the individual texture references of the JSON part of a glTF
asset, so that the proper texture segment may be looked up in the actual texture image.
This also allows reusing the same mesh, with the same texture coordinates, to render
multiple instances of this mesh with different texture transforms, each transforming the
coordinates to point to different parts of a single texture atlas.
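
As a rough illustration of the atlas use case, the following Python sketch applies an offset, a rotation, and a scale to a single texture coordinate. The function name `transform_uv`, the order of operations (scale, then rotation, then offset), and in particular the sign convention of the rotation are assumptions made for this example; the extension specification remains the authority on the exact matrix.

```python
import math


def transform_uv(uv, offset=(0.0, 0.0), rotation=0.0, scale=(1.0, 1.0)):
    """Map one (u, v) pair into a window of a texture atlas.

    Assumed order of operations: scale, then rotation, then offset.
    The rotation sign convention is an assumption of this sketch.
    """
    # Scale the coordinate down to the size of the atlas window.
    u = uv[0] * scale[0]
    v = uv[1] * scale[1]

    # Rotate around the origin of the texture coordinate space.
    c, s = math.cos(rotation), math.sin(rotation)
    u, v = c * u - s * v, s * u + c * v

    # Move the window to its position inside the atlas.
    return (u + offset[0], v + offset[1])


# Hypothetical 2x2 atlas: select the tile that starts at u = 0.5, v = 0.0.
tile = {"offset": (0.5, 0.0), "scale": (0.5, 0.5)}
corner = transform_uv((1.0, 1.0), **tile)
```

With this kind of mapping, the original texture coordinates in the range [0, 1] stay untouched in the binary geometry data, and only the small per-texture transform changes from instance to instance.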

5.7 Application support


An increasing number of applications natively support glTF and use glTF as their
standard format for the representation of 3D assets.

CesiumJS from Analytical Graphics, Inc. (https://cesiumjs.org) and Plex.Earth
from Plexscape (https://plexearth.com/) allow embedding 3D models in environments
that are based on satellite imagery and terrains. City planners, architects and engineers
can now present and visualize their models and constructions in a realistic virtual
world. Archilogic (https://archilogic.com/) focusses on architecture and creating 3D
floor plans that can be explored by customers on the web.
Applications that support 3D content creators also use glTF internally. Substance
Painter from Allegorithmic (https://www.allegorithmic.com) is a 3D painting software
for texturing 3D assets that are then exported as glTF with PBR materials. Marmoset
(https://marmoset.co/) is a high-quality real-time rendering suite that also features
glTF export.
New content creation applications are made available for artists. In addition to
professional 3D authoring tools, tools like Paint3D from Microsoft
(https://www.microsoft.com) allow everybody to create 3D content and share it in the
form of glTF assets.
The results of the content creation process have to be made available to other art-
ists and users. Repositories like Sketchfab (https://sketchfab.com/) or Remix3D from
Microsoft (https://www.remix3d.com) provide millions of models, searchable and in-
dexed, as glTF assets.
The support for glTF does not only affect existing 3D modeling tools or 3D visu-
alization software. The infrastructure that has developed around glTF makes 3D con-
tent more easily accessible for everybody: WordPress (https://wordpress.org) has added
support for glTF as a media type, to allow embedding 3D content as interactive scene
parts of a site. Facebook (https://www.facebook.com) uses glTF as the basis for 3D
posts, offering content creators the opportunity to share their assets with a large
community.

5.8 Conclusion
A versatile and reliable transmission format is the key for opening up new application
areas for 3D content. High-quality rendering is now possible on many client devices—
including web browsers—and the increasing demand for 3D content on the web is one
of the main factors for the momentum that glTF has gained since its initial release. The
design choices and features have proven to meet the requirements of content creators
and users. The ecosystem and tools that have been developed around glTF
support all users in making 3D content available in a large variety of applications.
Being an open standard, glTF is maintained and developed by a large, active community,
in order to keep up with the latest technical developments and to adapt to future
requirements.
V
Real-time
Ray Tracing

Welcome to the Real-time Ray Tracing section of GPU Zen’s second volume. With the
recent revolution in real-time ray tracing, this section presents two novel methods that
utilize ray tracing hardware to achieve new visual effects and improve rendered image
quality.
The section starts with an article from Holger Gruen on efficient rendering of sin-
gle-bounce caustics, such as water reflections and refractions. The article presents an
elegant way to track the light energy compression induced by refractive geometry and
importance sample rays in bright and visible parts of caustics followed by denoising.
The method performs well enough in practice for real-time applications and delivers
high-quality caustics.
The second article from Rahul Sathe et al. is on achieving high quality antialiasing
by utilizing new hardware features, such as ray tracing and conservative rasterization.
Conservative rasterization is used to detect partially covered pixels, and ray tracing
is then used to evaluate the subsamples of these pixels. While providing a high-quality
antialiased image even with thin geometry, the method can be used for hard and
detailed geometry.
All articles come with demos that demonstrate the methods in real-time.
This is just the beginning of a long path to make the new real-time ray tracing
practical. I hope the content of this section will inspire you to apply ray tracing more
widely in your games, and please do not hesitate to share your methods with us!

—Anton Kaplanyan

1

Real-Time Ray-Traced
One-Bounce Caustics
Holger Gruen

1.1 Introduction
This chapter investigates how to make use of the DirectX 12 real-time ray tracing API
DXR to simplify and innovate on current methods for rendering real-time caustics. We
assume that the necessary DXR bounding volume acceleration structures have already
been implemented in a game engine to enable other ray tracing based algorithms.
The use of DXR has several benefits over existing solutions with regard to finding
the specular intersections of refracted and reflected light rays with a dynamic 3D scene.
This is depicted in Figure 1.1(a)—the yellow light rays from the sun interact with the
refracting surface and ‘turn’ into the refracted red light rays that shed light on the brown
receiving geometry. Figure 1.1(b) shows the reflected light rays that shed light on the
purple receiving geometry.
Many games avoid finding intersections with the full game scene and instead only compute
intersections with a ground plane or the inside of a 3D box. Consequently, the
resulting caustics cannot have shadows that are the result of many incoherent
refracted/reflected light rays. Typically, shadows are then generated from a shadow map,
which only produces shadows from a single point of view, e.g., from the point of view
of the light.
For a scene with a refractive water interface (not displayed), Figure 1.2 shows a
shadow produced from a shadow map. As a result, the silhouette of the shadow of the
lid of the box is straight. Figure 1.3 shows shadows that are the result of incoherent
refracted light rays, and thus the shadow of the lid of the box is bent and deforms in a
physically correct way as the surface moves.


Figure 1.1. Refracted/reflected rays hit the opaque scene: (a) refracted rays; (b) reflected rays.

Figure 1.2. Shadow mapped shadows for caustics.



Figure 1.3. Shadows from ray-traced caustics.

1.2 Previous Work


There are a number of publications (see Bibliography) that describe how to get beyond
the limits shown in Figure 1.1 through the use of various techniques, like approximate
ray tracing algorithms [Szirmay-Kalos et al. 2005].
Usually these methods need to first render a ‘Caustics Map’ [Shah et al. 2007], an
image with the refractive/reflective geometry. In the context of this chapter, this geom-
etry is a single interface. The algorithm for rendering caustics described in this chapter
does not need this intermediate rendering step. Instead, the ray generation shader di-
rectly generates refracted rays on the triangles of the surface. This conceptually moves
the approach described below more towards photon mapping (see Jensen [2001] and
Wang et al. [2009]).
At a glance, the following algorithms have been proposed in the past:

1. Use image space ray marching to find the approximate intersection.

a. March the primary depth buffer or the shadow map depth buffer in the
pixel shader to find intersections. The problem with this approach is that
refracted/reflected light rays may be occluded in both the primary view and
the view from light as shown in Figure 1.4. It is possible to render and
march a set of images utilizing:

i. Multiple depth layers of the primary depth buffer and the shadow map

Figure 1.4. Issues with shadow map based visibility for caustics.

ii. Multiple viewpoints of the primary depth buffer and the shadow map.

iii. Distance impostors [Szirmay-Kalos et al. 2005].

Note that the runtime cost, the memory consumption, and the implementation
complexity of these methods can be significantly higher than those of the DXR
based approach described below.

2. Voxelize the receiving geometry and march the voxelized 3D grid. This can
yield good results, depending on the resolution of the grid. Voxelization is not
a cheap operation and is the rasterization-side equivalent of keeping a bounding
volume hierarchy up to date. Ray marching the 3D grid usually also has a high
or even unacceptable performance cost. Overall, the implementation complexity of
voxelization methods is higher than that of the DXR based approach described
below.

1.3 Algorithm Overview


The following five steps are used to render ray-traced caustics:

1. Compute a refracted/reflected mesh. For each vertex of the mesh that represents
the geometry of the refractive/reflective interface, a refracted/reflected ray
R is constructed. This ray starts at the interface's vertex and points along the
refracted/reflected direction of incident light. The resulting refracted/reflected
mesh has the same number of vertices and the same triangle count as the
refractive/reflective interface. The positions of its vertices are computed by
intersecting each ray R with the receiving geometry. Figure 1.5 depicts this process
for refracted rays. Every blue surface triangle generates a red triangle in
the refracted mesh. Please note that the refracted/reflected mesh does not need
to be fine enough to follow every detail of the receiving geometry. It only needs
to be detailed enough to yield a compression ratio of sufficient quality,
as described in Step 2. This step can introduce an error when
the surface is not detailed enough. It is therefore necessary to refine the surface
if errors are detected.

Figure 1.5. Compute a refracted mesh.

2. Compute a compression ratio for each refracted/reflected triangle. As Figure 1.6
shows, the refraction of the light rays can either focus the light within a
triangular beam, or it can do the opposite. As a result, triangles in the refracted
mesh can either have bigger or smaller area than their respective interface triangle.
The same is true for the case of reflection. For each triangle, the ratio is
computed as follows and is stored in a buffer:

\[
\text{compression ratio} = \frac{\operatorname{area}(\text{interfaceTriangle})}{\operatorname{area}(\text{refractedTriangle})} \tag{1}
\]

The same equation is used to compute and store the compression ratio for reflected
triangles.

Figure 1.6.

3. Generate refracted/reflected light rays at random locations. For each interface
triangle, a number of random world-space positions (samples) are generated
in the triangle. The higher the refractive/reflective compression ratio, the
more samples get generated. The algorithm to find an adequate sample count
for a given compression ratio also considers the on-screen size of the refracted/reflected
triangle. This means that more samples get generated for triangles
that are close to the viewer. The idea to use the compression ratio from Equation
1 is not new and has been described in the past (see, e.g., Golias and Jensen
[2001]).

a. For each random world-space position that can be seen from the light
source, a ray (photon) is formed starting at this position and along the in-
terpolated refracted/reflected direction. One easy solution to find out if a
sample is shadowed (can’t be seen by the light source) is to use a shadow
map. It is of course also possible to shoot a ray towards the light source.

b. Compute the intersection of each such ray with the receiving scene
geometry.

c. Project each intersection to screen space. If the position is on screen and
corresponds to the frontmost pixel at that position, use InterlockedAdd()
to accumulate a brightness value at that screen location. The simplest way to
find out whether the intersection corresponds to the frontmost pixel on screen
is a depth test with a certain tolerance. Other options are to also consider
the g-buffer normal of the on-screen pixel and/or to scale the brightness value
by a function of the difference in depth. It is also possible to render a unique
triangle ID into the g-buffer and to compare this ID with the primitive and
instance IDs that are available in the DXR hit shaders. The brightness value that
gets accumulated can be scaled by a number of factors, including the compression
ratio and/or the amount of light that has been absorbed along the distance
the ray travels through a refractive medium like water (see, e.g., Baboud
and Décoret [2006]).

Figure 1.7. Sample generation based on screen-space density.

Figure 1.7 depicts the generation of samples and their corresponding refracted
screen-space positions. It is possible to skip the generation of samples for in-
terface triangles that are guaranteed not to generate any on-screen intersections.
See below in ‘Implementation Details’ for a description.

4. Denoise the buffer that results from Step 3c to arrive at a denoised and
smooth caustics buffer.

5. Use the caustics buffer to shed light on the receiving scene.
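
Steps 1 and 2 can be sketched on the CPU as follows. This is only an illustration in Python, not the demo's shader code: the helper names (`refract`, `triangle_area`, `compression_ratio`) and the small `eps` guard against degenerate refracted triangles are assumptions of this sketch. The `refract` helper follows the usual shading-language convention of taking a normalized incident direction, a normalized surface normal, and the ratio of refractive indices.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def refract(incident, normal, eta):
    """Refracted direction for unit vectors; None on total internal reflection."""
    d = dot(normal, incident)
    k = 1.0 - eta * eta * (1.0 - d * d)
    if k < 0.0:
        return None
    f = eta * d + math.sqrt(k)
    return tuple(eta * i - f * n for i, n in zip(incident, normal))


def triangle_area(p0, p1, p2):
    """Area of a 3D triangle from the cross product of two edges."""
    e1 = tuple(b - a for a, b in zip(p0, p1))
    e2 = tuple(b - a for a, b in zip(p0, p2))
    cx = e1[1] * e2[2] - e1[2] * e2[1]
    cy = e1[2] * e2[0] - e1[0] * e2[2]
    cz = e1[0] * e2[1] - e1[1] * e2[0]
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def compression_ratio(interface_tri, refracted_tri, eps=1e-12):
    """Equation 1: area(interfaceTriangle) / area(refractedTriangle)."""
    # eps is a hypothetical guard for refracted triangles that collapse.
    return triangle_area(*interface_tri) / max(triangle_area(*refracted_tri), eps)


# Light travelling straight down into water (eta = 1 / 1.33) is not bent.
down = refract((0.0, -1.0, 0.0), (0.0, 1.0, 0.0), 1.0 / 1.33)

# A refracted triangle with a quarter of the interface area focuses the light.
interface = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
refracted = ((0.0, -1.0, 0.0), (0.5, -1.0, 0.0), (0.0, -1.0, 0.5))
ratio = compression_ratio(interface, refracted)
```

A ratio above 1 indicates that the light is focused, so such a triangle would receive more samples in Step 3; a ratio below 1 indicates that the light is spread out.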

The following section describes implementation details of the demo that accom-
panies this chapter.

1.4 Implementation Details


As described above, the DirectX 12 DXR API is used to implement all ray tracing workloads
for Steps 1 and 3 from Section 1.3 above. The provided example renders refracted
caustics in an underwater scene.

• For Step 1, we call DispatchRays() with each thread tracing exactly one refracted/reflected
ray into the scene. The resulting refracted/reflected mesh is
written to a buffer that is read by later steps and uses the same index buffer as
the original interface mesh.

• Step 2 is implemented as a compute shader.

• For Step 3, we call DispatchRays() with enough threads to be able to trace
the maximum number of rays for every interface triangle within an enlarged
view frustum. The frustum is enlarged to also capture refracted/reflected
rays that enter the normal view frustum from outside. The
ray generation shader culls a thread (i.e., does not call TraceRay()) if that thread
has no sample to process.
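
The accumulation from Step 3c can be illustrated with a small CPU sketch in Python. Everything about the camera here is an assumption made for the example (a pinhole camera at the origin looking along +z, a focal length given in pixels, and a depth buffer that stores view-space z), as are the name `accumulate_hit` and the fixed-point scale. The integer addition stands in for the InterlockedAdd() call of the actual shader.

```python
import math

FIXED_POINT_SCALE = 256  # hypothetical scale for integer accumulation


def accumulate_hit(hit_view, brightness, depth_buffer, caustics,
                   focal_px, tolerance=0.01):
    """Add brightness at the pixel a view-space hit projects to.

    Returns True if the hit was on screen and passed the depth test.
    """
    height, width = len(depth_buffer), len(depth_buffer[0])
    x, y, z = hit_view
    if z <= 0.0:
        return False  # behind the camera

    # Perspective projection onto pixel coordinates (assumed pinhole camera).
    px = math.floor(0.5 * width + focal_px * x / z)
    py = math.floor(0.5 * height - focal_px * y / z)
    if not (0 <= px < width and 0 <= py < height):
        return False  # off screen

    # Depth test with a relative tolerance: only the frontmost surface
    # visible in this pixel may receive light.
    if abs(depth_buffer[py][px] - z) > tolerance * z:
        return False

    # Stand-in for InterlockedAdd() on an integer caustics buffer.
    caustics[py][px] += int(brightness * FIXED_POINT_SCALE)
    return True


depth = [[10.0] * 4 for _ in range(4)]
caustics = [[0] * 4 for _ in range(4)]
visible = accumulate_hit((0.0, 0.0, 10.0), 0.5, depth, caustics, focal_px=2.0)
hidden = accumulate_hit((0.0, 0.0, 5.0), 0.5, depth, caustics, focal_px=2.0)
```

Here the first hit matches the depth stored for its pixel and is accumulated, while the second hit lies in front of the stored depth by more than the tolerance and is rejected.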

As depicted in Listing 1.1, a simple seed-based random number generator (see
http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/) is used
to generate a set of ‘random’ barycentric positions inside the current interface
triangle. The current implementation uses the triangle ID/index as the RandSeed. This
choice means that the generated noise is stable across a triangle. Should the number of
samples for a triangle change between frames, samples are only removed or
added; the positions of the remaining samples for a given triangle never change.

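float3 RandBarycentrics(inout uint RandSeed)
{
    // RandFloat() is assumed to advance RandSeed and return a value in [0, 1).
    float s = RandFloat(RandSeed);
    float t = RandFloat(RandSeed);

    // Reflect points that fall outside the triangle back into it.
    [flatten] if (s + t > 1.0f)
    {
        s = 1.0f - s;
        t = 1.0f - t;
    }

    return float3(1.0f - s - t, s, t);
}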

Listing 1.1. Generate a ‘random’ position inside a triangle.
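
The stability property described above can be checked with a CPU-side sketch in Python. The generator used here (a Wang hash to scramble the seed, followed by a 32-bit xorshift) is one of the options discussed in the referenced blog post and is only an assumption about what RandFloat() might look like; the names `Rng` and `samples_for_triangle` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def wang_hash(seed):
    """Scramble a 32-bit seed so that neighboring IDs decorrelate."""
    seed = ((seed ^ 61) ^ (seed >> 16)) & MASK32
    seed = (seed * 9) & MASK32
    seed ^= seed >> 4
    seed = (seed * 0x27D4EB2D) & MASK32
    seed ^= seed >> 15
    return seed


class Rng:
    """32-bit xorshift generator seeded from a hashed triangle ID."""

    def __init__(self, seed):
        # xorshift must not start from zero, hence the fallback to 1.
        self.state = wang_hash(seed) or 1

    def rand_float(self):
        s = self.state
        s ^= (s << 13) & MASK32
        s ^= s >> 17
        s ^= (s << 5) & MASK32
        self.state = s
        return s / 2.0 ** 32  # value in [0, 1)


def rand_barycentrics(rng):
    s = rng.rand_float()
    t = rng.rand_float()
    if s + t > 1.0:
        # Reflect the point back into the triangle.
        s, t = 1.0 - s, 1.0 - t
    return (1.0 - s - t, s, t)


def samples_for_triangle(triangle_id, count):
    """Barycentric samples for one triangle, seeded by its ID."""
    rng = Rng(triangle_id)
    return [rand_barycentrics(rng) for _ in range(count)]


few = samples_for_triangle(7, 4)
many = samples_for_triangle(7, 8)
```

Because the sequence only depends on the triangle ID, `few` is a prefix of `many`: raising the sample count appends samples without moving the existing ones.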

As described above, the number of samples generated is also based on the screen-
space size of the triangle of the refracted/reflected mesh. As this refracted/reflected
triangle may be off-screen even if some of the generated rays produce on-screen caus-
tics, the maximum screen-space size of the refracted triangle and three additional tri-
angles is used.
These three triangles are formed at the intersection points of the three refracted/reflected
rays. This is depicted in Figure 1.8: the blue triangle is at the interface, and the
three yellow rays show scene intersections at different distances.
In order to speed up Step 4, the buffer accumulating brightness is implemented as
a half-res buffer. This speeds up denoising significantly. Denoising is done through a
set of iterated cross-bilateral blurring steps that account for differences in view-space
depth, normal and view-space positions.
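
To illustrate the idea behind such a blur, the following Python sketch runs a single cross-bilateral pass over a one-dimensional row of pixels. It is deliberately reduced: only the depth difference is used as the guide signal, whereas the demo also accounts for normals and view-space positions, and the Gaussian and exponential weights as well as the parameter names are assumptions of this sketch.

```python
import math


def cross_bilateral_1d(signal, depth, radius=2, sigma_s=1.5, sigma_z=0.1):
    """One cross-bilateral blur pass guided by a depth buffer."""
    n = len(signal)
    out = []
    for i in range(n):
        total = 0.0
        weight_sum = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            # Spatial falloff with distance in pixels.
            w_s = math.exp(-((i - j) ** 2) / (2.0 * sigma_s ** 2))
            # Guide term: neighbors at a different depth contribute little.
            w_z = math.exp(-abs(depth[i] - depth[j]) / sigma_z)
            total += w_s * w_z * signal[j]
            weight_sum += w_s * w_z
        out.append(total / weight_sum)
    return out


# Noisy caustics on two surfaces separated by a depth discontinuity.
noisy = [1.0, 0.0, 1.0, 0.0, 5.0, 9.0, 5.0, 9.0]
depth = [1.0, 1.0, 1.0, 1.0, 10.0, 10.0, 10.0, 10.0]
smooth = cross_bilateral_1d(noisy, depth)
```

The noise within each surface is averaged out, while the brightness of one surface does not bleed across the depth edge onto the other.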

Figure 1.8. Three additional triangles for screen-space coverage estimation.

1.5 Results
Figure 1.9 shows the result of applying the raw caustics buffer without denoising. Fig-
ure 1.10 shows a similar frame with denoising enabled. Finally, Figure 1.11 shows the
algorithm running with slightly different lighting conditions.
The following performance numbers have been measured on an NVIDIA RTX
2080 Ti board running at a resolution of 1920 × 1080 utilizing the official version of
DXR that is part of the DirectX API:

Workload (refracted caustics only)                        Time (ms)
DispatchRays() calls including accumulative scattering    0.8
Denoise                                                   1.0

Adding reflected caustics adds approximately another 0.8 ms to the run time of the technique.
Please note that the denoise step does not get more expensive unless a separate
denoise step is required for the reflected caustics. These numbers show that physically
inspired ray-traced caustics are fast enough to be brought into scenes of real-time
games on DXR capable hardware.

1.6 Future work


In the future it would be interesting to investigate the following extensions to the
algorithm presented above.

Multi-bounce caustics. The algorithm described above stops when a refracted ray
hits the scene. With DXR it is possible to generate one or more secondary rays in the
ray generation shader. In this case, for each secondary ray that generates an intersection
with the scene, it is necessary to accumulate brightness at the projected screen space
position of the intersection.

Volumetric caustics. Most algorithms for volumetric caustics (see, e.g., Hu et al.
[2010] and Liktor and Dachsbacher [2011]) make use of the regular structure of a
Caustics Map. It would be interesting to investigate how volumetric caustics can be
implemented based on the random ray samples that the current implementation uses.

1.7 Demo
A demo that can be run on Nvidia GPUs showcasing the proposed technique will be
provided.

Figure 1.9. Raw caustics buffer.



Figure 1.10. Denoised caustics.

Figure 1.11. Different lighting conditions.



Bibliography
BABOUD, L. AND DÉCORET, X. 2006. Realistic water volumes in real-time. In Proceedings of
Eurographics conference on Natural Phenomena, pp. 25–32.
GOLIAS, R. AND JENSEN, L. 2001. Deep Water Animation and Rendering. URL: https://www.
gamasutra.com/view/feature/131445/deep_water_animation_and_rendering.php.
JENSEN, H. 2001. Realistic Image Synthesis Using Photon Mapping. A K Peters.
HU, W., DONG, Z., IHRKE, I., GROSCH, T., YUAN, G., AND SEIDEL, H. 2010. Interactive volume
caustics in single-scattering media. In Proceedings of the Symposium on Interactive 3D
Graphics and Games, pp. 109–117.
LIKTOR, G. AND DACHSBACHER, C. 2011. Real-time volume caustics with adaptive beam trac-
ing. In Symposium on Interactive 3D Graphics and Games, pp. 47–54.
SHAH, M., KONTTINEN, J., AND PATTANAIK, S. 2007. Caustics Mapping: An Image-Space Tech-
nique for Real-Time Caustics. In IEEE Transactions on Visualization and Computer Graphics,
13:2, pp. 272–280.
SZIRMAY-KALOS, L., ASZÓDI, B., LAZÁNYI, I., AND PREMECZ, M. 2005. Approximate Ray-Trac-
ing on the GPU with Distance Impostors. In Computer Graphics forum, 24:3.
WANG, R., ZHOU, K., PAN, M., AND BAO, H. 2009. An efficient GPU-based approach for inter-
active global illumination. In ACM Transactions on Graphics, 28:3, Article 91.
WYMAN, C. AND NICHOLS, G. 2009. Adaptive Caustic Maps Using Deferred Shading. In Com-
puter Graphics forum, 28:2.
2

Adaptive Anti-Aliasing using
Conservative Rasterization
and GPU Ray Tracing
Rahul Sathe, Holger Gruen,
Adam Marrs, Josef Spjut,
Morgan McGuire, and Yury Uralsky

2.1 Introduction
Anti-aliasing is a category of techniques used to remove image artifacts that result from
inadequate sampling rates. MSAA [Akeley 1993] is a popular anti-aliasing technique
that samples visibility at a rate different from the typical shading rate of once per pixel
per primitive. Although effective in geometric anti-aliasing, MSAA incurs higher stor-
age costs due to storing depth and color samples at the sampling rate. Additionally, it
can suffer from higher bandwidth usage in the case where color compression fails to
compress the color data well. For these reasons, MSAA produces high image quality
at a relatively high cost.
Ideally, we would like the image quality of MSAA without paying the high asso-
ciated cost. When a primitive covers a pixel entirely, it is not necessary to do further
visibility calculations. When a pixel is partially covered by primitive(s), we need to
determine how much of the pixel is covered by each intersecting primitive to calculate
correct visibility. Taking advantage of this knowledge, we present an approach that
identifies “complex” or “interesting” pixels that require computing visibility more ac-
curately than a single raster sample. We then discuss two methods to compute visibility
for the identified pixels with improved accuracy.


2.2 Overview
In this section, we describe the algorithm at a high level. The flow chart of the pixel
shader used in the algorithm is shown in Figure 2.1. The goal of the algorithm is to
identify partially covered pixels and resolve the visibility of those pixels using ray tracing.
We use conservative rasterization to identify such pixels. To eliminate the fully
covered pixels that are also rasterized by conservative rasterization, we use a system
variable available in Tier 3 conservative rasterization hardware called SV_InnerCoverage.
Using this variable, we mark the fully covered pixels as “less interesting”
and mark the partially covered pixels as “interesting” in a render target. However, this
can generate a lot of false positives when interesting pixels are completely behind
fully covered pixels. To minimize this, we output the farthest depth for the fully covered
pixels. To ensure that no interesting pixels are missed, we output the closest depth
within the pixel for the partially covered pixels, even if it generates some false positives.
This is described in detail in the next section.

2.3 Pixel Classification using Conservative Rasterization


Complex pixels can be identified by analyzing the depth and/or primitive id buffers;
however, this approach is likely to miss thin geometric features that do not hit any pixel
centers. Common problem cases are subpixel projected geometry, such as cable wires
or fences at a distance, that are not sampled sufficiently by standard rasterization. We
can avoid using multiple subpixel samples during pixel classification by employing
Conservative Rasterization.¹

[Flowchart: calculate the min/max depth of the primitive using GetAttributeAtVertex and test whether the pixel is fully covered. If it is, output the max depth and mark the pixel as “not interesting”; otherwise, output the min depth and mark it as “interesting”.]

Figure 2.1. Flowchart of the pixel shader used to identify “interesting” pixels that are ray traced
for improved visibility resolution.

¹ See https://msdn.microsoft.com/en-us/library/windows/desktop/dn903791(v=vs.85).aspx.

GPU hardware support for Conservative Rasterization was introduced in Direct3D
11.3 on feature level 12 hardware. Conservative Rasterization typically refers to a rasterization
mode in which pixels partially covered by the primitive are rasterized. There
are different tiers² of this feature. At the Tier 3 level, shaders can input a system variable
called SV_InnerCoverage whose least significant bit (LSB) is set to 1 when that pixel
is guaranteed to be fully covered by the primitive.

2.3.1 Algorithm
We propose using ray tracing to determine the visibility within the pixels where raster-
ization techniques fall short. With the Tier 3 conservative rasterizer at our disposal, we
propose the following algorithm to identify the pixels that require ray tracing to resolve
the visibility further. We set the pipeline with the standard Input Assembler (IA) and
the Vertex Shader (VS), followed by a Pixel Shader (PS). PS calculates the depth equa-
tion using the derivatives (finite differences) of the position with respect to x and y. It
then uses this equation to calculate the depth conservatively (minimum or maximum)
over the entire pixel. It calculates the maximum and minimum depth over the primitive
in the PS using the new method GetAttributeAtVertex to access the per vertex
attributes. We set the rasterizer to rasterize in a conservative manner and we also assume
that Tier 3 level support is available. We set the PS with the following inputs and
outputs.

PS Inputs.
1. The screen space position (with the semantic SV_Position).
2. Inner coverage with the semantic SV_InnerCoverage.
PS Outputs.
1. Depth value with the semantic SV_Depth. This forces the depth-stencil test to
happen late (i.e., after the PS).
2. A uint value that eventually gets merged into a render target.
Pixel Shader. The goal of this shader is to:
1. Identify the pixels that are partially covered and could be potentially visible.
2. Separate them from the fully covered frontmost (visible) pixels.

² See https://msdn.microsoft.com/en-us/library/windows/desktop/ff476876(v=vs.85).aspx.

Figure 2.2. A triangle shown with its fully covered pixels as yellow and partially covered pixels
as purple.

We would like to do this without resolving the exact visibility within the partially
covered fragments. We use SV_InnerCoverage to separate partially covered pixels from
fully covered ones. We mark the interesting pixels that are partially covered by outputting
the value 0x1 into the render target. We output the value 0x0 for the fully covered
pixels. Figure 2.2 shows the pixels that are identified as fully covered in yellow
and the ones that are partially covered, and hence potentially “interesting”, in purple.
To identify the frontmost layer of the pixels, both partially and fully covered, we
assign the depth values inside the pixel shader conservatively. We output the farthest z
for the occluding fragments (fully covered pixels) and nearest z for the potentially vis-
ible fragments (partially covered pixels) as output depth (with the semantic
SV_Depth). By doing so, we guarantee that no potentially visible (and partially cov-
ered) fragment gets missed. As a result of these conservative depths, sometimes we
allow some partially covered pixels to pass the depth test even though they are occluded
in reality. This is shown in Figure 2.3(b).

Figure 2.3. We output the closest z ($z_{\min}$) of the partially covered purple
fragment and the farthest z ($z_{\max}$) of the fully covered yellow fragment. (a) shows the case of an
“interesting” pixel, and (b) shows a pixel incorrectly identified as “interesting” (false positive).

Figures 2.3(a) and 2.3(b) show a yellow fragment of a triangle perpendicular to the xz
plane covering the entire pixel, seen from the top. The box represents one pixel. As seen
in this figure, we output the closest z ($z_{\min}$) for the partially covered purple fragment
and the farthest z ($z_{\max}$) for the fully covered yellow fragment. Figure 2.3(a) shows the
case of a truly “interesting” pixel, whereas Figure 2.3(b) shows a false positive. The
goal of the algorithm is to not miss any truly “interesting” pixels with as few false
positives as possible. Once such “interesting” pixels are identified, we propose generating
rays only over these pixels using ray generation shaders in DXR and tracing
those rays to find the approximate visibility in those pixels. Pseudocode for the PS is
shown in Listing 2.1.

void main(in  float4 i_position : SV_Position,
          in  uint   m_InnerCov : SV_InnerCoverage,
          out uint   o_color    : SV_Target0,
          out float  o_depth    : SV_Depth)
{
    // Calculate the primitive max and min depths.
    float4 p0 = GetAttributeAtVertex(i_position, 0);
    float4 p1 = GetAttributeAtVertex(i_position, 1);
    float4 p2 = GetAttributeAtVertex(i_position, 2);
    float2 primDepth = CalcMinMaxDepth(p0, p1, p2);

    // Set the default min and max values.
    float zMin = MAX_VAL, zMax = MIN_VAL;

    float dzdx = ddx(i_position.z);
    float dzdy = ddy(i_position.z);

    // Calculate Z at the four corners and calculate min/max of those.
    // Depth is a monotonic function, so evaluating at the corners
    // suffices, unless the primitive is entirely contained within a
    // pixel OR the pixel contains one of its vertices.
    // The loop counters are signed because the offsets are -1 and +1.
    for (int y = -1; y <= 1; y += 2)
    {
        for (int x = -1; x <= 1; x += 2)
        {
            float2 offset = .5f * float2(x, y);
            float2 pt = i_position.xy + offset;

            // Calculate the Z, using the plane equation z = Ax+By+C.
            float z = CalculateZ(pt, dzdx, dzdy);

            // Clamp to per primitive min-max depths to avoid
            // overestimating the bounds for primitives at steep
            // angles and when one or more vertices are contained
            // within a pixel.
            z = clamp(z, primDepth.x, primDepth.y);

            zMin = min(zMin, z);
            zMax = max(zMax, z);
        }
    }

    if (m_InnerCov & 0x1)
    {
        o_color = 0x0;  // Fully covered pixel
        o_depth = zMax; // Output the farthest depth
    }
    else
    {
        o_color = 0x1;  // Partially covered ("interesting") pixel
        o_depth = zMin; // Output the nearest depth
    }
}

Listing 2.1. Pseudocode for the pixel shader.
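
The depth logic of Listing 2.1 can be traced on the CPU with a few lines of Python. This is a sketch for experimenting with the numbers, not a replacement for the shader: the function name `classify_pixel` and its parameters are hypothetical, and the depth at the pixel center together with its screen-space derivatives is assumed to be given.

```python
def classify_pixel(z_center, dzdx, dzdy, prim_z_min, prim_z_max, fully_covered):
    """Return (mark, depth) for one conservatively rasterized pixel.

    mark is 0x0 for a fully covered pixel and 0x1 for an "interesting" one.
    """
    z_min, z_max = float("inf"), float("-inf")

    # Evaluate the depth plane at the four pixel corners.
    for oy in (-0.5, 0.5):
        for ox in (-0.5, 0.5):
            z = z_center + dzdx * ox + dzdy * oy
            # Clamp to the depth range of the primitive itself.
            z = min(max(z, prim_z_min), prim_z_max)
            z_min = min(z_min, z)
            z_max = max(z_max, z)

    if fully_covered:
        return 0x0, z_max  # occluder: farthest depth
    return 0x1, z_min      # potentially visible: nearest depth


# The same pixel footprint, once fully covered and once partially covered.
occluder = classify_pixel(0.5, 0.1, 0.02, 0.0, 1.0, fully_covered=True)
edge = classify_pixel(0.5, 0.1, 0.02, 0.0, 1.0, fully_covered=False)
```

For this depth plane the corner depths span roughly 0.44 to 0.56, so the fully covered pixel writes the far end of that range and the partially covered pixel writes the near end.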

2.4 Improved Coverage and Shading Computation


Once complex pixels are identified by our conservative rasterization approach, we em-
ploy improved methods to approximate surface visibility and the integration of shaded
surface colors within the “interesting” pixels. We accomplish this by adaptively in-
creasing the subpixel sampling rate and distribution with either (a) GPU ray tracing or
(b) Pixel Shader based (non-DXR) ray tracing.

2.4.1 GPU Ray Tracing with DXR and RTX


The combination of Microsoft’s DirectX Raytracing (DXR) API and Nvidia RTX en-
ables programs to cast arbitrary rays against scene geometry in an efficient and GPU-
friendly way. Prior work by Marrs et al. [2018] shows an Nvidia Titan V GPU processes
approximately 102 million rays per second (102 Megarays) for highly divergent, adap-
tive ray workloads without dedicated hardware acceleration for ray tracing. With the
introduction of the Nvidia Turing architecture’s “RT Cores”, GPU ray tracing received
dedicated hardware acceleration and the maximum ray throughput of the GPU was
raised substantially to 10 billion rays per second (10 Gigarays).
In prior work, complex pixels were identified using screen-space heuristics, and rays
were cast on a fixed subpixel distribution of 2, 4, or 8 samples per pixel. The screen-space
criteria for complex pixel identification relied on a one-sample-per-pixel input
buffer, which is unable to reliably capture all subpixel geometry. Using our conservative
rasterization-based approach, we are able to guarantee that subpixel projected geometry
will be accounted for during the pixel identification phase.
After all pixels are classified, we create a set of rays for each interesting pixel
identified by the conservative rasterization approach by computing an origin and a
direction. Since these are all viewing (or “primary”) rays, we assume a pinhole camera
model, so all rays share a common origin (the pinhole). The ray direction is selected
based on varying sample locations, each corresponding to an offset from the pixel center.
These rays are then tested against the geometry in the scene by invoking the TraceRay()
HLSL function of the DXR API. We perform shading at these sample locations in DXR Hit
Shaders and store the reconstructed result in the framebuffer.
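The ray setup described above can be illustrated with a small CPU-side sketch. It is not the DXR ray generation shader used for the chapter; the function name primary_ray, the camera convention (pinhole at the origin looking down −Z), and the field of view are assumptions made for illustration. The sample offsets are the commonly documented Direct3D standard 4x MSAA positions.

```python
import math

def primary_ray(px, py, offset, width, height, fov_y_deg):
    """Return (origin, direction) of a camera-space primary ray.

    px, py : integer pixel coordinates, origin at the top-left corner
    offset : (dx, dy) subpixel offset from the pixel center, in pixels
    """
    # Sample position in normalized device coordinates, [-1, 1] on both axes.
    ndc_x = 2.0 * (px + 0.5 + offset[0]) / width - 1.0
    ndc_y = 1.0 - 2.0 * (py + 0.5 + offset[1]) / height

    # Assumed convention: pinhole at the origin, looking down -Z.
    tan_half = math.tan(math.radians(fov_y_deg) * 0.5)
    aspect = width / height
    d = (ndc_x * tan_half * aspect, ndc_y * tan_half, -1.0)

    inv_len = 1.0 / math.sqrt(d[0] ** 2 + d[1] ** 2 + d[2] ** 2)
    direction = tuple(c * inv_len for c in d)
    return (0.0, 0.0, 0.0), direction  # every primary ray shares the pinhole origin

# Standard 4x MSAA sample positions, in 1/16-pixel units from the pixel center.
MSAA_4X = [(-2 / 16, -6 / 16), (6 / 16, -2 / 16), (-6 / 16, 2 / 16), (2 / 16, 6 / 16)]

# One ray per subsample of the pixel (960, 540) in a 1080p frame.
rays = [primary_ray(960, 540, o, 1920, 1080, 60.0) for o in MSAA_4X]
```

In a DXR implementation, the resulting origin and direction would fill the ray description passed to TraceRay().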
While our current sample distribution follows the MSAA pattern, we could easily
use more sophisticated patterns like 8-rooks. Alternatively, movie production renderers
primarily use a more sophisticated approach called correlated multi-jittered sampling
[Kensler 2013]. This approach avoids the structured artifacts common to quasi-Monte
Carlo sequences while achieving similar quality without needing precomputation or
storage. Enterprising developers might experiment with a variety of patterns to
determine which works best for their application.
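As a starting point for such experiments, the following sketch generates a plain N-rooks pattern: an n × n grid is laid over the pixel and exactly one jittered sample is placed in every row and every column. This is the basic N-rooks construction only, not Kensler's correlated multi-jittered algorithm, and the function name n_rooks is ours.

```python
import random

def n_rooks(n, rng):
    """Return n subpixel offsets in [-0.5, 0.5), one per row and per column."""
    columns = list(range(n))
    rng.shuffle(columns)  # random permutation: each column is used exactly once
    samples = []
    for row, col in enumerate(columns):
        # Jitter the sample inside cell (col, row) of the n x n grid.
        x = (col + rng.random()) / n - 0.5
        y = (row + rng.random()) / n - 0.5
        samples.append((x, y))
    return samples

# An 8-rooks pattern with a fixed seed so that the result is reproducible.
pattern = n_rooks(8, random.Random(1234))
```

The returned offsets are measured from the pixel center, so they can be used directly as the per-sample offsets when generating primary rays.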

2.4.2 Visibility with Pixel Shader Ray Tracing


It is possible to use a variant of Listing 2.1 to ray trace in the PS while accounting for
a user-defined number of subsamples per pixel. Instead of using DXR and a bounding
volume hierarchy, the Shader Model 6.1 intrinsic GetAttributeAtVertex() is used in
combination with conservative rasterization to pass all triangles that touch a pixel to
the pixel shader, as shown in Figure 2.4.

Figure 2.4.

To render with just one geometry pass it is necessary to compute ray intersections
for each pixel that is temporarily classified as partially covered even if it is classified as
fully covered later on. In a nutshell, the algorithm works like this:
1. To support an arbitrary number N of fully customizable subsamples per pixel,
create a buffer B that is big enough to hold all per-pixel/per-sample data (see
Figure 2.5) as well as a 32-bit depth value per subsample.

Figure 2.5.

2. Clear the 32-bit depth part of all subsamples in B to the maximum or the
minimum depth value: clear to maximum depth if a LESS depth comparison mode
is used, and to minimum depth if a GREATER depth comparison mode is used.
3. Set up the same rendering pipeline as in Listing 2.1.
4. Render the main geometry pass.
   a. In the pixel shader do the following:
      i. Compute attributes at the pixel center by interpolating the three vertex
         attributes and output all interpolated per-pixel attributes to the bound
         render targets.
      ii. If SV_InnerCoverage is 1 (a simple pixel):
          1. Output zMax to SV_Depth.
          2. Output 0x0 to a render target to classify this pixel as simple.
      iii. Else (SV_InnerCoverage is 0, a complex pixel):
          1. Output zMin to SV_Depth.
          2. Output 0x1 to a render target to classify this pixel as complex.
          3. For s = 0 to N − 1 (iterate over all subsamples):
             a. Compute the intersection of the ray from the eye through the
                position of the current subsample with the plane of the current
                triangle (see Figure 2.6).
             b. If the intersection is inside the triangle:
                i. Compute the intersection depth d.
                ii. Compute all output attributes at the intersection position.
                iii. Compute the position P of s in the buffer B.
                iv. Use a thread-safe way to update s only if d is smaller than the
                    subsample depth stored at P (see below for a discussion of
                    how to do this).

Figure 2.6. Tracing rays through the subsamples of a complex pixel.

5. After the geometry pass, it is necessary to resolve all complex pixels, i.e., the
pixels whose final output in Step 4 is 0x1. For example, in a deferred rendering
setup, the subsamples of these pixels need to be lit individually before their
results are averaged.
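The per-subsample work of Step 4 and the resolve of Step 5 can be sketched on the CPU as follows. This is a sequential illustration, so the thread-safety concern discussed below is not modeled. All function names are ours, a LESS depth comparison is assumed, and the ray parameter t stands in for the depth value because it increases monotonically with distance along each ray.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(origin, direction, tri, eps=1e-9):
    """Return (t, (w0, w1, w2)) if the ray hits inside the triangle, else None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    n = cross(e1, e2)                       # unnormalized plane normal
    denom = dot(n, direction)
    if abs(denom) < eps:                    # ray is parallel to the plane
        return None
    t = dot(n, sub(v0, origin)) / denom     # ray/plane intersection
    if t <= 0.0:                            # plane is behind the eye
        return None
    p = tuple(origin[i] + t * direction[i] for i in range(3))
    # Barycentric coordinates of p = v0 + w1 * e1 + w2 * e2.
    nn = dot(n, n)
    w1 = dot(cross(sub(p, v0), e2), n) / nn
    w2 = dot(cross(e1, sub(p, v0)), n) / nn
    w0 = 1.0 - w1 - w2
    if w0 < 0.0 or w1 < 0.0 or w2 < 0.0:    # intersection lies outside the triangle
        return None
    return t, (w0, w1, w2)

def update_subsamples(tri, colors, sample_dirs, depth_buf, color_buf):
    """Steps 4.a.iii.3.a-b: depth-tested update of each subsample (LESS)."""
    origin = (0.0, 0.0, 0.0)                # pinhole camera at the origin
    for s, direction in enumerate(sample_dirs):
        hit = intersect(origin, direction, tri)
        if hit is None:
            continue
        t, w = hit
        if t < depth_buf[s]:                # keep only the nearest hit
            depth_buf[s] = t
            # Interpolate the per-vertex colors at the intersection point.
            color_buf[s] = tuple(
                sum(w[k] * colors[k][c] for k in range(3)) for c in range(3))

def resolve(color_buf):
    """Step 5: average the subsample colors of one complex pixel."""
    n = len(color_buf)
    return tuple(sum(c[i] for c in color_buf) / n for i in range(3))

# Two subsample rays; only the first one hits the triangle at z = -2.
tri = ((-1.0, -1.0, -2.0), (1.0, -1.0, -2.0), (0.0, 1.0, -2.0))
red = [(1.0, 0.0, 0.0)] * 3
dirs = [(0.0, 0.0, -1.0), (1.0, 0.0, -1.0)]
depth_buf = [float("inf")] * len(dirs)      # cleared to maximum depth for LESS
color_buf = [(0.0, 0.0, 0.0)] * len(dirs)
update_subsamples(tri, red, dirs, depth_buf, color_buf)
```

In the pixel shader, the three vertex positions and attributes would come from GetAttributeAtVertex(), and the update inside the loop would have to be made thread safe.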

For a forward rendering pipeline or other rendering algorithms that only need to
store 32 bits per pixel, a thread-safe way to update subsample data is to use 64-bit wide
interlocked operations:

1. Construct a 64-bit word W with the 32 bits of the intersection depth d in the
upper half and the 32 bits of color in the lower half.

2. Update the buffer B at position P using InterlockedMin64(P, W) for a LESS
depth-comparison function or InterlockedMax64(P, W) for a GREATER
depth-comparison function. Please note that if other depth-comparison
functions are required, then it is necessary to switch to a solution that is not
based on interlocked operations.
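A minimal sketch of why this packing works, assuming non-negative depth values (for example in [0, 1]): the bit pattern of a non-negative IEEE 754 float, read as an unsigned integer, is ordered the same way as the float itself, so a 64-bit integer minimum compares depth first. Python's built-in min stands in for the atomic InterlockedMin64 here, and the helper names are ours.

```python
import struct

def float_bits(x):
    """Reinterpret a 32-bit float as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def pack(depth, color_rgba8):
    """Depth in the upper 32 bits, color in the lower 32 bits."""
    return (float_bits(depth) << 32) | (color_rgba8 & 0xFFFFFFFF)

def unpack_depth(word):
    return struct.unpack("<f", struct.pack("<I", word >> 32))[0]

def unpack_color(word):
    return word & 0xFFFFFFFF

# One subsample slot, cleared to the maximum depth for a LESS comparison.
slot = pack(1.0, 0x00000000)

# Three fragments arrive in arbitrary order; min() emulates InterlockedMin64.
for depth, color in [(0.5, 0xFF0000FF), (0.75, 0x00FF00FF), (0.25, 0x0000FFFF)]:
    slot = min(slot, pack(depth, color))

# The slot now holds the nearest fragment: depth 0.25, color 0x0000FFFF.
```

Note that two fragments with identical depth are ordered by their color bits, which is one reason the packed approach only suits strict LESS or GREATER comparisons.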

If a larger data structure needs to be stored per pixel, or a GREATER-EQUAL or
LESS-EQUAL depth comparison function is required, then use one of the following
options:

1. If supported, use a per-pixel mutex to lock the data at position P in B and update
all subsamples. Then unlock the mutex again.

2. If Raster Order Views are supported, then render to ROVs. In this case all
updates to B are automatically thread safe.

2.5 Image Quality and Performance


Figure 2.7 shows a tower scene with many thin features (cables, tower elements) that
extend over many pixels on the screen. Figure 2.8 shows a moonscape scene with very
few thin and long features. We used these two scenes for our measurements with an
Nvidia Quadro RTX 6000, Windows 10 October Update (RS5), and Nvidia driver
version 415.82. We used Windows SDK version 10.0.17763.0 and measured all the
data at 1080p resolution. We used simple diffuse shading without texturing to focus on
the artifacts due to visibility alone.
Figure 2.7(a) shows one sample per pixel without any anti-aliasing, and
Figure 2.7(b) shows 4 rays traced per pixel for pixels whose luminance differs from that
of their 1-ring neighborhood (the 8 pixels surrounding a pixel in a 3 × 3 region) by more
than a preset threshold. Figure 2.7(c) shows the image rendered with the criterion used
for Figure 2.7(b) plus the additional pixels identified by our algorithm using conservative
rasterization. Figure 2.7(b) took 0.2 ms for the rasterization pass followed by 4.6 ms
for ray tracing. For Figure 2.7(c), the rasterization took 0.4 ms (as expected, because
we are doing two passes) and ray tracing took 4.7 ms, since ray tracing is performed on
more pixels. As one can see, long and thin cables appear more reliably with our
algorithm. It is still possible to miss some subpixel features because visibility within
those pixels is still sampled.
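The baseline luminance criterion can be sketched as follows. The chapter does not state the exact luminance weights or the threshold, so the Rec. 709 weights, the function names, and the border handling (neighbors outside the image are skipped) are assumptions.

```python
def luminance(rgb):
    # Rec. 709 luma weights; an assumption, the chapter does not specify them.
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def complex_pixels(image, threshold):
    """Mark pixels whose luminance differs from any 1-ring neighbor.

    image is a list of rows of (r, g, b) tuples. Returns a grid of booleans
    of the same size, where True means the pixel is selected for ray tracing.
    """
    h, w = len(image), len(image[0])
    lum = [[luminance(p) for p in row] for row in image]
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    # Skip the pixel itself and neighbors outside the image.
                    if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                        if abs(lum[y][x] - lum[ny][nx]) > threshold:
                            mask[y][x] = True
    return mask

black, white = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
# A single row with one edge: only the two pixels at the edge are selected.
mask = complex_pixels([[black, black, white, white]], threshold=0.1)
```

Because this test only sees one shaded sample per pixel, geometry that falls between pixel centers never changes the luminance and is missed, which is the gap the conservative rasterization pass closes.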
Figure 2.8 shows a different scene with very few thin subpixel features. Rendering
Figure 2.8(b) took 0.9 ms for the rasterization pass and 4.7 ms for ray tracing. For
Figure 2.8(c), we spent 1.2 ms in the rasterization passes and 5.4 ms in ray tracing.
Here we can see that there is hardly any image quality improvement from the additional
conservative rasterization pass.
Although our algorithm is very good at finding thin geometry that otherwise might
be missed between samples, this does not come for free. Our algorithm needs an
additional pass over the geometry with conservative rasterization in order to identify
pixels. For complex, geometry-heavy scenes this added cost might offset the image quality
improvements.

Figure 2.7. A tower scene with thin subpixel features. Top to bottom: (a) no anti-aliasing; (b)
ray tracing executes on pixels where luminance differs by more than a threshold; and (c) ray
tracing executes on pixels identified by (b) and our conservative rasterization approach.

Figure 2.8. A moonscape scene with almost no thin subpixel features. Top to bottom: (a) no
anti-aliasing; (b) ray tracing executes on pixels where luminance differs from their neighbors by
more than a threshold; and (c) ray tracing executes on pixels identified by (b) and our
conservative rasterization approach.

A side effect of conservative rasterization is that it identifies the partially covered
pixels along the internal edges of the mesh surface as candidates for ray tracing.
Although this is the correct behavior, it increases the number of ray-traced pixels
significantly in some cases, e.g., the moonscape scene. There are two ways to eliminate most
of these pixels. The first technique uses raster order views to store the vertex IDs of the
first triangle covering a pixel partially. When a subsequent triangle covers the same
pixel partially, we compare the vertex IDs of that triangle to those that are stored. If
two vertex IDs match (but in opposite order), we treat that as an internal edge and do
not mark such pixels for ray tracing. However, note that, depending on the valence of
the vertex and the submission order of the triangles around that vertex, the pixel
containing a vertex may still be marked for ray tracing. With this ROV-based approach,
the rasterization passes take significantly longer, e.g., 11.7 ms and 8.1 ms for the tower
scene and the moonscape scene respectively, due to the serialization caused by ROVs. For
the tower scene, the ray tracing cost dropped only slightly to 4.6 ms, whereas for the
moonscape scene it dropped to 5.0 ms because fewer pixels were marked for ray tracing.
In both cases the savings in ray tracing cost were more than offset by the additional
cost required to remove those pixels. We believe this is an artifact of the extremely
simplified shading we used. Table 2.1 summarizes the time spent in the passes for the
different variations of the algorithm.
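The vertex-ID comparison behind the ROV-based technique can be sketched as follows. In two consistently wound triangles that share an edge, the shared edge is traversed in opposite directions, so the test looks for a directed edge of the new triangle whose reverse belongs to the stored triangle. The function names are ours, and the ROV storage itself is not modeled.

```python
def edges(tri):
    """Directed edges of a triangle given as a tuple of three vertex IDs."""
    a, b, c = tri
    return [(a, b), (b, c), (c, a)]

def shares_internal_edge(stored_tri, new_tri):
    """True if new_tri traverses an edge of stored_tri in the opposite order."""
    stored = set(edges(stored_tri))
    return any((v, u) in stored for (u, v) in edges(new_tri))

# Triangles (0, 1, 2) and (2, 1, 3) share the edge 1-2 with opposite winding,
# so a pixel partially covered by both lies on an internal edge.
internal = shares_internal_edge((0, 1, 2), (2, 1, 3))

# Triangles that share only a single vertex are not detected, which matches
# the caveat about pixels that contain a vertex.
vertex_only = shares_internal_edge((0, 1, 2), (2, 3, 4))
```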

Scene       Method                Raster Passes (ms)   Ray Tracing (ms)   Total (ms)
Tower       Luminance criterion   0.2                  4.6                4.8
            Lumi+CRast            0.4                  4.7                5.1
            Lumi+CRast+ROV        11.7                 4.6                16.3
Moonscape   Luminance criterion   0.9                  4.7                5.6
            Lumi+CRast            1.2                  5.4                6.6
            Lumi+CRast+ROV        8.1                  5.0                13.1

Table 2.1. Summary of time spent in the different phases of the algorithm. The luminance
criterion rows are the baseline, in which only a luminance-based criterion selects the pixels
for ray tracing. Lumi+CRast adds the conservative rasterization pass, and Lumi+CRast+ROV
additionally removes internal-edge pixels using ROVs. The conservative rasterization-based
algorithm without ROVs seems like a good choice only for scenes with long thin features.

Note that in our measurements we had to use the geometry shader to get the vertex
IDs because the shader compiler dxc that Microsoft distributes with SDK
10.0.17763.0 failed to compile the shaders with GetAttributeAtVertex().
An alternative solution to this internal edge issue is to use the geometry shader (or
the fast geometry shader on Nvidia hardware) along with the adjacency topology to
identify the silhouette edges. The pixel shader can use this information to mark partially
covered pixels only along the silhouette triangles, but this technique requires the use
of the geometry shader.

2.6 Future work


We are experimenting with other algorithms to classify pixels and determine how many
rays to cast when using GPU ray tracing. These techniques involve analyzing the
following:
1. Differences in luminance of neighboring pixels.
2. Differences in the primitive IDs of the neighboring pixels.
3. Differences in material IDs or material parameters of neighboring pixels.
4. Differences in depth or surface normal of neighboring pixels.
5. Combinations of 1–4.
6. Temporal variants of combinations of 1–4.

2.7 Demo
We provide a demo running on Nvidia GPUs for the proposed technique.

Bibliography
AKELEY, K. 1993. Reality Engine Graphics. In Proceedings of SIGGRAPH '93, pp. 109–116.
KENSLER, A. 2013. Correlated multi-jittered sampling. In Mathematical Physics and applied
mathematics, 7, pp. 86–112.
MARRS, A., SPJUT, J., GRUEN, H., SATHE, R., AND MCGUIRE, M. 2018. Adaptive Temporal
Antialiasing. High Performance Graphics 2018.
