You are on page 1of 24

A𝛿: Autodiff for Discontinuous Programs – Applied to Shaders

YUTING YANG, Princeton University, U.S.


CONNELLY BARNES, Adobe Research, U.S.
ANDREW ADAMS, Adobe Research, U.S.
ADAM FINKELSTEIN, Princeton University, U.S.

(a) Target (b) Hand-chosen (c) Configurations (d) Optimization (e) Modified-d (f) Hybrid-b/d

Fig. 1. Our compiler automatically differentiates discontinuous shader programs by extending reverse-mode automatic differentiation (AD) with novel
differentiation rules. This allows efficient gradient-based optimization methods to optimize program parameters to best match a target (a), which is difficult to
do by hand (b). Our pipeline takes as input a shader program initialized with configurations (c) that look very different from the reference, and converges to be
nearly visually identical (d) within 15s. The compiler can also output the shader program with optimized parameters to GLSL, which allows programmers to
interactively edit or animate the shader, such as adding texture (e). The optimized parameters can also be combined with other shader programs (e.g. b) to
leverage their visual appearance while keeping the geometry close to the reference. For animation results please refer to our supplemental video.

Over the last decade, automatic differentiation (AD) has profoundly im- compiler also outputs GLSL that renders the target image, allowing users
pacted graphics and vision applications — both broadly via deep learning to interactively modify and animate the shader, which would otherwise be
and specifically for inverse rendering. Traditional AD methods ignore gradi- cumbersome in other representations such as triangle meshes or vector art.
ents at discontinuities, instead treating functions as continuous. Rendering
CCS Concepts: • Mathematics of computing → Automatic differentia-
algorithms intrinsically rely on discontinuities, crucial at object silhouettes
tion; • Software and its engineering → Domain specific languages.
and in general for any branching operation. Researchers have proposed
fully-automatic differentiation approaches for handling discontinuities by Additional Key Words and Phrases: Automatic Differentiation, Differentiable
restricting to affine functions, or semi-automatic processes restricted either Programming, Differentiable Rendering, Domain-Specific Language
to invertible functions or to specialized applications like vector graphics. ACM Reference Format:
This paper describes a compiler-based approach to extend reverse mode Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein. 2022.
AD so as to accept arbitrary programs involving discontinuities. Our novel A𝛿: Autodiff for Discontinuous Programs – Applied to Shaders. ACM Trans.
gradient rules generalize differentiation to work correctly, assuming there is Graph. 41, 4, Article 135 (July 2022), 24 pages. https://doi.org/10.1145/3528223.
a single discontinuity in a local neighborhood, by approximating the pre- 3530125
filtered gradient over a box kernel oriented along a 1D sampling axis. We
describe when such approximation rules are first-order correct, and show 1 INTRODUCTION
that this correctness criterion applies to a relatively broad class of functions.
Moreover, we show that the method is effective in practice for arbitrary Many graphics and vision optimization tasks rely on gradients.
programs, including features for which we cannot prove correctness. We When outputs can be expressed as explicit functions of given param-
evaluate this approach on procedural shader programs, where the task is eters, automatic differentiation (AD) can provide gradients. How-
to optimize unknown parameters in order to match a target image, and our ever, most AD methods assume that such functions are continuous
method outperforms baselines in terms of both convergence and efficiency. with respect to the input parameters, and produce incorrect gradi-
Our compiler outputs gradient programs in TensorFlow, PyTorch (for quick ents at discontinuities resulting from if/else branches, for example.
prototypes) and Halide with an optional auto-scheduler (for efficiency). The Such AD-based methods therefore struggle to optimize functions
involving factors like object boundaries, visibility, and ordering.
Authors’ addresses: Yuting Yang, Princeton University, U.S., yutingy@princeton.edu;
Connelly Barnes, Adobe Research, U.S., cbarnes@adobe.com; Andrew Adams, Adobe In certain cases, the gradient at a discontinuity can be computed
Research, U.S., andrew.b.adams@gmail.com; Adam Finkelstein, Princeton University, analytically. For example the derivative of a step function is the
U.S., af@cs.princeton.edu.
Dirac delta distribution (informally, infinity at the discontinuity
Permission to make digital or hard copies of part or all of this work for personal or and zero elsewhere). Likewise, the gradient of certain pre-filtered
classroom use is granted without fee provided that copies are not made or distributed discontinuous functions can be derived analytically as a convolution
for profit or commercial advantage and that copies bear this notice and the full citation with Dirac deltas1 . Building on these properties, this paper proposes
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s). a framework and compiler we call A𝛿 for automatic differentiation
© 2022 Copyright held by the owner/author(s). of programs – including proper differentiation of the discontinuities.
0730-0301/2022/7-ART135
https://doi.org/10.1145/3528223.3530125 1 Technically, such Dirac delta distributions act on a test function.

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
135:2 • Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein

Gradient Combine Gradient Optimization


Input Rules §4.2 Multiple Program
DSL Sampling Gradient Randomize §7.2
§4.1 Prefilter X1 Axes Correctness TensorFlow
§6.1
Prefilter Xn Theoretical §5 PyTorch
Animation
Quantitative §7.3 Halide §7.1 GLSL §8.2
Color Code: [General Method] [Shader Specific] [Backend]

Fig. 2. Overview: green boxes indicate general components; red boxes are components specific to shaders; blue boxes indicate specific backend languages that
our compiler outputs to (details in gray). Our compiler takes as input an arbitrary program in our DSL (§ 4.1), and approximates the gradient by pre-filtering a
1D box kernel along sampling axes (§ 4.2). Approximations along multiple sampling axes are later combined (§ 6.1). We verify that our gradients are accurate in
two ways: we prove that a subset of programs are first-order correct (§ 5), and we also design a quantitative error metric (§ 7.3) to evaluate any gradient
program empirically. For practical applications, our compiler outputs the gradient program to three backends: TensorFlow, PyTorch and Halide. For efficiency
we further explore the scheduling space in Halide that trades-off register usage vs memory I/O and provide an optional autotuner (§ 7.1). The gradient program
can then be applied to optimization tasks that find optimal parameters for a shader program to best match a reference image. We find it helpful to add per
pixel random noise to Dirac parameters, as this makes the discontinuities observable in more pixel locations (§ 7.2). Finally, the program representation with
optimized parameters can be output to GLSL, which allows interactive animation.

We develop gradient rules for efficient automatic differentiation with accurate for higher sample frequencies as continuous functions ex-
respect to input parameters that discontinuous operators depend pressible in our DSL are locally Lipschitz continuous. A2 is the key
on, which we call Dirac parameters because their partial derivatives that allows us to efficiently expand to a larger set of programs. Be-
often contain Dirac deltas. (They are defined formally in Section 4.1.) cause we can use function and gradient values at nearby sample
Graphics researchers have derived several application-specific locations as proxies, we do not need extra samples to locate the
solutions for differentiating Dirac parameters, targeting specialized discontinuity, or limit the discontinuity to certain types of functions
domains such spline shapes for vector graphics [Li et al. 2020] or tri- that can be easily inverted. Additionally, as will be shown in Sec-
angle meshes rendered in a path tracer [Bangaru et al. 2020; Li et al. tion 3, A2 also allows efficient reverse-mode AD, because gradients
2018a; Loubet et al. 2019]. While they address these specific domains, with respect to the sampling axis can be approximated by finite dif-
they are not readily adapted to arbitrary functions. TEG [Bangaru ferences. Lastly, A3 allows us to evaluate most discontinuities using
et al. 2021] on the other hand, systematically differentiates para- minimal samples, along the sampling axis. In case one sampling axis
metric discontinuities on a limited scope of programs. Their system is not sufficient to observe every discontinuity, our framework can
correctly handles discontinuities that are represented by differen- also combine approximations from multiple sampling axes.
tiable and invertible functions, and is only fully automatic when the Beyond proposing gradient rules for efficiently back-propagating
discontinuities are represented by affine transformations: in all other through discontinuous programs, this paper investigates other ques-
cases, the inversion or reparameterization needs to be provided by tions such as how to evaluate the quality of our gradient and how
the programmer. This leaves out many real world programming to generate efficient GPU code for the gradients.
patterns such as discontinuity compositions, or discontinuities rep- We demonstrate the applicability of the gradient program in
resented by non-invertible functions. the domain of procedural shader programs capable of rendering
Similarly to TEG, our method also targets general discontinuous graphical designs. Using these gradients we are able to optimize the
programs, written in our domain specific language (DSL). But in- parameters of such shaders to match target reference images that
stead of proving correctness under a limited set of programs, we we found on the Web. The optimization does so more effectively
make several assumptions that allow us to handle a broader set of and quickly than using baseline methods such as finite differences,
programs. Our compiler approximates the pre-filtered gradient over and even for programs that occasionally violate assumptions A1-3.
a 1D box kernel, where we denote the kernel orientation as the Figure 2 shows an overview of our framework. The primary con-
sampling axis, under three assumptions: tribution of this paper a set of approximate derivative rules that can
be applied to a large set of general programs. We show for a subset of
(A1) There is at most one discontinuity between each sample and programs, the approximation error is bounded by a first order term
its nearest neighbor along the sampling axis (a sample pair). scaled by the size of the pre-filtering kernel. A second contribution is
(A2) Function values and some partial derivatives within the com- the implementation of a system that efficiently carries out practical
putation at a sample pair can be used to estimate gradients applications in shader programs. A third contribution is the design of
at locations between the pair. a novel error metric to quantitatively evaluate our gradient approxi-
(A3) Most discontinuities can be projected to the sampling axis. mation. Our code is available at: https://github.com/yyuting/Adelta.
Intuitively, A1 can be easily achieved in most locations with a high
enough sample frequency. Similarly, we also show A2 becomes more

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
A𝛿 : Autodiff for Discontinuous Programs – Applied to Shaders • 135:3

Table 1. Comparison between ours and related work on differentiating dis- out the computation of these specialized gradients [Nimier-David
continuous programs: traditional Auto-Differentiation (AD); finite difference et al. 2020; Zeltner et al. 2021; Zhang et al. 2021a]. While these
(FD); TEG [Bangaru et al. 2021]; differentiable vector graphics [Li et al. 2020]
manually derived rules are correct and efficient for the particular
and diffrentiable path tracers (DPT) [Bangaru et al. 2020; Li et al. 2018a;
Loubet et al. 2019]. We compare these methods under four criteria: whether application, they are limited to a small subset of programs. Our
they can sample discontinuities, whether the method can reduce to AD method on the other hand, is general, and can be applied to arbitrary
in the absence of discontinuities, time complexity in terms of how many programs expressible in the syntax of our DSL.
evaluations of the original program are needed as a function of parameter di-
mension 𝑛, and what set of programs each method can handle. Our method
handles every program expressible in our DSL (Section 4.1); AD and finite
difference work with arbitrary programs; TEG works with a limited subset of Replacing discontinuities with continuous proxies. Another strat-
programs whose discontinuities are represented by diffeomorphisms (Diff); egy for differentiating parametric discontinuities is to use a process
while task-specific methods DVG and DPT only apply to their specific tasks: such as smoothing to replace the original function with a continuous
vector graphics (VG) and path tracer (PT) respectively. proxy before taking the gradient. Although similar to pre-filtering,
these methods only differentiates the proxy, and does not attempt
Taxonomy Ours AD FD TEG DVG DPT to sample discontinuities directly. Soft rasterizer [Liu et al. 2019]
Discontinuities ✓ × ✓ ✓ ✓ ✓ replaces step discontinuities with sigmoids and builds a continu-
Reduce to AD ✓ ✓ × ✓ ✓ ✓ ously differentiable path tracer, but is application-specific and has
Time Complexity O(1) O(1) O(n) O(1) O(1) O(1) no formal guarantees for the approximation. Another approach,
Generality DSL All All Diff VG PT mean-variance program smoothing [Yang and Barnes 2018], can
correctly smooth out a certain class of procedural shader programs,
and briefly discusses using AD to derive one approximation term;
but it does not investigate differentiation further and is unable to
2 RELATED WORK scale to complicated programs. In contrast, our method applies to a
Automatic differentiation of parametric discontinuities. TEG [Ban- broader set of programs, and we show our approximation has low
garu et al. 2021] systematically differentiates integrals with discon- error both by mathematical proof and by evaluating under a quanti-
tinuities. When the program is posed as integrals of discontinuous tative metric. Researchers have also investigated a variety of strate-
functions, TEG correctly differentiates the program by eliminating gies for converting complicated functions with neural proxies, for
Dirac deltas residing within the integrals. The remaining integral example: approximating a “black-box” ISP camera pipeline [Tseng
dimensions are sampled and differentiated using trapezoidal rule. et al. 2019], using neural textures or neural implicit 3D representa-
However, the set of programs TEG can correctly handle is restricted. tion for differentiable rendering [Jiang et al. 2020; Niemeyer et al.
In order to correctly eliminate Dirac deltas, the discontinuities are 2019; Park et al. 2019; Sitzmann et al. 2019; Thies et al. 2019], and
limited to be represented by differentiable, invertible functions, and using NeRF as a surrogate for geometry and reflectance [Martin-
TEG can only automatically handle the affine case. All other cases Brualla et al. 2021; Mildenhall et al. 2020]. These representations are
rely on programmer provided inversion or reparameterization. This inherently differentiable, and leverage all of the recent progress in
restriction limits the set of programs that can benefit from their au- neural representations, but the resulting representation is a network
tomatic pipeline: composition of discontinuities and non-invertible which is difficult to interpret and manipulate in general ways. In
functions are excluded entirely, and non-affine invertible functions contrast our approach provides gradients in the original program
require extra manual effort to define the inverse for each of them. representaion (and does so quickly relative to the training of most
Unlike TEG, our method approximates the gradient of discontinuous neural methods).
programs, with a weaker correctness guarantee of the error being
first order in the step size for sufficiently small steps, and we show
this theoretical result applies to a larger set of programs. Moreover,
Section 8.2 shows empirically that our method can also handle a Scheduling efficient Halide kernels. To allow practical applications,
larger set of shaders than the set we analyze theoretically – and our compiler targets Halide as one of the backends and provides a
which is expressive enough to reconstruct real world images found simplified scheduling space with an optional autoscheduler. While
online. Table 1 compares our method with TEG and other baselines most state-of-the-art Halide autoschedulers focus on efficiently
such as traditional AD ([Moses et al. 2021; Nimier-David et al. 2019]) scheduling nested image pipelines and utilize scheduling choices
and finite difference, as well as application-specific methods that such as tiling and fusion [Adams et al. 2019; Mullapudi et al. 2016;
directly differentiate discontinuities, discussed next. Sioutas et al. 2019], our Halide backend targets the original and
gradient program for procedural pixel shaders, and the performance
Application-specific differentiation of parametric discontinuities. bottleneck in our case is caused by long gradient tapes that can
Many application-specific methods differentiate pre-filterings of cause register spilling inside GPU kernels. Li et al. [2018b] proposed
the discontinuous functions. For example, in the domain of vector scheduling options for the gradient of image pipelines. However,
graphics [Li et al. 2020], path tracers [Bangaru et al. 2020; Li et al. their work focuses on efficiently scheduling convolutional scattering
2018a; Loubet et al. 2019; Zhang et al. 2021b], and other physics- and gather operations (e.g. convolution and its gradient), whereas
based renderers [Zhou et al. 2021], specialized rules are derived ours focuses on trade-offs between avoiding register spilling and
analytically, and algorithms are also designed for efficiently carrying minimizing memory I/O.

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
135:4 • Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein

3 MOTIVATION Table 2. Gradient rules for our compiler and traditional AD. ∗ : in our com-
piler implementation (but not our theoretical results), to avoid numerical
We begin by describing a simple example: 𝑓 (𝑥, 𝜃 ) = 𝐻 (𝑥 + 𝜃 ). Here instability, the division in our function composition rule is safeguarded, and
𝑥 is a sampling axis along which we sample discontinuities, and evaluates to ℎ′ whenever |𝑔+ − 𝑔− | ≤ 10−4 .
which we will discuss in more depth shortly; and 𝜃 is a parameter for
which we wish to obtain a derivative. 𝐻 is a Heaviside step function Op Ours (k = O) AD (k = AD)
that evaluates to 1 when 𝑥 + 𝜃 ≥ 0, and 0 otherwise. Mathematically,  𝜕𝑘 𝑔
if 𝐻 (𝑔+ ) ≠ 𝐻 (𝑔− )
 𝜕𝜃
the gradient of this step function is a Dirac delta distribution 𝛿, 𝜕𝑘 𝐻 (𝑔) 

|𝑔+ −𝑔− | 0
𝜕𝜃
which informally evaluates to +∞ at the discontinuity, 0 otherwise, 0
 else
and integrates over the reals to one. In real-world applications, 𝜕𝑘 (𝑔+ℎ)

𝜕𝑘 𝑔 𝜕𝑘 ℎ 𝜕𝑘 𝑔
𝜕𝜃 + 𝜕𝜃 + 𝜕𝜕𝜃
𝑘ℎ
differentiating discontinuous functions is usually approximated by 𝜕𝜃
𝜕𝑘 (𝑔·ℎ)
𝜕𝜃
1 (ℎ + + ℎ − ) 𝜕𝑘 𝑔 + 1 (𝑔+ + 𝑔− ) 𝜕𝑘 ℎ 𝜕𝑘 𝑔
+ 𝑔 𝜕𝜕𝜃
𝑘ℎ
first pre-filtering with kernel to avoid the need to exactly sample 𝜕𝜃 ( 𝜕 2𝑔 𝜕𝜃 2 𝜕𝜃 ℎ 𝜕𝜃
the discontinuity, which is measure zero. For example, pre-filtering 𝜕𝑘 ℎ (𝑔) ℎ ′ 𝜕𝜃
𝑘
if ℎ(𝑔) is statically differentiable
ℎ ′ 𝜕𝜃
𝜕 𝑔
𝑘
over a 1D box kernel in the 𝑥 dimension: ℎ (𝑔+ )−ℎ (𝑔− ) 𝜕𝑘 𝑔
∫ ∫
𝜕𝜃
𝑔+ −𝑔− 𝜕𝜃 otherwise∗
𝜕
𝐻 (𝑥 ′ +𝜃 )𝜙 (𝑥 −𝑥 ′ )𝑑𝑥 ′ = 𝛿 (𝑥 ′ +𝜃 )𝜙 (𝑥 −𝑥 ′ )𝑑𝑥 ′ = 𝜙 (𝑥 +𝜃 )
𝜕𝜃
function in Section 6.2 and a ray-marching construct in Appendix D.
Here we use 𝜙 to represent the P.D.F. of the uniform continuous
1 if 𝑥 +𝜃 ∈ [−𝜖, 𝜖], After presenting the minimal DSL, we will present our gradient
distribution 𝑈 [−𝜖, 𝜖]. The gradient evaluates to 2𝜖
rules which can be used to extend typical reverse-mode AD.
and 0 elsewhere. Note that because the discontinuity depends on
both 𝑥 and 𝜃 , we can differentiate with respect to (wrt) 𝜃 while
4.1 Our Minimal DSL Syntax
pre-filtering along 𝑥.
A key motivation of our approach is that in many applications, We formally define the set of programs expressible in our minimal
there are a few dimensions that most parametric discontinuities DSL using Backus-Naur form. The set of all programs expressible
depend on, such as time for audio or physics simulation programs, in our language can be defined as below, where 𝐶 represents any
or the 2D image axes for shader programs. As a result, the computa- constant scalar value, 𝑥 represents any variable that is a sampling
tional challenge of sampling discontinuities in high dimensions can axis, 𝜃 represents any parameters we want to differentiate wrt, and
be greatly reduced by placing samples along these axes, which are 𝑓 are continuous atomic functions supported by our DSL (presently,
much lower dimension than the entire parameter space. We denote sin, cos, exp, log, and pow with constant exponent).
them as sampling axes. In principle, sampling axes can be arbitrary, 𝑒𝑑 ::= 𝐶 | 𝑥 | 𝜃 | 𝑒𝑑 + 𝑒𝑑 | 𝑒𝑑 · 𝑒𝑑 | 𝐻 (𝑒𝑑 ) | 𝑓 (𝑒𝑑 )
and we do not need every discontinuity to be projected on a single
Using this syntax, we formally define Dirac parameters as any
axis. For a set of sampling axes, as long as discontinuities of interest
parameters 𝜃 that expressions of the form of 𝐻 (𝑒𝑑 ) depend upon.
project to one of them, their gradient will be included.
We propose to approximate the gradient wrt every parameter by
4.2 Our Gradient Rules
first prefiltering using a 1D box kernel on the sampling axes. For
example, for a continuous function 𝑐, we can differentiate 𝐻 (𝑐 (𝑥, 𝜃 )) This subsection formally defines our pre-filtering process, and presents
pre-filtered by a kernel 𝜙 (𝑥) wrt 𝜃 as follows, assuming 𝑑𝑥 𝑑𝑐 ≠ 0 at novel gradient rules that approximate the derivatives of the pre-
the discontinuity 𝑥𝑑 , and apply Dirac Delta’s scaling property. filtered function. We define a function 𝑓 : dom(𝑓 ) → R that maps
∫ ∫ a subset of R𝑛+1 to a scalar output in R. For prefiltering purposes,
𝜕
𝐻 (𝑐 (𝑥 ′, 𝜃 ))𝜙 (𝑥 − 𝑥 ′ )𝑑𝑥 ′ =
𝑑𝑐
𝛿 (𝑐 (𝑥 ′, 𝜃 )) 𝜙 (𝑥 − 𝑥 ′ )𝑑𝑥 ′ we assume 𝑓 to be locally integrable. We define dom(𝑓 ) to be the
𝜕𝜃 𝑑𝜃 set {(𝑥, 𝜽® ) ∈ R𝑛+1 : 𝑓 is not computationally singular at (𝑥, 𝜽® )},
∫ 𝛿 (𝑥 ′ − 𝑥 ) 𝑑𝑐 𝑑𝑐 where computationally singular is defined in Appendix A.1 as the
𝑑 𝑑𝜃
= 𝜙 (𝑥 − 𝑥 ′ )𝑑𝑥 ′ = 𝑑𝜃 |𝑥𝑑 𝜙 (𝑥 − 𝑥𝑑 ) (1) points where any intermediate value in the program is undefined.
𝑑𝑐 |
| 𝑑𝑥 𝑑𝑐 |
| 𝑑𝑥
Note that our framework and implementation also support multi-
We choose 1D box (boxcar) kernels to minimize the extra compute dimensional outputs R𝑘 such as RGB colors for 𝑘 = 3, but since
needed for locating the discontinuity 𝑥𝑑 and computing 𝜙 (𝑥 − 𝑥𝑑 ). the same gradient process is applied to each output independently,
Previous work either relies on simplifying assumptions such as 𝑐 is for a simpler notation but without loss of generality we assume
invertible [Bangaru et al. 2021], or has to use recursive algorithms the codomain of 𝑓 is R. In our compiler, multidimensional outputs
to find exact location of 𝑥𝑑 [Li et al. 2020]. Unlike previous work, are implemented for efficiency using a single reverse mode pass as
because a box kernel 𝜙 is piece-wise constant, we can simplify described in Section 6. We pre-filter 𝑓 by convolving with a box
computing 𝜙 (𝑥 − 𝑥𝑑 ) into sampling whether 𝑥 − 𝑥𝑑 ∈ [−𝜖, 𝜖]. kernel along sampling axis 𝑥, giving pre-filtered function 𝑓ˆ.
∫ 𝑥+𝛽𝜖
4 OUR MINIMAL DSL AND GRADIENT RULES 1
𝑓ˆ(𝑥, 𝜽® ; 𝜖) = 𝑓 (𝑥 ′, 𝜽® )𝑑𝑥 ′
(𝛼 + 𝛽)𝜖 𝑥−𝛼𝜖
This section formally defines the set of programs expressible in a
minimalistic formulation of our domain specific language (DSL). We Here 𝜽® is the vector of parameters that we wish to differentiate with
present the minimal DSL first to simplify the exposition, but later respect to, and 𝛼, 𝛽 are non-negative constants for each pre-filtering
in the paper we extend our DSL to include a ternary if or select with 𝛼 + 𝛽 > 0, that control the box’s location.

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
A𝛿 : Autodiff for Discontinuous Programs – Applied to Shaders • 135:5

In our approximation, we locate discontinuities by placing two Because 𝐻 (𝑥 + 𝜃 ) only samples exactly at the discontinuity with
samples at each end of the kernel support, denoted as 𝑓 + and 𝑓 − measure zero, in practice the above expressions will always eval-
respectively. When 𝜖 is small enough, 𝑓 + and 𝑓 − can be viewed as uate to either 𝜖1 or 0, both leaving 𝐴𝐷
𝜕 𝑓
𝜕𝜃 incorrect. Intuitively, AD
approximating the right and left limit of 𝑓 . fails because it treats the function as continuous, and therefore is
Both ours and AD approximate the derivative of functions, with always biased to either side of the branch. Because TEG’s [2021]
differentiation rules summarized in Table 2. We further denote gra- multiplication rule is equivalent to that of AD, this also leads to the
dient approximations as 𝜕𝑘 , where 𝑘 ∈ {𝑂, 𝐴𝐷 } indicates ours and degeneracy discussed in their Section 4.6: differentiating the multi-
the traditional AD rule respectively. These rules contain a minimum plication of two identical step functions involves multiplication of a
set of operations from which any program from the set 𝑒𝑑 can be step function with a Dirac Delta, both being singular at the same
composed. For example, 𝑔 − ℎ = 𝑔 + (−1) · ℎ and 𝑔/ℎ = 𝑔 · (ℎ) −1 . position. Thus integrals involving such multiplication are undefined.
Boolean operators can be rewritten into compositions of step func- Unlike TEG and AD, our multiplication rule samples on both sides
tions based on De Morgan’s law. Forward mode AD can be carried of the branch, and therefore robustly handles this case.
out by applying Table 2 directly. Reverse-mode AD derivative rules
can be obtained by replacing in Table 2 𝜃 with the input argument 𝑔 𝜕𝑂 𝑓 𝐻+ + 𝐻− 1 𝐻+ + 𝐻− 1 1 𝜕 𝑓ˆ
= + = =
or ℎ, and treating the other input argument as constant for addition 𝜕𝜃 2 |2𝜖 | 2 |2𝜖 | 2𝜖 𝜕𝜃
and multiplication. When combined with reverse-mode differen- Function composition. Similarly to multiplication, Appendix B de-
tiation on 𝑛 parameters, both ours and traditional AD have O(1) scribes how the AD rule also fails for a similar example 𝑓 = 𝐻 (𝑥 +𝜃 ) 2
complexity. But AD can be faster due to its simpler rules. This is in when viewed as a square function. Note our approximation applies
contrast with finite difference, which has O(n) complexity and is different rules based on whether ℎ(𝑔) is statically differentiable.
inefficient for programs with many parameters. Static differentiability of ℎ(𝑔) means either ℎ(𝑔) is statically contin-
uous or ℎ(𝑔) is not statically dependent on 𝑥. For static continuity,
Heaviside step. Our rule is motivated by Equation 1 with 𝜙 being
𝜕𝑔 the compiler applies static analysis to the program, and decides
a 1D box kernel. But instead of computing 𝜕𝑥 analytically, we ap- static continuity based on whether each node depends on any dis-
proximate it using finite difference to avoid extra back-propagation continuous operators in its compute graph.
passes. Because 𝑔+ , 𝑔− are computed as an intermediate value as
part of the computation of 𝐻 (𝑔+ ), 𝐻 (𝑔− ), no extra computational 5 FIRST-ORDER CORRECTNESS
passes are needed for the finite difference.
In this section, we formally define the notion of first-order correctness
Multiplication. Our derivative rules work correctly when there of a gradient approximation. Then we characterize the subset of
is at most one discontinuity in the local region. However, when programs for which ours and AD is first-order correct. This section
authoring programs, intermediate values that depend on the same targets readers who are interested in the mathematical underpin-
discontinuity may further interact with each other, leading to multi- nings of our framework, but those who are more interested in its
plications where both arguments are discontinuous. For example, in application can safely skip to the next section.
shader programs, the lighting model to a 3D geometry may depend
on discontinuous vectors such as surface normal, point light direc- 5.1 First-order Correctness Definition
tion, reflection direction, or half-way vectors. These vectors can be We define absolutely and relatively first-order correct gradient ap-
discontinuous at the intersection of different surfaces, at the edge proximations, where absolutely is used for local regions where 𝑓
where a foreground object occludes a background object. When is continuous and relatively is used otherwise. Intuitively, these
computing the intensity, these vectors are usually normalized first, say that the partial derivative approximation matches prefiltered
therefore each of the discontinuous elements 𝑛 needs to be squared derivatives from 𝑓ˆ up to error 𝑂 (𝜖).
and be expressed as 𝑛 · 𝑛. For simplicity, we assume 𝑛 is a Heavi- 𝜕 𝑓
side step function and motivate our rule by showing differentiating Definition 1. A gradient approximation 𝜕𝜃 𝑘
is absolutely first-
𝑖
𝑓 = 𝐻 (𝑥 + 𝜃 ) · 𝐻 (𝑥 + 𝜃 ) using the AD rule is already incorrect. ®
order correct for parameter 𝜃𝑖 at (𝑥, 𝜽 ) with kernel size 𝜖 if
Because 𝑓 = 𝐻 (𝑥 + 𝜃 ) · 𝐻 (𝑥 + 𝜃 ) can be simplified into 𝐻 (𝑥 + 𝜃 ),
𝜕𝑘 𝑓 𝜕 𝑓ˆ
its pre-filtered gradient is already discussed in Section 3. Assuming (𝑥, 𝜽® ; 𝜖) = (𝑥, 𝜽® ; 𝜖) + 𝑂 (𝜖) (2)
𝜕𝜃𝑖 𝜕𝜃𝑖
a discontinuity sampled within the kernel, we can plug in 𝑐 (𝑥, 𝜃 ) =
𝑥 + 𝜃 and 𝜙 (𝑥 − 𝑥𝑑 ) = 2𝜖 1 into Equation 1 and get the following:
𝜕 𝑓
Definition 2. A gradient approximation 𝜕𝜃 𝑘
is relatively first-
𝜕 𝑓ˆ
𝑖
1 ®
(𝑥, 𝜃 ; 𝜖) = order correct for parameter 𝜃𝑖 at (𝑥, 𝜽 ) with kernel size 𝜖 if
𝜕𝜃 2𝜖
𝜕𝑘 𝑓 𝜕 𝑓ˆ
Directly differentiating with the AD rule leads to zero because AD (𝑥, 𝜽® ; 𝜖) / (𝑥, 𝜽® ; 𝜖) = 1 + 𝑂 (𝜖) (3)
cannot correctly handle step functions. Even if we replace the Heav- 𝜕𝜃𝑖 𝜕𝜃𝑖
iside step gradient rule with ours and use the AD multiplication
rule on 𝑓 = 𝐻 · 𝐻 , we still get the following incorrect result: 5.2 Subsets of Our Minimal DSL
𝜕𝐴𝐷 𝑓 1 1 𝐻 (𝑥 + 𝜃 ) 𝜕 𝑓ˆ In general, we do not guarantee first-order correct gradient approx-
= 𝐻 (𝑥 + 𝜃 ) + 𝐻 (𝑥 + 𝜃 ) = ≠ imation for every program in 𝑒𝑑 , although we do show empirically
𝜕𝜃 |2𝜖 | |2𝜖 | 𝜖 𝜕𝜃

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
135:6 • Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein

in Sections 8 that ours typically has low error and works in practice Definition 3. Given a function 𝑓 , the continuous set 𝐶𝑖 is the set
for optimizing shader parameters. To show first-order correctness, of all points (𝑥, 𝜽® ) in dom(𝑓 ) where 𝑓 is continuous with respect to 𝜃𝑖 ,
we progressively define 𝑒𝑑 from smaller subsets whose correctness or 𝑓 has a discontinuity with second order root (meaning there exists
can be shown. an intermediate value 𝐻 (𝑔) where 𝑔 and 𝜕𝑔/𝜕𝑥 evaluate to zero). The
𝑒𝑎 ::= 𝐶 | 𝑥 | 𝜃 | 𝑒𝑎 + 𝑒𝑎 | 𝑒𝑎 · 𝑒𝑎 | 𝑓 (𝑒𝑎 ) discontinuous set 𝐷𝑖 = dom(𝑓 ) \ 𝐶𝑖 .
𝑒𝑏 ::= 𝐻 (𝑒𝑎 ) | 𝐻 (𝑒𝑏 ) | 𝐶 · 𝑒𝑏 | 𝑒𝑏 + 𝑒𝑏 | 𝑒𝑏 · 𝑒𝑏 | 𝑓 (𝑒𝑏 ) In Appendix Definition 15, we excluded discontinuities with roots
𝑒𝑐 ::= 𝑒𝑎 | 𝑒𝑏 | 𝑒𝑐 + 𝑒𝑐 | 𝑒𝑐 · 𝑒𝑐 | 𝑓 (𝑒𝑐 ) of order 3 or above in the 𝑒¯𝑏 , 𝑒¯𝑐 definition, so this leaves disconti-
nuities with roots of order 2 or below in 𝑒¯𝑏 and 𝑒¯𝑐 . Given 𝐻 (𝑔),
𝑒𝑑 ::= 𝑒𝑐 | 𝐻 (𝑒𝑑 ) | 𝑒𝑑 + 𝑒𝑑 | 𝑒𝑑 · 𝑒𝑑 | 𝑓 (𝑒𝑑 )
the second order roots for an intermediate value 𝑔 are of the form
𝑒𝑎 represents all continuous programs that can be expressed in that they touch the x axis at a point without crossing it, so along x,
our DSL. Both ours and AD is correct for this set. For example, a the prefiltered derivative ignores this point and this point could be
color palette that smoothly changes color according to time and removed from the prefiltered gradient without changing it. Thus,
pixel coordinate belongs to this set (Figure 3 (a)). we compare ours with a prefiltered gradient that we consider contin-
𝑒𝑏 represents a subset of piece-wise constant discontinuous pro- uous (wrt x) at this point, so this is why we place the discontinuities
grams, whose discontinuities are either represented by continuous with second order roots in 𝐶𝑖 .
functions, or another 𝑒𝑏 function. Our gradient is correct almost In most cases of practical interest, the discontinuous sets 𝐷𝑖 are of
everywhere for 𝑒¯𝑏 : 𝑒𝑏 excluding some pathological cases described Lebesgue measure zero so the convenient Lebesgue measure gives
in the Appendix Definition 13, 14 and 15. For example, a black uninteresting results on those sets. Thus we define “dilated" sets 𝐷𝑖𝑟
and white blob whose shape changes parametrically belongs to 𝑒𝑏 that expand regions around the discontinuities so discontinuities
(Figure 3 (b)). typically expand into nonzero measure regions as follows:
𝑒𝑐 represents the subset of programs whose discontinuities share Definition 4. The dilated 1D interval 𝑁 𝑟 (𝑥, 𝜽® ) is defined as
a similar constrains as 𝑒𝑏 , but with arbitrary continuous parts ex-
{(𝑥 ′, 𝜽® ) ∈ dom(𝑓 )) : 𝑥 ′ ∈ (𝑥 − 𝛽𝜖 ′, 𝑥 + 𝛼𝜖 ′ ) for 𝜖 ′ = 𝑟𝜖 𝑓 (𝑥, 𝜽® )}.
pressible by 𝑒𝑎 . Our gradient is also correct almost everywhere for
𝑒¯𝑐 : 𝑒𝑐 excluding pathological cases. For example, a blob whose color Definition 5. The dilated discontinuous set 𝐷𝑖𝑟 is defined as 𝐷𝑖𝑟 =
is rendered according to pixels’ distance to object boundary belongs 𝑁 𝑟 (𝑥, 𝜽® ).
Ð
∀(𝑥,𝜽® ) ∈𝐷
𝑖
to 𝑒𝑐 (Figure 3 (c)).
𝑒𝑑 is the entire set of programs expressible in our minimal DSL. Theorem 1. For our approximation, ∃𝜖 𝑓 (𝑥, 𝜽® ) > 0 such that for
Generally, we do not give any correctness guarantee for this set. all parameters 𝜃𝑖 , for every kernel size 𝜖 ∈ (0, 𝜖 𝑓 (𝑥, 𝜽® )), 𝑓 ∈ 𝑒𝑎 ∪ 𝑒¯𝑏 is
However, we can show empirically that gradient approximations in absolutely first-order correct on 𝐶𝑖 , and almost everywhere on 𝐶𝑖 for 𝑒¯𝑐 .
this set have low error for our quantitative metric in Section 7.3. For our approximation, ∀𝑟 ∈ (0, 1], ∃𝜖 𝑓 (𝑥, 𝜽® ) > 0, 𝜏 : 𝐷𝑖𝑟 → 𝐷𝑖 , such

5.3 First-order Correctness Results that for all parameters 𝜃𝑖 , for kernel size 𝜖𝑖𝑟 = 𝑟𝜖 𝑓 (𝜏 (𝑥, 𝜽® )), 𝑓 ∈ 𝑒¯𝑏 ∪𝑒¯𝑐
is relatively first-order correct almost everywhere in 𝐷𝑖𝑟 .
We begin by defining sets where 𝑓 is continuous and discontinuous
with respect to different parameters, then we “dilate" the discontin- Note that as 𝑟 varies in (0, 1], 𝜖𝑖𝑟 is proportional to 𝑟 , so the result
uous sets for reasons related to measure theory that we discuss, and on 𝐷𝑖𝑟 holds for a variety of kernel sizes. A proof sketch is presented
finally we show our theorem on the program sets 𝑒𝑎 , 𝑒¯𝑏 , 𝑒¯𝑐 , where in Appendix A. Our almost everywhere results on 𝐷𝑖𝑟 exclude mea-
𝑒¯𝑏 , 𝑒¯𝑐 excludes pathological cases from 𝑒𝑏 , 𝑒𝑐 respectively. sure zero sets of locations that have multiple discontinuities: we
show this in Appendix A Lemma 2 and 4. An alternative interpreta-
tion is that the discontinuous point sets in 𝐷𝑖 in general can have
Hausdorff dimension up to 𝑛 (and usually this dimension equals 𝑛),
but the subset of points where our rule is not relatively first-order
correct in 𝐷𝑖 have Hausdorff dimension strictly less than 𝑛, so not
first order correct points form a lower dimensional subset within
𝐷𝑖 (see Appendix A Lemma 3 and 4).

(a) Shader example for 𝑒𝑎 (b) Shader example for 𝑒𝑏 6 COMPILER DETAILS
As mentioned in Section 4.2, our compiler supports functions with
multidimensional outputs in R𝑘 such as for 𝑘 = 3 for shader pro-
grams that output RGB colors. Assuming we are optimizing a scalar
loss 𝐿, we implement the gradient in a single reverse pass for effi-
ciency by first computing the components 𝜕𝐿/𝜕𝑓 𝑖 of the Jacobian
matrix for each output component 𝑓 𝑖 of 𝑓 , and the backwards pass
(c) Shader example for 𝑒𝑐 (d) Shader example for 𝑒𝑑 simply accumulates (using addition) into 𝜕𝐿/𝜕𝑔 for each intermedi-
ate node 𝑔. Our implementation assumes the program is evaluated
Fig. 3. Shader examples that belong to different subsets of programs. over a regular grid, such as the pixel coordinate grid for shader

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
A𝛿 : Autodiff for Discontinuous Programs – Applied to Shaders • 135:7

programs. This allows small pre-filtering kernels that span between


the current and neighboring samples that can still catch small dis-
continuities so long as they show up when sampling the grid.
Since our gradient approximation only works with a single dis-
continuity in the local region, our compiler averages between two
smaller non-overlapping pre-filtering kernels to reduce the likeli-
hood that the single discontinuity assumption is violated. Specif- Fig. 4. Visualizing different options for how to combine multiple sampling
ically, we average the gradient between 𝑈 [−Δ𝑥, 0] and 𝑈 [0, Δ𝑥], axes in 2D. The green line demonstrates a discontinuity, and the blue re-
where Δ𝑥 is the sample spacing on the regular grid. This is similar to gion indicates evaluation locations where discontinuity can be sampled.
pre-filtering with 𝑈 [−Δ𝑥, Δ𝑥], but allows our compiler to correctly Naively choosing either the 𝑥 (a) or the 𝑦 axis (b) can result in the dis-
handle discontinuities whose frequency is below the Nyquist limit. continuity parallel to those axes being sampled at measure zero locations.
For example, if a discontinuous function in a shader program results For example, at the evaluation location indicated with a red square, each
in a one pixel speckle on the rendered image, our compiler can still method places additional samples (orange squares) to sample discontinu-
correctly account for discontinuities on both sides of the speckle. ities. Naively choosing the 𝑥 axis (a) fails because the discontinuity is parallel
to the kernel direction. Although naively choosing 𝑦 axis (b) succeeds, it
In the case of a single sampling axis, we would draw 3 samples
will fail if evaluated at the purple pentagon instead. Pre-filtering with a 2D
along the sampling axis for each location where the gradient is
kernel (c) allows robust sampling over the discontinuity, but the integra-
approximated. Furthermore, the samples may be shared between tion induces a computational burden. Our implementation (d) adaptively
neighboring locations on the regular grid. chooses from available axes, and ensures discontinuities in any orientation
Additionally, the compiler conservatively avoids incorrect ap- can be sampled with nonzero probability.
proximation due to multiple discontinuities: when multiple discon-
tinuities that are represented by different continuous functions are
sampled, the compiler nullifies the contribution from that location
6.2 Efficient Ternary Select Operator
by outputting zero gradient.
As can be seen in Section 4.2, in order to robustly handle discon-
tinuities, our multiplication rule places two samples on both ends
6.1 Combining Multiple Sampling Axes of the kernel support while AD just needs one. This leads to extra
It is common for multiple sampling axes to exist, either because the register usage in compiled GPU kernels that increases the likelihood
program is evaluated on a multi-dimensional grid (e.g. the 2D image of register spilling, which can result in slower run-times.
for our shader applications), or because no single axis can be used To alleviate the problem, our compiler always applies static anal-
to project every discontinuity. A natural question arises: how do ysis before multiplication, and switches to AD whenever both argu-
we extend Section 4.2 to handle multiple sampling axes? A naive ments are statically continuous. However, the ternary if or select
approach is arbitrarily choosing one of the axes, which risks ignoring operator can still be frequently expressed as multiplications of dis-
some discontinuities that are not projected to the chosen one. This continuous values using the minimum DSL syntax introduced in
may happen, for example, when the discontinuity is parallel to the Section 4.1. This is because the branching values themselves can
sampling axis (Figure 4(a)(b)). Another approach is to use a multi- be discontinuous, or the condition is a Boolean expression that
dimensional prefiltering kernel (Figure 4(c)). However, integrating needs to be expanded into multiplications of step functions using
against a Dirac delta in 𝑛-dimensions with 𝑛 > 1 typically results due De Morgan’s rule. Therefore, we introduce the ternary operator as
to the sifting property in a 𝑛 − 1 dimensional integration over the set an extended primitive to the DSL and design specialized optimiza-
where the Dirac delta’s argument is zero, which can be challenging tions so that differentiating it uses a similar number of registers to
but can be handled by additional sampling [Bangaru et al. 2021] or the AD rule, while allowing the first-order correctness property to
by recursively finding the intersection between discontinuity and stay the same as claimed in Section 5. See Appendix C for details.
the kernel support [Li et al. 2020].
In our implementation, for simplicity, we instead use a separate 7 ADAPTING OUR FRAMEWORK TO SHADERS
1D kernel for each sampling axis and combine gradient approxima- We apply our gradient approximations to procedural pixel shader
tions from different sampling axes afterwards. For each location, programs on optimization tasks that involve finding good parame-
we adaptively choose approximations from available sampling axes ters so the shader matches a target image. Parametric discontinuities
based on the following intuition: the chosen axis should ideally be are common in shaders to control object edges and shape, visibility,
the one that is closest to perpendicular to the discontinuity (Fig- and ordering.
ure 4(d)). This allows fixed-size small steps along the sampling axis Procedural shader programs are usually evaluated over a regu-
to have a larger probability of sampling the discontinuity. In practice, lar pixel grid, where the workload is embarrassingly parallel. Our
𝜕𝑐 |, and choose
for a discontinuity 𝐻 (𝑐), we quantify this feature as | 𝜕𝑥 compiler outputs a gradient program to two backends that both
the axis with the largest value. Because this term corresponds to support highly parallel compute on the GPU. The TensorFlow (TF)
the denominator in Equation 1, a larger value leads to approximate and PyTorch backends utilize the pre-compiled libraries that allow
gradients with smaller magnitude, therefore smaller variance. For for fast prototyping and debugging. The Halide backend on the
𝑛 sampling axes, our compiler draws 2𝑛 + 1 samples, which can be other hand, grants full control over the kernel scheduling, and can
potentially shared between neighboring locations. be orders of magnitude faster than TF and PyTorch provided a good

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
135:8 • Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein

schedule. Section 7.1 discusses details about our abstraction to the Table 3. Our compiler provides a simplified Halide scheduling space for a
Halide scheduling space, and an optional autoscheduler. program 𝑓 . For an explanation of each choice refer to Section 7.1.
Unlike other rendering pipelines, the procedural shader repre-
sentation allows programmers to abstract scenes using their own Name Option
judgement, not limited by the primitives provided by the system’s logged_trace List of intermediate nodes in 𝑓
API. For example, although DVG [Li et al. 2020] provides circles as cont_logged_trace subset of logged_trace
a basic primitive, in our program representation, a circle equation separate_cont {True, False}
can be easily modified to represent a parabola, but with the absence separate_sample {axis, kernel, None}
of a parabola primitive in DVG, users may resort to manually defin-
ing the shape through control points. To fully utilize the ease of
modification for program representations, we additionally provide layers, checkpointing a node indicates perfect separation to the
a third backend that outputs the original program (without gradi- computation before and after the checkpoint; but the equivalence
ent) encoded with the optimal parameters to GLSL. Users can then in our case would be a min-cut to the compute graph with variable
interactively modify or animate the program through editors such terminal nodes, which is NP-hard. It is nontrivial to adopt classic
as the one on shadertoy.com. graph cuts literature (e.g. [Boykov et al. 2001]) to this problem due
In the rest of this section, we first provide details regarding sched- to the interaction with the complex engineering of the lower-level
uling efficient gradient kernels for the Halide backend in Section 7.1. register allocator and the hardware register space limits.
Next, we discuss a random noise technique that helps the conver- Therefore, this section describes heuristic-based scheduling op-
gence of optimization tasks for shader applications. Finally, to quan- tions summarized in Table 3. Note this is only a subset of the entire
titatively evaluate the gradient approximation, we propose a metric scheduling space, but it is easier to understand and explore, and
based on the gradient theorem in Section 7.3. large enough to contain reasonable scheduling for every shader
program shown in this paper. For each intermediate values in the
forward pass, we make a decision on checkpointing vs recomputa-
7.1 Halide Scheduling tion and encode the list of checkpoint nodes in logged_trace. For
Because our compiler approximates the gradient of an arbitrary the space of splitting backward pass into sub-programs, we reduce
program in reverse-mode AD, the number of forward intermediate it to a combination of the discrete choices in Table 3. Because the
values can be arbitrary many. Unlike the work of Li et al.[2018b], gradient wrt non-Dirac parameters can be computed with AD, it
which focuses on efficiently scheduling convolutional scattering and usually results a smaller gradient kernel, which can be optionally
gathering operations, because procedural shader programs compute computed in a separate reverse-mode AD (separate_cont = True).
independently per pixel, our scheduling bottleneck is the limited Since this sub-program can be small, it may have extra register space
register count per thread. On one hand, assuming unlimited register for recomputation and save some memory I/O. Therefore, instead
space, inlining the entire program into one kernel without interme- of reading checkpointing values from logged_trace, the gradient to
diate checkpointing gives optimal performance as memory access is non-Dirac parameters reads checkpoints from cont_logged_trace
minimized. On the other hand, assuming unlimited memory, writing instead, which is a strict subset to logged_trace. The gradient wrt
to and reading from memory for every intermediate computation is Dirac parameters, however, is a combination of approximating the
equivalent to building the graph using pre-compiled libraries such gradient by pre-filtering four different kernels: a left and right ker-
as TensorFlow. Because memory bandwidth is limited, this can be nel on image coordinates 𝑥, 𝑦 respectively (Section 6). Therefore,
orders of magnitude slower than the first approach. Because register we can compute the gradient approximation to each sampling axis
space is limited on GPUs, naively adopting the first approach usually in a different sub-program (separate_sample = axis), or separate
results in register spilling, which can cause slowdowns. In principle, the approximation to each pre-filtering kernel into a different parts
register spilling can be avoided by instructing part of the program to (separate_sample = kernel).
be recomputed within a GPU kernel. However, in practice we do not Even after simplification, the scheduling space is still a combi-
have this level of control even within Halide because of the common natoric space that is too large to exhaustively sample. Therefore
sub-expression elimination (CSE) optimization pass by CUDA. To we provide an optional autoscheduler based on heuristic search.
work around this, our scheduling choice involves splitting the gra- To estimate register usage, we build an approximate linear cost
dient program into multiple smaller kernels to best utilize available model by counting the number of intermediate computation for
registers while minimizing the memory I/O. each node in the forward pass. Based on the model, we iteratively
There are two strategies to our scheduling space: each value com- add nodes to a potential checkpoint list based on greedy search: the
puted in the forward pass (original program) can trade-off between newly added nodes should approximately halve the current maxi-
recomputation (which can potentially lead to register spilling) or mum cost among all nodes. The cost model is updated according to
checkpointing (which can require extra memory I/O); and the back- the potential list before every iteration. The list is then combined
ward pass as a large graph can be split into multiple sub-programs. with discrete options in Table 3 to search for a schedule with best
The first strategy is analogous to the recomputation and memory profiled runtime. We first search the best discrete choice with a
consumption trade-off for back-propagation in neural networks default list of checkpointing: logged_trace = cont_logged_trace =
[Chen et al. 2016; Gruslys et al. 2016]. However, generalizing those every output from ray marching loops if the shader involves the
methods to arbitrary compute graph is hard: in sequential neural RaymarchingLoop primitive, and logged_trace = cont_logged_trace

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
A𝛿 : Autodiff for Discontinuous Programs – Applied to Shaders • 135:9

variables. Therefore, the width of the pre-filtering kernel is of the


form 𝑂 (𝜖) + 𝑂 (𝑠). Correspondingly, the error bound in Theorem 1
is changed from 𝑂 (𝜖) to 𝑂 (𝜖) + 𝑂 (𝑠): larger scale in the random
variable increases our approximation error, but as the scale goes to
0, the error becomes similar to that without the random noise.
(a) Random Z parameters (b) Random Dirac parameters One caveat is that the random variable can not be combined with
the RaymarchingLoop for implicitly defined geometry (Appendix D)
Fig. 5. Augmenting parameters with uniformly distributed random vari- because assumptions made in its gradient derivation can be violated
ables. This helps to sample discontinuities more often. In (a) we show only by random variables. We explain this further in Appendix D.4. In
augmenting random variables to parameters controlling Z order, which the optimization process, a separate noise scale associated with
better samples Z ordering discontinuities, and in (b) we show our default every Dirac parameter (except those RaymarchingLoop primitives
choice of augmenting random variables to all Dirac parameters, which also depend on) is tuned as well. At convergence, their values are usually
causes object boundary discontinuities to be sampled more frequently. optimized to be so small that the random noise is not be perceived
during rasterization.

= [] otherwise. With the best discrete combination, we further find 7.3 Quantitative Error Metric
the optimal logged_trace based on the potential checkpointing list, Although we mathematically showed the approximation error for
we search for the integer 𝑛 such that checkpointing every node some subsets of programs is 𝑂 (𝜖) (Section 5), we also wish to nu-
found before iteration 𝑛 in the greedy search gives the best runtime. merically evaluate the approximation made by our implementation.
cont_logged_trace simply chooses from the optimal logged_trace One possibility is to compute 𝐿 1 or 𝐿 2 distance between our gradient
or the default checkpointing list with best runtime. and finite difference [Li et al. 2018a]. However, a flaw with the 𝐿 1
or 𝐿 2 norm metrics is that the same delta distribution e.g. 𝛿 (𝑥) can
7.2 Introducing Random Variables to the Optimization be formed as the limit of many different distributions e.g. 𝐺𝜖 (𝑥)
As discussed before, our gradient approximation works when the and 𝐺 2𝜖 (𝑥), where we use 𝐺𝜎 (𝑥) to identify the distribution for the
discontinuity can be sampled along the sampling axes. The assump- Gaussian PDF wrt x with standard deviation 𝜎, and those distribu-
tion that it suffices to use only a 2D spatial grid for sampling axes tions have nonzero 𝐿 1 and 𝐿 2 distance between each other even as
may be incorrect, when the discontinuity makes a discrete choice 𝜖 → 0. This is undesirable because two different gradient rules that
and the rendered image is only exposed to one branch. For example, both give precisely correct derivatives as limits in the distribution
in Figure 3(d), when rings overlap, the output color corresponds to theory but with different limiting “shapes" (e.g. 𝐺𝜖 (𝑥) and 𝐺 2𝜖 (𝑥))
the ring with the largest Z value. The choice is always consistent for would wrongly be considered to have nonzero error. Additionally,
each overlapping region. As a result, the discontinuity generated by finite difference with finite step size and sample count introduces
comparing the Z value between two rings may not be sampled on its own approximation error (Section 8). Another possibility is to
the current image grid and the vanilla method is unable to optimize analytically write down the reference gradient. However, this re-
the Z values when the interlock pattern is wrong in Figure 3(d). quires substantial human effort and may not be feasible depending
We propose to solve this problem by introducing auxiliary ran- on the complexity of the program.
dom variables. Conceptually we can think of these as extending the We therefore avoid computing a reference gradient directly, and
sampling space where we look for discontinuities from only the 2D propose a quantitative metric according how much the gradient
spatial coordinates to a higher-dimensional space that includes 2D theorem is violated by the approximation. According to the gradient
spatial parameters and Dirac parameters. For each Dirac parameter theorem, the line integral through a gradient field can be evaluated
(with exception of those discussed in Section D.4), its value is aug- as the difference between the two endpoints of the curve. For ex-
mented by adding a per pixel uniformly independently distributed ample, assuming fixed pixel center, and with program parameters
random variable whose scale becomes another tunable parameter moving from 𝜽® 0 to 𝜽® 1 along a differentiable curve Θ, Equation 4
as well. For the ring example, adding random variables to the Z should always hold for the reference gradient, and any good ap-
values leads to speckling color in the overlapping region, as shown proximation should keep the difference between both sides of the
in Figure 5(a). Each pair of pixel neighbors with disagreeing color equation as small as possible. We refer to the two sides of the equa-
represents the discontinuity on different choices to the Z value com- tion as LHS (left hand side) and RHS (right hand side).
parison. Because the compiler does not have semantic information ∫
for each parameter, in practice we augment every Dirac parameter
∇𝑓 (𝜽® )𝑑 𝜽® = 𝑓 (𝜽® 1 ) − 𝑓 (𝜽® 0 ) (4)
with an associated random variable as in Figure 5(b). Instead of sam- Θ
pling discontinuities only at the ring contour, the random variable Additionally, because the gradient theorem requires a continu-
allows the discontinuity to be sampled at many more pixels. ously differentiable function, we need to pre-filter the discontinuous
Our gradient approximation also generalizes to the random variable setting. Instead of sampling along the image coordinate with regularly spaced samples, we now sample along a stochastic direction in the parameter space with sample spacing scaled by both the spacing 𝜖 on the image grid and the maximum scale 𝑠 among all random variables.

tions have nonzero 𝐿1 and 𝐿2 distance between each other even as 𝜖 → 0. This is undesirable because two different gradient rules that both give precisely correct derivatives as limits in distribution theory, but with different limiting "shapes" (e.g. 𝐺𝜖(𝑥) and 𝐺2𝜖(𝑥)), would wrongly be considered to have nonzero error. Additionally, finite difference with finite step size and sample count introduces its own approximation error (Section 8). Another possibility is to analytically write down the reference gradient. However, this requires substantial human effort and may not be feasible depending on the complexity of the program.

We therefore avoid computing a reference gradient directly, and instead propose a quantitative metric according to how much the gradient theorem is violated by the approximation. According to the gradient theorem, the line integral through a gradient field can be evaluated as the difference between the two endpoints of the curve. For example, assuming a fixed pixel center, and with program parameters moving from 𝜽₀ to 𝜽₁ along a differentiable curve Θ, Equation 4 should always hold for the reference gradient, and any good approximation should keep the difference between both sides of the equation as small as possible. We refer to the two sides of the equation as the LHS (left hand side) and RHS (right hand side).

    ∫_Θ ∇𝑓(𝜽) 𝑑𝜽 = 𝑓(𝜽₁) − 𝑓(𝜽₀)        (4)

Additionally, because the gradient theorem requires a continuously differentiable function, we need to pre-filter the discontinuous program 𝑓 before applying Equation 4. In practice, the desired pre-filter is an (𝑛+2)-dimensional box filter over the joint image × parameter space: 𝐾* = 𝑈[−1, 1]² × 𝑈[−𝜉, 𝜉]ⁿ. In image space, the 𝑈[−1, 1]² kernel smooths out a 3×3 region centered at the current pixel, and in the parameter space, 𝜉 is chosen such that the diagonal length of the 𝑛-dimensional box kernel is 1/10 that of the line integral. We keep the filter size relatively small to avoid extra sampling due to the curse of dimensionality. We use the L1 difference between the LHS and RHS in Equation 5 as our error metric.

    ∫_Θ ∇(𝐾 ∗ 𝑓)(𝜽) 𝑑𝜽 = 𝐾* ∗ 𝑓(𝜽₁) − 𝐾* ∗ 𝑓(𝜽₀)        (5)

The RHS is estimated by sampling the (𝑛+2)-dimensional box filter 𝐾* around 𝜽₀ and 𝜽₁, and the LHS is estimated by quadrature using the midpoint rule. Because different methods make different prefiltering assumptions internally, the kernel 𝐾 in the LHS's integrand is adapted for each method so that the combination of all prefilters results in the same desired target (𝑛+2)-dimensional box filter 𝐾* for each method; this is described in Appendix E.

We always evaluate the metric without random variables (Section 7.2) for computational reasons: random variables introduce an additional pre-filtering integral along all dimensions of the parameter space. Due to the curse of dimensionality, accurately evaluating the LHS and RHS in the case of random variables thus requires a very large number of samples.
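The following minimal numpy sketch illustrates how both sides of Equation 5 can be estimated. It is a simplified illustration under stated assumptions: f and grad_f are hypothetical callables for the (pixel-fixed) program and the gradient approximation under test, the curve Θ is taken to be a straight line, and a single shared box kernel stands in for the per-method kernel adaptation of Appendix E.

```python
import numpy as np

def gradient_theorem_error(f, grad_f, theta0, theta1, xi=0.01,
                           n_line=100, n_filter=1000, seed=0):
    # Sketch of the Equation 5 metric: |LHS - RHS| under a box pre-filter.
    rng = np.random.default_rng(seed)
    theta0 = np.asarray(theta0, float)
    theta1 = np.asarray(theta1, float)
    offsets = rng.uniform(-xi, xi, size=(n_filter, theta0.size))

    # RHS: K* convolved with f at both endpoints, by Monte Carlo sampling
    # of the box kernel around theta0 and theta1.
    rhs = (np.mean([f(theta1 + o) for o in offsets])
           - np.mean([f(theta0 + o) for o in offsets]))

    # LHS: line integral of the pre-filtered gradient approximation along
    # the straight curve from theta0 to theta1, by midpoint-rule quadrature.
    d_theta = (theta1 - theta0) / n_line
    lhs = 0.0
    for k in range(n_line):
        mid = theta0 + (k + 0.5) * d_theta     # midpoint of segment k
        g = np.mean([grad_f(mid + o) for o in offsets], axis=0)
        lhs += g @ d_theta
    return abs(lhs - rhs)
```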

Fig. 6. We compare ours, TEG [Bangaru et al. 2021] and finite difference (FD) using a circle and a rectangle shader. The quantitative errors are reported below as the mean error from all pixels, with baselines relative to ours. FD is evaluated at 1 sample per pixel and is chosen as the lowest error from among five different step sizes (10^−𝑖 for 𝑖 ∈ {1, 2, 3, 4, 5}).

Rendering   |  Ours         |  TEG    |  FD
Circle      |  9.27 × 10⁻⁵  |  2.0x   |  3.4x
Rectangle   |  1.68 × 10⁻⁴  |  0.80x  |  3.6x

Fig. 7. Comparing performance of ours and differentiable vector graphics [Li et al. 2020] for two optimization tasks: (a) Circle and (b) Ring. Each plot shows the convergence of 100 random restarts. The x axis reports wall clock time in seconds. The y axis reports log scale 𝐿2 error relative to the minimum error 𝐿𝑚𝑖𝑛 found over all restarts for both methods. Labels on the 𝑦 axis are base 10 exponentials: a label 𝑘 indicates an 𝐿2 error of 𝐿𝑚𝑖𝑛 · 10^𝑘. Each restart is reported as a transparent line, and the median error within all restarts (at a given time) is shown as the solid line. In the circle example, DVG converges with a slower runtime. In the ring example, DVG hardly converges because its gradient wrt the radius of the ring has a bug: see Appendix Figure 15 for details.
8 EVALUATION AND RESULTS

8.1 Simple Shader Comparison with Related Work
This section compares our method with TEG [Bangaru et al. 2021] and differentiable vector graphics [Li et al. 2020], using shader programs that are expressible in both their frameworks and ours.

8.1.1 Comparison with TEG. We compare with TEG [Bangaru et al. 2021] using two shaders: rectangle and circle. The discontinuities in the rectangle shader are affine, and can be automatically handled by TEG. For the circle shader, we need to manually apply a Cartesian-to-polar coordinate conversion to differentiate in TEG.

We evaluate ours, TEG and finite difference (FD) using our error metric (Section 7.3) and report the results in Figure 6. Because TEG can correctly differentiate both shaders, its low quantitative error is expected, and is solely caused by the approximation that TEG makes when using its quadrature rule for pre-filtering the 2D kernel. There are three possible sources of error for ours: sampling error for the 2D prefiltering; first-order residual error in our gradient approximation (the 𝑂(𝜖) term in Definitions 1 and 2); and inaccurate approximation when our single discontinuity assumption is violated (e.g. at the four corners of the rectangle). Nevertheless, TEG's error is 2.0x and 0.80x compared to ours in the two examples, indicating that the approximations our method introduces cause minimal error compared to the sampling error of the prefiltering.

We use TEG's CPU implementation, as running its gradient program on the GPU requires manually writing extra CUDA kernels. Therefore we do not compare with TEG on optimization tasks similar to those of Section 8.2, because rasterization on the CPU is slow.

8.1.2 Comparison with Differentiable Vector Graphics. We compare with differentiable vector graphics (DVG) [Li et al. 2020] using two shaders: circle and ring. Both of them are expressed as a circle primitive in DVG, but the ring has a specified stroke width and a blank fill color. Because DVG is integrated into PyTorch and does not provide an API for efficiently extracting a per-pixel gradient map for arbitrary parameters, our comparison focuses on performance in a gradient-based optimization task.

The reference image for both shaders is rendered with a slightly eccentric, antialiased ellipse such that neither shader can reproduce the target with perfect pixel accuracy. We use an L2 image loss and optimize for 100 iterations for each of 100 randomly sampled initializations. Because our method reuses samples from neighboring pixels to sample the discontinuity, the actual number of samples computed is 1 per pixel. However, we find that using 1 sample per pixel for DVG generates inaccurate gradients even for continuous parameters such as color. Therefore, we use 2 × 2 samples instead.

For the circle shader (Figure 7(a)), both ours and DVG easily converge to images that are similar to the reference image (Appendix Figure 14), but DVG is much slower. For the ring shader, however, DVG fails to converge for most of the restarts (Figure 7(b)). We suspect the non-convergence is caused by a bug in their code that generates incorrect gradients (Appendix Figure 15). However, even disregarding the bug, DVG is slower than our method and cannot handle parametric Z ordering as in Figure 10.

8.2 Optimizing to Match Illustrations in the Wild
In this section, we demonstrate that our differentiation method robustly applies to a variety of shaders that can be used to match illustrations directly found on the Internet. These shader programs have complexity similar to those found on shadertoy.com, and are usually designed using a programmer's abstraction of the scene. As a result, animating and modifying the program representation is easier because program components as well as parameters have semantic meaning. In general, programs can be written with arbitrary branching compositions. This is in contrast to specialized rendering pipelines, such as using splines to represent 2D scenes and triangle meshes for 3D. Their expressiveness comes from the massive number of parameters rather than the structure of the program. Therefore, specialized rules can be developed to differentiate vector graphics or path tracers, as different parameter values still lead to similar discontinuity patterns. However, it is hard to manually animate or control appearance using the parameters in such pipelines, because there are hundreds or thousands of parameters and they typically do not have attached semantic meaning.

In this section, we optimize shader parameters to match the rendering output to target images. All target images presented in this section are directly downloaded from the Web, with the only modifications being resizing and converting RGBA to RGB. We use a multi-scale L2 loss. To avoid local minima, the loss objective alternates from the lowest-resolution L2 to the sum of every L2 over a pyramid up to resolution 𝑁, until 𝑁 reaches the rendering resolution. This loss alternation is repeated 5 times within 2000 iterations. To further aid convergence, we add a uniformly distributed random variable (Section 7.2) to every Dirac parameter that is not dependent on ray marching. The scales of the random variables are also optimized: upon convergence, their values are usually close to zero.
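A minimal numpy sketch of one way such an alternating pyramid objective could be computed is shown below; the helper names and the assumption of power-of-two (H, W, C) images are illustrative, not our actual implementation. During optimization, n_levels would be cycled from coarse to fine to realize the alternation described above.

```python
import numpy as np

def downsample(img):
    # 2x box downsample (average pooling); assumes even H and W.
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return img[:2*h, :2*w].reshape(h, 2, w, 2, -1).mean(axis=(1, 3))

def multiscale_l2(render, target, n_levels):
    # Sum of L2 losses over a pyramid, keeping only the coarsest
    # n_levels levels; n_levels grows until full resolution is included.
    levels = []
    r, t = render, target
    while min(r.shape[0], r.shape[1]) > 1:
        levels.append((r, t))
        r, t = downsample(r), downsample(t)
    loss = 0.0
    for r, t in levels[::-1][:n_levels]:   # coarsest-first
        loss += np.mean((r - t) ** 2)
    return loss
```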

For each optimization task, we restart from 100 random initializations with 2000 iterations per restart. To analyze convergence properties, we say that a restart "succeeds" if it converges to an error lower than twice the minimum error found in all restarts by the default method: ours with random variables. We additionally report an ablation without the random variables. The success threshold is plotted as the grey horizontal line in Figure 8. Based on this definition of success, we compute two metrics: median success time and expected time to success, and report these in Table 4. The median success time is the median time taken for a restart to reach the success threshold across all restarts that succeed. These values are plotted as colored circles on the grey line in Figure 8. Expected time to success, on the other hand, estimates, given sufficient restarts, the mean time until the optimization finally converges. It is computed by repeatedly sampling with replacement from the 100 restarts and accumulating their runtime until a sampled restart converges.
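The following numpy sketch shows this bootstrap computation directly; the array names are illustrative and it assumes at least one restart succeeded.

```python
import numpy as np

def expected_time_to_success(runtimes, succeeded, n_boot=10000, seed=0):
    # runtimes:  per-restart wall-clock times
    # succeeded: per-restart booleans (at least one must be True)
    rng = np.random.default_rng(seed)
    n = len(runtimes)
    totals = []
    for _ in range(n_boot):
        total = 0.0
        while True:                 # sample restarts with replacement
            i = rng.integers(n)
            total += runtimes[i]    # accumulate runtime...
            if succeeded[i]:        # ...until a sampled restart converges
                break
        totals.append(total)
    return float(np.mean(totals))
```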
As we discussed before, because of the arbitrary composition patterns present in the shader programs, they cannot be differentiated using other state-of-the-art differentiable renderers. Therefore, in this section we compare our method with finite difference and its stochastic variant SPSA [Spall 1992]. To best benefit the baselines, we run the optimization task with ten variants of finite difference: with or without random variables, combined with different step sizes (10^−𝑖 for 𝑖 = 1, 2, 3, 4, 5), and denote the one with minimum error across all restarts as FD∗. Similarly, our SPSA∗ baseline chooses from thirty SPSA variants by least error. The 30 variants include a combination of 5 different step sizes similar to FD∗, two choices for the number of samples per iteration (1 vs. half of the number of parameters), and three choices for the optimization process (with or without random variables as in FD∗, or a vanilla variant that removes the loss objective alternation and randomness). Note that because low-sample-count SPSA runs faster, we scale up its number of iterations accordingly so that it runs at least as long as our method. We also experimented with AD and zeroth-order optimization (Nelder-Mead and Powell). AD hardly optimizes the parameters because most of our shaders have little to no continuous cues for optimization. Zeroth-order methods have problems searching in high dimensions, and never succeed according to our criterion under the same time budget. We therefore did not report these results.
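For reference, below is a minimal numpy sketch of the SPSA gradient estimate [Spall 1992] underlying this baseline family; the function signature is illustrative. Each sample perturbs all parameters simultaneously along a random ±1 direction, so the cost per sample is two loss evaluations regardless of the number of parameters.

```python
import numpy as np

def spsa_gradient(loss, theta, step, n_samples=1, rng=None):
    # theta: numpy array of parameters; loss: callable theta -> float.
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        diff = loss(theta + step * delta) - loss(theta - step * delta)
        grad += diff / (2.0 * step) * delta   # for +/-1 entries, 1/delta == delta
    return grad / n_samples
```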

Fig. 8. Runtime-error plots for five optimization tasks, (a) Olympic Rings, (b) Celtic Knot, (c) SIGGRAPH, (d) TF RayMarch, and (e) TF RayCast, comparing Ours (with and without random variables) with FD∗ and SPSA∗, using 100 restarts with random initialization and plot axes as in Figure 7. Note we select FD∗ and SPSA∗ based on minimal error from among 10 and 30 different variants, as discussed in Section 8.2. Grey horizontal lines denote the success threshold, while circles on grey lines mark the median success time as in Table 4.

8.2.1 Shader: Olympic Rings. We design a shader to express the Olympic logo shown in Figure 10(a)-top. The shader uses more rings than are necessary (10) to avoid being stuck in a local minimum. Each ring is parameterized by its location, inner and outer radius, and color. Because the rings are interlocking, each ring is slightly tilted vertically such that the Z value parametrically depends on the pixel's relative distance to the ring center. Unlike DVG, which requires a single Z value per shape, as is typical for illustrative workflows, our method allows users to define a parametric Z ordering and optimizes it automatically. Additionally, because of the interlocking pattern, the shader cannot be easily expressed using the circle primitive in DVG, as each primitive has a user-defined constant Z order.
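A minimal sketch of such a parametric Z value is shown below, assuming a hypothetical per-ring helper; at each pixel, the ring with the largest Z among the overlapping rings wins the color comparison.

```python
def ring_z(pixel_y, center_y, z_center, tilt):
    # Sketch: depth varies linearly with the pixel's vertical offset from
    # the ring center, so two slightly tilted rings can alternate which
    # one is in front, producing the interlocking pattern.
    return z_center + tilt * (pixel_y - center_y)
```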

We run the optimization task with 100 restarts and report our result with the minimum error as Figure 10 (Optimization), which is almost pixel-wise identical to the target. We also report the convergence across all restarts for Ours, FD∗, and SPSA∗ in Figure 8(a). For the majority of restarts, both ours and FD∗ converge to a low error, but FD∗ requires an order of magnitude longer runtime. Additional convergence metrics are reported in Table 4. For both metrics, ours is faster than the baselines by an order of magnitude. We also optimize the same target using a simpler shader with 5 rings. The median success time and expected time to success for ours with 5 rings are 0.5x and 1.6x those of ours with 10 rings, respectively. This is because the simpler shader has faster runtime, but converges less often. We additionally quantitatively evaluate the gradient approximations using our error metric (Section 7.3) and show the results in Figure 9. Ours outperforms both baselines by a large margin.

After the optimization, the compiler outputs the shader program with optimized parameters to a GLSL backend, which allows interactive editing on platforms such as shadertoy.com. Because the shader program renders extra primitives to avoid local minima, we apply an EliminateIfUnused() program annotation outside the computation for each ring so the compiler can automatically use a technique akin to dead code elimination to remove unused rings from the output GLSL code. During optimization, both forward and backward programs emit the computations within EliminateIfUnused(). Upon convergence, the compiler progressively removes any computation in EliminateIfUnused() if doing so does not increase the optimization error. In the Olympic rings example, this refers to extra rings that are either pushed outside of the image or to the back of the image with the same color as the background. The pruned compute graph is then output to the GLSL backend, and includes only code and parameters for the five rings visible in the rendering. Figure 10 (Modified) shows an example interactive edit for the GLSL program: we decrease the spacing between rings and thicken them. The interactive edit simply requires modifying the corresponding parameter values in the program representation. However, if the target image is represented by multiple filled shapes, editing it to the modified position requires many tedious manual changes, such as editing the control points for each shape and adding new shapes. Adding new shapes may be necessary because editors typically only support a single Z value per shape, and the decreased ring spacing introduces more disconnected regions, such as the small black region inside the blue ring.
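A minimal Python sketch of this post-convergence pruning logic is shown below, with hypothetical blocks and error callables standing in for the compiler's compute-graph machinery.

```python
def eliminate_if_unused(blocks, error, tol=0.0):
    # Sketch: progressively drop an annotated block if removing it does
    # not increase the optimization error (akin to dead code elimination).
    kept = list(blocks)
    for b in list(kept):
        trial = [k for k in kept if k is not b]
        if error(trial) <= error(kept) + tol:
            kept = trial
    return kept
```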
Table 4. Time metrics (in seconds) comparing how fast ours, ours without random variables (O/wo), and the baselines converge, as discussed in Section 8.2. The symbol × indicates the method never succeeded in any restart.

Shader         |  Med. Success Time            |  Exp. Time to Success
               |  Ours   O/wo   FD∗    SPSA∗   |  Ours   O/wo   FD∗     SPSA∗
Olympic Rings  |  1.4    0.9    13.8   13.8    |  5.0    19.1   342.4   252.7
Celtic Knot    |  2.3    ×      ×      ×       |  17.3   ×      ×       ×
SIGGRAPH       |  2.6    1.8    34.9   71.4    |  6.1    6.2    86.5    247.8
TF RayMarch    |  0.7    0.7    ×      ×       |  40.3   40.3   ×       ×
TF RayCast     |  4.3    2.2    31.4   ×       |  13.9   9.0    2187.8  ×

8.2.2 Shader: Celtic Knot. We modify the ring shader from Section 8.2.1 to match the target image for Celtic Knot in Figure 10(a). The Z ordering of the rings is parameterized similarly, to correctly reconstruct the interlocking pattern, but instead of rendering colored rings, the shader renders black at the edge of the ring with a parametric stroke width, and white elsewhere.

The black and white target image [Alexander Panasovsky 2018] brings an extra challenge to the optimization task, because the shader can no longer rely on color hue to match the target, and must instead use gradients only from discontinuities. This limits the number of pixels that contribute to the gradient, as discontinuities are only sampled at a sparse set of pixels. Additionally, when the rendering and the target image are poorly aligned, the majority of the gradient contribution is quite noisy, which causes the optimization landscape to be almost flat except for a small neighborhood around the minimum. The problem is alleviated by the random variables discussed in Section 7.2. By randomly perturbing the parameter values, we generate a fuzzy rendering that greatly increases the number of pixels with differently branched neighbors, which permits our method to sample discontinuities more frequently.

Our optimization result is reported in Figure 10. It correctly locates the rings, and correctly models the interlocking pattern. In the convergence plot in Figure 8(b), ours converges at a lower rate than in (a) because of the optimization challenges discussed, but significantly outperforms ours without random variables, FD∗ and SPSA∗, which do not converge at all. This is also reflected in Table 4. To confirm that the scales of the random variables always converge to zero, and that the lower convergence rate is due to the flat optimization landscape, we run an additional optimization task for Ours where the parameters are initialized at their optimal position with the same random variable initialization as in Figure 8(b). In all 100 restarts, the scale of the random variables always converges close to 0, such that their effects do not influence the rasterization, and the parameters stay optimal to within 1% of the minimum error. Because SPSA∗ is stochastic and does not always follow the gradient direction, its poor performance on an almost flat landscape is expected. FD∗, on the other hand, is unable to find a suitable step size: a small step can fail to sample discontinuities, especially in the presence of the random variables, but a large step approximates the gradient inaccurately and therefore, similarly to SPSA∗, works poorly on the almost flat landscape with or without random variables. Figure 9 quantitatively evaluates the gradient approximation between ours and the baselines at the optimized parameters. Even when the random variables are not present, FD∗ still has a higher error than ours.
Fig. 9. We quantitatively evaluate the gradient approximation between ours and baselines using the error metric described in Section 7.3. At each row, we evaluate the metric over a scalar function that sums up the pixel values in all three color channels for the corresponding shader program in Figures 1, 10 and 11, using the optimized parameters. Below each method we report the mean error from all pixels. For baselines we report relative error compared to ours. The finite difference (FD) baseline is always evaluated at 1 sample per pixel because FD is always slower than ours in the optimization. For SPSA, the number of samples is chosen so that its runtime per iteration in the optimization task is comparable to ours. Similar to Figure 6, FD and SPSA are chosen as the lowest error from among five different step sizes (10^−𝑖 for 𝑖 ∈ {1, 2, 3, 4, 5}).

Shader       |  Ours         |  FD    |  SPSA
Olympic      |  9.75 × 10⁻⁵  |  1.8x  |  6.3x
Celtic Knot  |  3.73 × 10⁻⁴  |  1.5x  |  3.9x
SIGGRAPH     |  4.58 × 10⁻⁵  |  2.5x  |  7.5x
TF RayMarch  |  4.88 × 10⁻⁵  |  3.2x  |  12.8x
TF RayCast   |  3.90 × 10⁻⁵  |  1.5x  |  3.1x

Fig. 10. Target (a), optimization (b), and modified (c) results for three shaders discussed in Section 8.2: Olympic Rings, Celtic Knot and TF RayMarch.

Fig. 11. Optimization (a) and modifications (b, c) for the shader TF RayCast, using the target from Figure 10(a). Section 8.2.5 discusses the differences between TF RayCast and TF RayMarch.

Because the compiler-generated code is less readable than manually written programs, we add an Animate() construct to the DSL to facilitate easier interactive edits and animations of the output GLSL shader. The Animate() construct indicates which input variables a programmer who is animating or editing might wish to read and which output variables might be modified, and inserts an empty Animate() function within the GLSL code with input and output variables correctly referenced. That function's body can then be easily edited interactively, essentially by performing variable substitutions, or used to produce animations. Figure 10 shows an example of coloring the optimized Celtic Knot shader. With our framework, we simply use the Animate() method in GLSL to access the pixel's relative position within each ring, and modify the color values accordingly. The interlocking is automatically handled by the optimized Z ordering. Such an edit can be more cumbersome if directly editing the target image in Photoshop or Illustrator, because users need to manually mask out disconnected regions caused by the overlapping.

8.2.3 Shader: SIGGRAPH. In this section, we explore a 3D shader that can be used to reconstruct the SIGGRAPH logo as in Figure 1(a). Our shader is adapted from the shader "SIGGRAPH logo" by Inigo Quilez on shadertoy.com. The original shadertoy program is hand-designed with manually picked parameters to best match the target image. However, the rendered output (Figure 1(b)) is still very different from the target. We modified the original shader so that the geometry and lighting model are closer to the target image. Each half of the geometry is represented by the intersection between a sphere and a half-space, from which is subtracted an ellipsoidal cone shape whose apex is at the camera location. The ray intersections are approximated using sphere tracing with 64 iterations. Each half is further parameterized with different ambient colors, and lit separately by a parametric directional light and another point light.

Directly differentiating through the raymarching loop can result in a long gradient tape, because the number of loop iterations can be arbitrarily large. As an alternative, we bypass the root-finding process and directly approximate the gradient using the implicit function theorem. We extend the DSL with a RaymarchingLoop primitive, and develop a specialized gradient rule motivated by [Yariv et al. 2020] (Appendix D). One caveat is that the specialized gradient cannot be combined with random variables, as the assumptions made for the gradient derivation can be violated by the randomness. Therefore, we do not associate random variables with any parameters that the RaymarchingLoop primitives depend on.
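A minimal numpy sketch of this idea follows, with hypothetical SDF and gradient callables; it is a simplified illustration of the implicit-function-theorem shortcut, not our specialized rule from Appendix D.

```python
import numpy as np

def sphere_trace(sdf, origin, direction, n_iter=64):
    # Fixed-iteration sphere tracing [Hart 1996]: march by the SDF value.
    t = 0.0
    for _ in range(n_iter):
        t += sdf(origin + t * direction)
    return t

def dt_dtheta(grad_theta, grad_p, direction, p_star):
    # Implicit-function-theorem gradient at the converged intersection:
    # with g(o + t*d, theta) = 0 at p_star, dt/dtheta =
    # -(dg/dtheta) / (dg/dp . d), so no tape through the loop is needed.
    return -grad_theta(p_star) / (grad_p(p_star) @ direction)
```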
Our optimization is almost identical to the target image and is reported in Figure 1(c). FD∗ can also achieve a similarly low error, but because its runtime scales with the number of parameters, it converges slower than ours by an order of magnitude, as reported in Table 4 and Figure 8(c). The quantitative error metric for the gradient approximation is reported in Figure 9. Note that because this shader renders the 3D geometry using sphere tracing, it is differentiated using rules from Appendix D, whose accuracy depends on that of the iterative sphere tracer as well. Nevertheless, ours still has lower error by a large margin compared to the baselines.

Because parameters and program components have semantic meaning, this opens many more editing possibilities than we have for the original image. For example, we can similarly insert an Animate() primitive as in Section 8.2.2 and add Perlin noise [Perlin 2002] bump mapping to the geometry (Figure 1(e)). As an alternative, we can also utilize the displacement mapping and lighting model authored by Inigo Quilez in the original SIGGRAPH shader (Figure 1(b)), but use optimized parameters to make its geometry similar to the target image. To do this, we modify Inigo Quilez's shader in GLSL so that its geometry and camera model are compatible with our parameterization, and paste the optimized parameters into the new shader. The hybrid modification is shown in Figure 1(f).

8.2.4 Shader: TF RayMarch. Similar to Section 8.2.3, we author a 3D ray-marching shader that can be used to reconstruct the TensorFlow logo shown in Figure 10(a). The shader uses a 64-iteration sphere tracing loop to approximate the union of 4 boxes whose positions are constrained based on our observations of the target image, such as that they should always be connected to each other by some particular surfaces. The scene is then shaded by a parametric ambient color and directional color lights.
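For illustration, a minimal numpy sketch of the scene's signed distance function as a union of boxes is shown below; the box distance follows the standard construction (cf. [Inigo Quilez 2021]), while the names and parameterization are illustrative assumptions.

```python
import numpy as np

def box_sdf(p, center, half_size):
    # Signed distance from point p to an axis-aligned box.
    q = np.abs(p - center) - half_size
    return np.linalg.norm(np.maximum(q, 0.0)) + min(float(q.max()), 0.0)

def scene_sdf(p, boxes):
    # Union of primitives = pointwise min of their signed distances.
    return min(box_sdf(p, c, h) for c, h in boxes)
```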
Our optimization result is shown in Figure 10 along with a modified novel view rendering. The convergence and quantitative error comparisons are reported in Figures 8(d) and 9, respectively.

8.2.5 Shader: TF RayCast. In this section, we discuss an alternative program representation that directly uses ray-box intersection to attempt to match the same TensorFlow target image as in Section 8.2.4. All other components stay the same as in the TF RayMarch variant. Unlike the raymarching loop, which is differentiated using the special rules discussed in Appendix D, ray casting shader programs are differentiated using the general gradient rules in Section 4.2, and so random variables can be applied to every Dirac parameter. As can be seen in Figure 8(e), ours both with and without random variables frequently converges, but for this shader the no-random variant benefits more from the faster runtime and the less noisy gradient approximation. FD∗ does not converge as well: it struggles to find a variant that is both accurate and able to sample discontinuities. But optimizing the TF logo is easier, because the geometry is colored, so FD∗ still converges for a few restarts. Due to its stochastic nature, SPSA∗ is less likely to be stuck at a local minimum, but it trades this for lower accuracy, and so is unable to achieve a low enough error. In Figure 11 we show our optimization result (a), and two novel view renderings where the letters T (b) and F (c) are recognizable.

8.3 Beyond Optimizing a Single Image
This section demonstrates that the gradient of discontinuous programs can also be applied to other optimization tasks.

Fig. 12. Optimizations (below) matching animation key frames (above) for two knot tying examples: Knot A, Key 1; Knot A, Key 2; Knot B, Key 1; Knot B, Key 2. We manually pick target key frames from the original animation, as described in Section 8.3.1.

8.3.1 Optimizing a Knot Tying Animation. This section explores the possibility of using a program representation to reconstruct an animation sequence. Specifically, we experiment with the knot tying animations shown in Figure 12. The animations for knot tying tutorials are generated by painstakingly specifying every single frame of the animation manually, where each frame is expressed by a combination of filled shapes or splines with no semantic relation to each other. To extract the semantic meaning encoded in the animation, we design a rope shader whose position is modeled by a 2D quadratic Bezier spline. Because ropes can overlap themselves, their depth information is encoded as a quadratic Bezier spline as well.

In our implementation, we manually pick and optimize key frames in the animation sequence and increment the number of Bezier segments by 1 for each key frame. Figure 12 demonstrates our reconstruction of two key frames for each of the two different knots. Experiment details are described in Appendix G. Instead of drawing additional frames in between, we can easily render a smooth animation by interpolating control points for the rope. These animations are shown in our supplemental video.
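A minimal numpy sketch of the rope evaluation and the key-frame in-betweening is shown below, with illustrative array shapes; it is not our actual shader implementation.

```python
import numpy as np

def bezier_chain(ctrl, t):
    # Evaluate a chain of quadratic Bezier segments at t in [0, 1].
    # ctrl: (n_seg, 3, d) control points; d=2 for rope position,
    # d=1 for the depth spline.
    n_seg = ctrl.shape[0]
    s = min(int(t * n_seg), n_seg - 1)   # which segment
    u = t * n_seg - s                    # local parameter in [0, 1]
    p0, p1, p2 = ctrl[s]
    return (1 - u)**2 * p0 + 2 * (1 - u) * u * p1 + u**2 * p2

def inbetween(ctrl_a, ctrl_b, alpha):
    # In-between frame: linearly interpolate optimized key-frame controls.
    return (1 - alpha) * ctrl_a + alpha * ctrl_b
```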
8.3.2 Optimizing a Path Planning Task. In this section, we show that our gradient approximation can also be applied to other domains, such as path planning in robotics. The optimization task is to solve for a 2D trajectory that allows an object to travel from a start position to the target position without hitting any of the obstacles.

We model the 2D trajectory by initializing the object with zero velocity, with 10 segments of piece-wise constant acceleration from time 0. Each constant piece of acceleration is parameterized by its x, y components and duration. The velocity and position can then be represented in piece-wise linear and quadratic forms, respectively. An L2 loss encourages the position at the end of the last segment to be close to the target position. We model the obstacles as a discontinuous repulsion field with a constant large value inside the obstacles and 0 everywhere else. A configuration is penalized by the integral of the field along the trajectory. The penalty term can be further reparameterized as an integral over time, and is approximated by quadrature using 10 samples per constant acceleration segment. We additionally add a third loss that minimizes the integral of the magnitude of the acceleration over time, which is proportional to the total fuel consumed assuming constant specific impulse.
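The following minimal numpy sketch assembles these three loss terms under illustrative names; the obstacle predicate stands in for the discontinuous repulsion field, and the helper is a simplified illustration rather than our actual objective code.

```python
import numpy as np

def trajectory_loss(accels, durations, start, target, inside_obstacle,
                    n_quad=10, penalty=1e3):
    # accels:    (10, 2) piece-wise constant accelerations
    # durations: (10,) duration of each constant piece
    # inside_obstacle: predicate p -> bool (discontinuous repulsion field)
    pos, vel = np.array(start, float), np.zeros(2)
    field_integral, fuel = 0.0, 0.0
    for a, dt in zip(np.asarray(accels, float), durations):
        # midpoint quadrature of the repulsion field along this segment
        for k in range(n_quad):
            t = (k + 0.5) * dt / n_quad
            p = pos + vel * t + 0.5 * a * t**2        # quadratic in t
            field_integral += penalty * float(inside_obstacle(p)) * dt / n_quad
        fuel += np.linalg.norm(a) * dt                # integral of |a| over time
        pos += vel * dt + 0.5 * a * dt**2
        vel += a * dt
    endpoint = np.sum((pos - np.array(target, float))**2)
    return endpoint + field_integral + fuel
```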
Because our gradient approximations are implemented specifically for shader programs, for this task we manually implement the gradient program in numpy using the syntax and rules described in Section 4. We report the optimization result in Figure 13. While both ours and FD* find a desirable path, FD* is significantly slower (17x).

Fig. 13. Path planning task, with panels (a) Init, (b) Ours, (c) FD* (17x), and (d) AD (0.9x): obstacles we wish to avoid are shaded as black regions, and the starting and desired ending positions of the path are denoted as red and green crosses, respectively. The path is initialized as a straight line between them (a). Both Ours (b) and FD* (c) can find a desirable path, but AD (d) suffers from avoiding the discontinuous repulsion field. Similar to Section 8.2, FD* is chosen from the five variants with different step sizes. We report runtime relative to ours for FD* and AD. While AD has similar runtime, FD* is significantly slower.

9 LIMITATIONS, FUTURE WORK, AND CONCLUSION
Our method has a number of limitations, which offer potential avenues for future work. First, our application to shaders requires a programmer who is sufficiently skilled in writing shaders so that the given shader for some parameter setting can approximate the target image. We imagine that future work might broaden the scope of applicability to non-programmers by setting up a 2.5D or 3D workspace where primitives can be placed and properties can be assigned to them, such as interior and edge color, visibility, fronto-parallel or planar depth in 2.5D, or geometric properties and relationships in 3D such as radius, abutment, symmetry, or CSG operations. Then the user could optimize user-chosen subsets of these properties to produce different designs.

Additionally, our current heuristic-based Halide auto-scheduler may not generalize to more complicated shader programs. For example, an iterative loop without a specialized gradient approximation can generate a very long tape, which can cause trouble for efficient scheduling. Future research can either establish better cost models and checkpointing strategies for better auto-scheduling, or develop GPU kernels that can smartly trade off between register usage and recomputation.

Further, our specialized gradient approximation for implicit geometry cannot be combined with random variables. This can result in undesirable local minima for optimization applications, such as when one object is entirely occluded by another. We believe a different specialized rule may be designed based on volumetric rendering, so that both foreground and background objects are involved in the differentiation.

Fourth, our gradient works under the assumption of a single discontinuity. For programs sampled on a regular grid, the number of samples that violate this assumption is inversely proportional to the grid's sampling frequency. Future work could explore adaptive sampling along the sampling axis to further increase the likelihood of a single discontinuity. Finally, although in our implementation the sampling axis is independent from the parameter space, we imagine our gradient rules could further be generalized to an arbitrary choice of sampling axis, such as a linear combination of the parameters.

In conclusion, this paper proposes an efficient compiler-based approach to approximately differentiate arbitrary discontinuous programs in an extended reverse-mode AD. We validate our approximation both using theoretical error bounds and quantitative error metrics. We demonstrate a useful application to interactive editing and animating illustrations represented as shader programs.

ACKNOWLEDGMENTS
Thanks to Boris Mityagin for elucidating details of his proof technique [Mityagin 2020], which we followed in Appendix Lemma 3. TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc. The Olympic rings are the exclusive property of the International Olympic Committee (IOC).

REFERENCES
Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. 2019. Learning to Optimize Halide with Tree Search and Random Programs. ACM Trans. Graph. 38, 4, Article 121 (2019), 12 pages. https://doi.org/10.1145/3306346.3322967
Alexander Panasovsky. 2018. Celtic. https://thenounproject.com/icon/celtic-1975448/.
Sai Bangaru, Tzu-Mao Li, and Frédo Durand. 2020. Unbiased Warped-Area Sampling for Differentiable Rendering. ACM Trans. Graph. 39, 6 (2020), 245:1–245:18.
Sai Bangaru, Jesse Michel, Kevin Mu, Gilbert Bernstein, Tzu-Mao Li, and Jonathan Ragan-Kelley. 2021. Systematically Differentiating Parametric Discontinuities. ACM Trans. Graph. 40, 107 (2021), 107:1–107:17.
Yuri Boykov, Olga Veksler, and Ramin Zabih. 2001. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 11 (2001), 1222–1239.
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. CoRR abs/1604.06174 (2016). arXiv:1604.06174 http://arxiv.org/abs/1604.06174
Pau Gargallo, Emmanuel Prados, and Peter Sturm. 2007. Minimizing the Reprojection Error in Surface Reconstruction from Images. In 2007 IEEE 11th International Conference on Computer Vision. 1–8. https://doi.org/10.1109/ICCV.2007.4409003
Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. Memory-Efficient Backpropagation Through Time. CoRR abs/1606.03401 (2016). arXiv:1606.03401 http://arxiv.org/abs/1606.03401
John C Hart. 1996. Sphere tracing: A geometric method for the antialiased ray tracing of implicit surfaces. The Visual Computer 12, 10 (1996), 527–545.
Inigo Quilez. 2021. Distance Functions. https://www.iquilezles.org/www/articles/distfunctions/distfunctions.htm.
Chiyu Max Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas A. Funkhouser. 2020. Local Implicit Grid Representations for 3D Scenes. CoRR abs/2003.08981 (2020). arXiv:2003.08981 https://arxiv.org/abs/2003.08981
Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. 2018a. Differentiable Monte Carlo Ray Tracing through Edge Sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37, 6 (2018), 222:1–222:11.
Tzu-Mao Li, Michaël Gharbi, Andrew Adams, Frédo Durand, and Jonathan Ragan-Kelley. 2018b. Differentiable programming for image processing and deep learning in Halide. ACM Trans. Graph. (Proc. SIGGRAPH) 37, 4 (2018), 139:1–139:13.
Tzu-Mao Li, Michal Lukáč, Gharbi Michaël, and Jonathan Ragan-Kelley. 2020. Differentiable Vector Graphics Rasterization for Editing and Learning. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 39, 6 (2020), 193:1–193:15.
Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. 2019. Soft Rasterizer: A Differentiable Renderer for Image-based 3D Reasoning. CoRR abs/1904.01786 (2019). arXiv:1904.01786 http://arxiv.org/abs/1904.01786
Guillaume Loubet, Nicolas Holzschuch, and Wenzel Jakob. 2019. Reparameterizing discontinuous integrands for differentiable rendering. Transactions on Graphics (Proceedings of SIGGRAPH Asia) 38, 6 (Dec. 2019).
Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. 2021. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR.
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
Boris Mityagin. 2015. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276 (2015).
Boris Samuilovich Mityagin. 2020. The zero set of a real analytic function. Matematicheskie Zametki 107, 3 (2020), 473–475.
William S. Moses, Valentin Churavy, Ludger Paehler, Jan Hückelheim, Sri Hari Krishna Narayanan, Michel Schanen, and Johannes Doerfert. 2021. Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC '21). Association for Computing Machinery, New York, NY, USA, Article 61, 16 pages. https://doi.org/10.1145/3458817.3476165
Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4, Article 83 (2016), 11 pages. https://doi.org/10.1145/2897824.2925952
Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, and Andreas Geiger. 2019. Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision. CoRR abs/1912.07372 (2019). arXiv:1912.07372 http://arxiv.org/abs/1912.07372
Merlin Nimier-David, Sébastien Speierer, Benoît Ruiz, and Wenzel Jakob. 2020. Radiative Backpropagation: An Adjoint Method for Lightning-Fast Differentiable Rendering. ACM Trans. Graph. 39, 4, Article 146 (2020), 15 pages. https://doi.org/10.1145/3386569.3392406
Merlin Nimier-David, Delio Vicini, Tizian Zeltner, and Wenzel Jakob. 2019. Mitsuba 2: A Retargetable Forward and Inverse Renderer. Transactions on Graphics (Proceedings of SIGGRAPH Asia) 38, 6 (Dec. 2019). https://doi.org/10.1145/3355089.3356498
Jeong Joon Park, Peter Florence, Julian Straub, Richard A. Newcombe, and Steven Lovegrove. 2019. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. CoRR abs/1901.05103 (2019). arXiv:1901.05103 http://arxiv.org/abs/1901.05103
Ken Perlin. 2002. Improving noise. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques. 681–682.
Ken Perlin and Eric M Hoffert. 1989. Hypertexture. In Proceedings of the 16th annual conference on Computer graphics and interactive techniques. 253–262.
Savvas Sioutas, Sander Stuijk, Luc Waeijen, Twan Basten, Henk Corporaal, and Lou Somers. 2019. Schedule Synthesis for Halide Pipelines through Reuse Analysis. ACM Trans. Archit. Code Optim. 16, 2, Article 10 (2019), 22 pages. https://doi.org/10.1145/3310248
Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. 2019. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. CoRR abs/1906.01618 (2019). arXiv:1906.01618 http://arxiv.org/abs/1906.01618
J.C. Spall. 1992. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Automat. Control 37, 3 (1992), 332–341.
Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred Neural Rendering: Image Synthesis using Neural Textures. CoRR abs/1904.12356 (2019). arXiv:1904.12356 http://arxiv.org/abs/1904.12356
Ethan Tseng, Felix Yu, Yuting Yang, Fahim Mannan, Karl St. Arnaud, Derek Nowrouzezahrai, Jean-Francois Lalonde, and Felix Heide. 2019. Hyperparameter Optimization in Black-box Image Processing using Differentiable Proxies. ACM Transactions on Graphics (TOG) 38, 4 (2019). https://doi.org/10.1145/3306346.3322996
Y. Yang and C. Barnes. 2018. Approximate Program Smoothing Using Mean-Variance Statistics, with Application to Procedural Shader Bandlimiting. Comput. Graph. Forum 37, 2 (2018), 443–454.
Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. 2020. Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance. Advances in Neural Information Processing Systems 33 (2020).
Tizian Zeltner, Sébastien Speierer, Iliyan Georgiev, and Wenzel Jakob. 2021. Monte Carlo Estimators for Differential Light Transport. ACM Trans. Graph. 40, 4, Article 78 (2021), 16 pages. https://doi.org/10.1145/3450626.3459807
Cheng Zhang, Zhao Dong, Michael Doggett, and Shuang Zhao. 2021a. Antithetic Sampling for Monte Carlo Differentiable Rendering. ACM Trans. Graph. 40, 4 (2021), 77:1–77:12.
Cheng Zhang, Zihan Yu, and Shuang Zhao. 2021b. Path-Space Differentiable Rendering of Participating Media. ACM Trans. Graph. 40, 4 (2021), 76:1–76:15.
Yang Zhou, Lifan Wu, Ravi Ramamoorthi, and Ling-Qi Yan. 2021. Vectorization for Fast, Analytic, and Differentiable Visibility. ACM Trans. Graph. 40, 3, Article 27 (2021), 21 pages. https://doi.org/10.1145/3452097

A PROOF SKETCH FOR THEOREM 1

A.1 Key Math Definitions and Lemmas
In this section, we summarize the most important definitions and lemmas for the program sets discussed in Section 4.1 and briefly justify our claims. We start by defining computational singularity, which refers to the function value being undefined for any intermediate node of 𝑓.

Definition 6. A function 𝑓 ∈ 𝑒𝑑 is computationally singular at (𝑥, 𝜽) if any of its intermediate values 𝑔 satisfies one or more of:
• 𝑔 = ℎ^𝐶 where 𝐶 is a constant integer, 𝐶 < 0, and ℎ(𝑥, 𝜽) = 0
• 𝑔 = ℎ^𝐶 where 𝐶 is a constant non-integer and ℎ(𝑥, 𝜽) ≤ 0
• 𝑔 = log(ℎ) where ℎ(𝑥, 𝜽) ≤ 0

We now state the local boundedness and continuity results for the program set 𝑒𝑎. For simplicity but without loss of generality, we always assume the function is evaluated on one given sampling axis 𝑥 and other tunable parameters 𝜽. As a reminder, because the definition of the domain of 𝑓 made in Section 4.2 always excludes computational singularities, our discussion in the appendix also excludes computational singularities.

Lemma 1. A function 𝑓 ∈ 𝑒𝑎 is evaluated at (𝑥, 𝜽) ∈ dom(𝑓) ⇒ ∃ 𝜖 > 0 s.t. 𝑓, 𝜕𝑓/𝜕𝑥, and 𝜕𝑓/𝜕𝜃𝑖 are bounded and Lipschitz continuous in [𝑥 − 𝜖, 𝑥 + 𝜖].

Proof sketch: because of the construction of our DSL set 𝑒𝑎 from real analytic functions, we can always find 𝜖 > 0 so that 𝑓 ∈ 𝑒𝑎 and its gradients are real analytic on the local region. Real analyticity therefore leads to local boundedness by the boundedness theorem, and to local Lipschitz continuity by the mean value theorem ■.

To characterize when the assumptions of our gradient rules are met, we first define locally zero along the sampling axis, which may lead to pathological functions containing intermediate expressions 𝐻(𝑔) that evaluate as 𝐻(0) within a local interval. We further define two outlier scenarios where our first-order correctness properties cannot be proved.

Definition 7. A function 𝑓 is locally zero along the sampling axis (i.e. 𝑥) at (𝑥, 𝜽) if ∃ 𝜖 > 0 s.t. 𝑓(𝑥′, 𝜽) = 0 ∀ 𝑥′ ∈ (𝑥 − 𝜖, 𝑥 + 𝜖) whenever (𝑥′, 𝜽) ∈ dom(𝑓). Note also that to avoid excessively verbose notation, from here on, when there is such an 𝑥′ we always implicitly assume (𝑥′, 𝜽) ∈ dom(𝑓), i.e. we assume we are within the set of points where 𝑓 is defined.

Definition 8. A function 𝑓 ∈ 𝑒𝑐 evaluated at (𝑥, 𝜽) is symmetric along the sampling axis 𝑥 if for some intermediate value 𝑔 of 𝑓, 𝑔 is not statically continuous (defined at the end of Section 4.2), and at (𝑥, 𝜽), 𝑔 is continuous, 𝜕𝑔/𝜕𝑥 exists, 𝜕𝑔/𝜕𝑥 is not locally zero wrt 𝑥, and there exists 𝜖𝑘 > 0 s.t. for all 𝜖 ∈ (0, 𝜖𝑘], 𝑔(𝑥 + 𝛽𝜖, 𝜽) = 𝑔(𝑥 − 𝛼𝜖, 𝜽). This implies 𝜕𝑔/𝜕𝑥 = 0 at (𝑥, 𝜽).

Definition 9. A function 𝑓 ∈ 𝑒𝑐 is multi-discontinuous at (𝑥, 𝜽) ∈ dom(𝑓) if any two of its intermediate values are of the form 𝐻(𝑔𝑖), 𝐻(𝑔𝑗) such that 𝑔𝑖, 𝑔𝑗 ∈ 𝑒𝑎 evaluate to 0, and ∇𝑔𝑖, ∇𝑔𝑗 are linearly independent vectors at the given (𝑥, 𝜽), where ∇ = [𝜕/𝜕𝑥, 𝜕/𝜕𝜃1, ..., 𝜕/𝜕𝜃𝑛].

With these definitions, we can now characterize the set of points that we will show are absolutely first-order correct in Definition 10.

Definition 10. For a function 𝑓 ∈ 𝑒𝑐, (𝑥, 𝜽) ∈ 𝐶𝑖 is C-simple if at (𝑥, 𝜽), 𝑓 is continuous along the sampling axis and not symmetric along the sampling axis 𝑥.

Similarly, we want to characterize the set of points that are relatively first-order correct. However, because relative first-order correctness applies on the dilated discontinuous set 𝐷𝑖𝑟, we first need to define a remapping from dilated intervals (𝑥′, 𝜽) ∈ 𝑁𝑖𝑟(𝑥, 𝜽) back to discontinuous locations (𝑥, 𝜽) before defining D-simple in Definition 12. Note the existence of the remapping is guaranteed because 𝐷𝑖𝑟 is defined in Definition 5 as a union of 𝑁𝑖𝑟(𝑥, 𝜽).

Definition 11. ∀(𝑥′, 𝜽) ∈ 𝐷𝑖𝑟, we define a remapping 𝜏(𝑥′, 𝜽) = (𝑥, 𝜽) such that 𝑥 ∈ 𝐷𝑖 and 𝑥′ ∈ (𝑥 − 𝛽𝜖′, 𝑥 + 𝛼𝜖′) for 𝜖′ = 𝑟𝜖𝑓(𝑥, 𝜽).

Definition 12. For a function 𝑓 ∈ 𝑒𝑐, (𝑥′, 𝜽) ∈ 𝐷𝑖𝑟 is D-simple if 𝜏(𝑥′, 𝜽) is not at a multi-discontinuity.

Lemma 2. For a function 𝑓 ∈ 𝑒̄𝑐 and a parameter 𝜃𝑖, C-simple points are almost everywhere in 𝐶𝑖.

Proof sketch: We justify this using an equivalent statement: the set of points 𝑆 that are discontinuous along the sampling axis or are symmetric along the sampling axis is measure zero in 𝐶𝑖. If 𝑓 is discontinuous along the sampling axis at (𝑥, 𝜽), then for some intermediate value 𝐻(𝑔) of 𝑓 with 𝑔 ∈ 𝑒𝑎 ∪ 𝑒𝑏, 𝑔(𝑥, 𝜽) = 0. If 𝑔 ∈ 𝑒𝑏, then by recursing on the definition of 𝑒𝑏 (recursing into nodes in the graph that are discontinuous wrt 𝑥), there is also some intermediate value of the form 𝐻(ℎ) such that ℎ ∈ 𝑒𝑎 and ℎ(𝑥, 𝜽) = 0. If every intermediate value of the form 𝐻(ℎ) with ℎ ∈ 𝑒𝑎 had ℎ(𝑥, 𝜽) = 0 and ℎ locally zero wrt 𝑥 at (𝑥, 𝜽), then 𝑓 would be continuous wrt 𝑥 at (𝑥, 𝜽), a contradiction. So there exists some intermediate value of the form 𝐻(ℎ) with ℎ ∈ 𝑒𝑎 where ℎ(𝑥, 𝜽) = 0 and ℎ is not locally zero wrt 𝑥 at (𝑥, 𝜽), and due to the construction of 𝑒𝑎, ℎ is real analytic as a single-variate function wrt 𝑥 within an open interval containing 𝑥 intersected with dom(𝑓), per Def. 7 of locally zero.

Similarly, if 𝑓 is symmetric along the sampling axis at (𝑥, 𝜽), there exists some intermediate value 𝑔 ∈ 𝑒𝑐 such that 𝑔 is continuous, 𝜕𝑔/𝜕𝑥 = 0 and 𝜕𝑔/𝜕𝑥 is not locally zero. Now since 𝑔 is continuous and by Lemma 6 discontinuities are isolated along 𝑥, we can choose 𝜖 > 0 s.t. within a local region along 𝑥 (i.e. an open interval (𝑥 − 𝜖, 𝑥 + 𝜖) intersected with dom(𝑓), per Def. 7) all intermediate values of 𝑔 with the form 𝐻(·) are constant, so 𝑔 is a sum, product, and/or composition of functions that are real analytic as single-variate functions of 𝑥 within that 1D local region, so 𝑔 and 𝜕𝑔/𝜕𝑥 are real analytic single-variate functions within that local region. We can consider, at (𝑥, 𝜽), the representation of 𝜕𝑔/𝜕𝑥 as a single-variate real analytic function on the 1D local region containing 𝑥 as one possible local real analytic representation of 𝜕𝑔/𝜕𝑥: there are finitely many such local real analytic representations even across all (𝑥, 𝜽) ∈ dom(𝑓), since each 𝐻(·) has only 2 discrete values {0, 1}.

Given 𝜽, consider the set 𝑃(𝜽) of points 𝑥 s.t. (𝑥, 𝜽) ∈ 𝐶𝑖 and 𝑓 is symmetric or discontinuous along the sampling axis wrt 𝑥. Consider the set 𝑃′(𝜽) of all zeros of all 1D real analytic (wrt 𝑥) functions that we described in the last paragraph, after excluding points 𝑥 where these real analytic functions are locally zero: within the program 𝑓 there are finitely many such real analytic functions. By the previous paragraph's reasoning, 𝑃 ⊂ 𝑃′. A not identically zero 1D function that is real analytic on an interval has countable zeros, and so the analytic functions (within suitable local regions) used to construct 𝑃′ have countable zeros, so 𝜇1(𝑃′) = 𝜇1(𝑃) = 0, where 𝜇1 is the Lebesgue measure on ℝ for the 𝑥 axis. Define 𝜇𝑘 to refer to the Lebesgue measure in ℝ^𝑘. Given the measure 𝜇𝑛+1(𝑆), we can replace it with the Lebesgue integral ∫ 1𝑆 𝑑𝜇 = 𝜇𝑛+1(𝑆), apply Fubini's theorem considering 𝜇𝑛+1 as a product measure, and find 𝜇𝑛+1(𝑆) = 0 after setting the innermost integral (over the 𝑥 dimension) to 0 due to 𝜇1(𝑃) = 0. ■

We would next like to show that D-simple points are almost everywhere in 𝐷𝑖𝑟. Intuitively, we can do this because we have defined multi-discontinuities to be the intersection of two manifolds that are each in general position, so they each have dimension 𝑛, their intersection has dimension 𝑛 − 1, and the dilation operation in 𝐷𝑖𝑟 increases the dimension to 𝑛, which is still measure zero in ℝ^(𝑛+1). We next formalize this intuition in a general Euclidean setting, which we will then apply to our D-simple points.

Lemma 3. Assume 𝑈 is an open subset of ℝ^𝑚, and 𝐺1, 𝐺2 : 𝑈 → ℝ are continuously differentiable on 𝑈. Define 𝑍 = {𝑥 ∈ 𝑈 : 𝐺1(𝑥) = 𝐺2(𝑥) = 0}. Assume for each 𝑥 ∈ 𝑍, ∇𝐺1, ∇𝐺2 are linearly independent vectors at 𝑥, where we use ∇ = [𝜕/𝜕𝑥1 ... 𝜕/𝜕𝑥𝑚]. Define 𝑁(𝑍) = ∪_{𝑥∈𝑍} 𝜈(𝑥), with 𝜈(𝑥) = ((𝑥1 + 𝑎(𝑥), 𝑥1 + 𝑏(𝑥)) × {𝑥2} × {𝑥3} × ... × {𝑥𝑚}) ∩ 𝑈, 𝑎 : 𝑈 → ℝ, 𝑏 : 𝑈 → ℝ. Then 𝜇𝑚(𝑁(𝑍)) = 0, where 𝜇𝑚 is the Lebesgue measure in ℝ^𝑚.

Proof: We proceed similarly to Claim 1 of Mityagin [2015; 2020]. If 𝑥 ∈ 𝑍, by linear independence we have ||∇𝐺1(𝑥)|| ≠ 0 and ||∇𝐺2(𝑥)|| ≠ 0. Thus, 𝑍 = ∪𝑘 𝑍𝑘 for:

    𝑍𝑘 = {𝑥 ∈ 𝑍 : ||𝑥|| ≤ 𝑘, ||∇𝐺𝑖(𝑥)|| ≥ 1/𝑘 for 𝑖 = 1, 2, 𝑑(𝑥, 𝑈^𝐶) ≥ 1/𝑘}

As in Mityagin, we omit the last requirement if 𝑈 = ℝ^𝑚, i.e. 𝑈^𝐶 = ∅. Like Mityagin, we note 𝑍𝑘 is compact: clearly 𝑍𝑘 is bounded, and it can be shown to be closed by noting that for continuous 𝐹, lim_{𝑥→𝑐} 𝐹(𝑥) = 𝐹(𝑐), the nonstrict inequalities are preserved at the limit point, and so every limit point of 𝑍𝑘 is in 𝑍𝑘; so by the Heine–Borel theorem 𝑍𝑘 is compact.

If 𝑥 ∈ 𝑍, we can use the implicit function theorem to find locally an 𝑚 − 2 dimensional subspace where 𝐺1(𝑥) = 𝐺2(𝑥) = 0 if we have

full rank for the Jacobian. Specifically, consider coordinates 𝑥𝑖 and 𝑥𝑗. Then the Jacobian used in the implicit function theorem is:

    𝐽^(𝑖,𝑗) = [ 𝜕𝐺1/𝜕𝑥𝑖   𝜕𝐺1/𝜕𝑥𝑗 ]
              [ 𝜕𝐺2/𝜕𝑥𝑖   𝜕𝐺2/𝜕𝑥𝑗 ]        (6)

This has full rank when the two rows are linearly independent. If 𝑥 ∈ 𝑍, then by the linear independence of ∇𝐺1, ∇𝐺2, we can choose 𝑖 ≠ 𝑗 such that 𝐽^(𝑖,𝑗) is full rank. This can be shown by forming ∇𝐺1, ∇𝐺2 into a 2 × 𝑚 matrix of rank 2, observing that row and column rank are equal, so that matrix has two linearly independent columns, which can be taken as 𝐽^(𝑖,𝑗). Thus, by the implicit function theorem, any 𝑥 ∈ 𝑍 has a neighborhood 𝑄(𝑥) such that 𝑥 is described by coordinates in an 𝑚 − 2 dimensional manifold. If 𝑝 ∈ 𝜈(𝑥) then there exists 𝑡 ∈ [0, 1] such that 𝑝 = 𝑥 + 𝑒1(𝑎(𝑥) + 𝑡(𝑏(𝑥) − 𝑎(𝑥))), where 𝑒1 = [1, 0, 0, ..., 0], so we can parameterize 𝑁(𝑄(𝑥)) by 𝑚 − 1 dimensions. So 𝜇𝑚(𝑁(𝑄(𝑥))) = 0.

For 𝑍𝑘, consider the set of all open neighborhoods from the implicit function theorem, i.e. {𝑄(𝑥) : 𝑥 ∈ 𝑍𝑘}; this is also an open cover of 𝑍𝑘, which is compact, so choose a finite subcover: choose the cover 𝑄(𝑥1), ..., 𝑄(𝑥𝑁) associated with points 𝑥1 ∈ 𝑍𝑘, ..., 𝑥𝑁 ∈ 𝑍𝑘. So 𝜇𝑚(𝑁(𝑍𝑘)) = 0. But 𝑍 = ∪𝑘 𝑍𝑘, so by subadditivity of measures, 𝜇𝑚(𝑁(𝑍)) = 0. ■

Lemma 4. For a function 𝑓 ∈ 𝑒𝑐 and a parameter 𝜃𝑖, D-simple points are almost everywhere in 𝐷𝑖𝑟.

Proof sketch: Using the 1D dilated intervals from Definition 4, define for any set 𝑆, 𝑁𝑟(𝑆) = ∪_{𝑠∈𝑆} 𝑁𝑟(𝑠). We now justify a statement equivalent to what we want: given the set of points 𝑆𝑑 = {(𝑥, 𝜽) ∈ dom(𝑓) : 𝑓 is multi-discontinuous at (𝑥, 𝜽)}, we wish to show 𝜇𝑛+1(𝑁𝑟(𝑆𝑑)) = 0. By using (𝑥′, 𝜽) as the 𝑛 + 1 = 𝑚 dimensions of Lemma 3, we trivially get this result. Specifically, the endpoints of the 1D intervals unioned by 𝑁𝑟 become the functions 𝑎, 𝑏 in Lemma 3, and 𝜇𝑛+1(𝑁𝑟(𝑆𝑑)) is less than or equal to the sum of measures, over a finite number of choices of intermediate values 𝐻(𝑔𝑖), 𝐻(𝑔𝑗) with 𝑔𝑖 ∈ 𝑒𝑎, 𝑔𝑗 ∈ 𝑒𝑎, of the set 𝐴 of points where 𝑔𝑖, 𝑔𝑗 evaluate to zero and ∇𝑔𝑖, ∇𝑔𝑗 are linearly independent, so we use Lemma 3 with 𝐺1 = 𝑔𝑖, 𝐺2 = 𝑔𝑗, and we choose 𝑈 as follows. We choose any measurable open set 𝑈′ containing 𝑆𝑑, and use 𝑈 = 𝑈′ \ 𝐷0, where 𝐷0 is the set of points in dom(𝑓) s.t. 𝑔𝑖 = 𝑔𝑗 = 0 and ∇𝑔𝑖, ∇𝑔𝑗 are linearly dependent: this leaves only the desired above points 𝐴 in the zero set for Lemma 3. Here 𝑈 can be shown open by considering that if 𝑝 ∈ 𝑈 then (a) 𝑔𝑖 ≠ 0 or 𝑔𝑗 ≠ 0, or (b) 𝑔𝑖 = 𝑔𝑗 = 0 and ∇𝑔𝑖, ∇𝑔𝑗 are linearly independent: in case (a) we can show the neighborhood around 𝑝 exists in 𝑈 directly, and in case (b) we can choose a Jacobian in Lemma 3 with nonzero determinant, which is continuous wrt (𝑥, 𝜽), so we can likewise show the neighborhood around 𝑝 exists. Each of the measures in the finite sum over choices of 𝑔𝑖, 𝑔𝑗 is zero, so 𝜇𝑛+1(𝑁𝑟(𝑆𝑑)) = 0. ■

Because we show correctness of our approximation by comparing with a reference pre-filtered gradient, our proof also involves computing the reference. We show that if a function evaluation 𝑓(𝑥, 𝜽) is either C-simple or D-simple, it can be locally expanded into the following form to allow easy computation of the reference gradient.

Lemma 5. Local expansion: a function 𝑓1 ∈ 𝑒𝑐 is either C-simple or D-simple at (𝑥, 𝜽) ⇒ ∃𝑓2 = 𝑎 + 𝑏 · 𝐻(𝑐) with 𝑎, 𝑏, 𝑐 ∈ 𝑒𝑎 and ∃𝜖 > 0 s.t. within the set 𝑆 = [𝑥 − 𝛼𝜖, 𝑥 + 𝛽𝜖] × 𝜽, there is at most one discontinuity of 𝑓1, the function values of 𝑓1, 𝑓2 are identical within 𝑆 except at the discontinuity, and their pre-filtered gradients are identical within 𝑆. If 𝑓1 is C-simple at (𝑥, 𝜽) then 𝑏 = 0.

Proof sketch: the existence of 𝜖 can be justified because discontinuities are isolated on the 1D 𝑥 axis (Lemma 6). The local expansion can be obtained by recursively evaluating step functions that are not discontinuous at 𝑓(𝑥, 𝜽) into 0 or 1 and merging step functions that are discontinuous into the same form. This is only possible without multi-discontinuity, which is guaranteed by the point being C-simple or D-simple. Additionally, computing the reference gradient in the local expansion form requires applying Equation 1. ■

To accurately characterize the set of programs our compiler can differentiate with first-order correctness, we further define 𝑒̄𝑏, 𝑒̄𝑐, 𝑒̄𝑑 as 𝑒𝑏, 𝑒𝑐, 𝑒𝑑 excluding three exceptions: discontinuity degeneracy, part dependency on the sampling axis, and discontinuities with roots of order 3 or above, in Definitions 13–15. Except for pathological programs, these occur rarely in practice. For clarity and to emphasize their pathological nature, we presented these exceptional programs as being removed from the grammar. However, since these 3 properties can be analyzed in a pointwise manner, if desired, our main theorem also holds on such programs if dom(𝑓) is chosen such that it does not contain any points with these properties.

Definition 13. A program 𝑓 ∈ 𝑒𝑐 has discontinuity degeneracy at (𝑥, 𝜽) if 𝑓 is 𝐶0 continuous, but by static analysis the compiler classifies 𝑓 as being not 𝐶0 continuous. For instance, min() is statically classified correctly, but the composition of a 𝐶0 continuous function with a discontinuous function can give a discontinuity degeneracy: e.g. 𝑓(𝑥) = (2𝐻(𝑥) + 𝑥 − 1)². This can be simplified by a human to 𝑓(𝑥) = {(𝑥 − 1)² if 𝑥 < 0, otherwise (𝑥 + 1)²}, which is 𝐶0 but not 𝐶1; however, the compiler identifies it as being not 𝐶0 due to the Heaviside step.

Definition 14. A program 𝑓 ∈ 𝑒𝑐 has part dependency on the sampling axis 𝑥 if for some intermediate value ℎ(𝑔), where ℎ is a unary atomic function, 𝑔 is not statically continuous and statically depends on 𝑥, and 𝑔 is continuous at (𝑥, 𝜽), and 𝜕𝑔/𝜕𝑥 exists, but 𝜕𝑔/𝜕𝑥 is locally zero. For instance, 𝑓(𝑥, 𝜃) = sin(𝜃 + 𝜃 · 𝐻(𝑥)). This can result in non-first-order-correct gradients for certain non-Dirac parameters, which are not the main focus of this paper.

Definition 15. A program 𝑓 ∈ 𝑒𝑐 has a discontinuity with a 𝑝th order root along the sampling axis 𝑥 if for some intermediate value 𝐻(𝑔) with 𝑔 ∈ 𝑒𝑎, 𝑔 has a 𝑝th order zero along 𝑥, i.e. 𝜕^𝑗𝑔/𝜕𝑥^𝑗 = 0 for 𝑗 = 0, ..., 𝑝 − 1 and 𝜕^𝑝𝑔/𝜕𝑥^𝑝 ≠ 0. For instance, 𝑓(𝑥, 𝜃) = 𝐻((𝑥 + 𝜃)³) has a discontinuity with a 3rd order root at 𝑥 + 𝜃 = 0 because at those points 𝑔 = (𝑥 + 𝜃)³ has 𝑔 = 𝜕𝑔/𝜕𝑥 = 𝜕²𝑔/𝜕𝑥² = 0 and 𝜕³𝑔/𝜕𝑥³ = 6.

The above three definitions are only applied for programs in 𝑒̄𝑏 and 𝑒̄𝑐 (𝑒𝑏 ⊂ 𝑒𝑐, so the definitions can also be applied to 𝑒𝑏). Because discontinuities in 𝑒𝑑 are more difficult to characterize and our correctness claim in Theorem 1 does not guarantee anything about 𝑒̄𝑑, we do not consider such programs.

A.2 Existence of 𝜖𝑓 and 𝜖𝑖𝑟 in Theorem 1
We now characterize the 𝜖𝑓 and 𝜖𝑖𝑟 used in Theorem 1 and briefly justify their existence. We show a certain 𝜖𝑓 exists that has strong properties that are needed next in Appendix A.3 for the proof by induction over subsets of our DSL. More specifically, in Appendix A.3 we will use the 𝜖𝑓 and 𝜖𝑖𝑟 with the below properties in the proof by induction of the first-order correctness properties (including kernel sizes that depend on 𝜖𝑓 and 𝜖𝑖𝑟) for appropriate subsets of our DSL on C-simple and D-simple sets. We first establish a lemma that we will use to obtain the first property.

Lemma 6. Isolated discontinuities: Given a function 𝑓 ∈ 𝑒̄𝑐 evaluated at (𝑥, 𝜽) ∈ dom(𝑓), ∃𝜖 > 0 s.t. ∀𝑥′ ∈ [𝑥 − 𝛼𝜖, 𝑥) ∪ (𝑥, 𝑥 + 𝛽𝜖], 𝑓 is continuous at (𝑥′, 𝜽) along the 𝑥 axis.

Proof sketch: because 𝑓 is constructed from our DSL, it only contains a finite number (𝑁) of 𝐻(𝑔𝑖) with the property that no other Heaviside functions depend on 𝐻(𝑔𝑖). Without loss of generality, we will only discuss the 𝑁 = 1 case: 𝐻(𝑔) is the only source of discontinuity. The 𝑁 > 1 case can be generalized by following the 𝑁 = 1 proof, then taking the intersection of the continuous intervals associated with each 𝐻(𝑔𝑖): this allows us to recursively cover all of the functions in 𝑒̄𝑐.

Lemma 7. ∀𝑓 ∈ 𝑒̄𝑐 that is either C-simple or D-simple at (𝑥, 𝜽) ∈ dom(𝑓), ∃𝜖𝑓(𝑥, 𝜽) > 0 such that ∀𝜖 ∈ (0, 𝜖𝑓] all of the following are satisfied:
• ∀𝑥′ ∈ [𝑥 − (𝛼 + 𝛽)𝜖, 𝑥) ∪ (𝑥, 𝑥 + (𝛼 + 𝛽)𝜖] such that (𝑥′, 𝜽) ∈ dom(𝑓), 𝑓 is continuous at (𝑥′, 𝜽) along 𝑥. This is a stronger property than the RHS of Lemma 5, so it is compatible with local expansions, and, for C-simple locations, it lets us have no discontinuities in the neighborhood when showing absolute first-order correctness.
• ∀𝑔 that is an intermediate value of 𝑓 and is not locally zero wrt 𝑥, 𝑔(𝑥 + 𝛽𝜖) ≠ 𝑔(𝑥 − 𝛼𝜖). This lets our composition rule exclude a zero denominator.
• There are certain expressions 𝑒 that are used in our lemmas, derived from some intermediate values in the program 𝑓, that evaluate to a nonzero value at (𝑥, 𝜽) and do not depend on 𝜖. For these, 1/(𝑒 + 𝑂(𝜖)) = 1/𝑒 + 𝑂(𝜖). This lets us rewrite Taylor expansions into the desired form for first-order correctness.

Proof sketch: The existence of 𝜖𝑓 in the first requirement is a direct
We first discuss when 𝑓 is discontinuous wrt 𝑥 at (𝑥, 𝜽® ). Because application of the isolated discontinuities property of Lemma 6. For
𝑓 ∈ 𝑒¯𝑐 , the discontinuity 𝐻 (𝑔) satisfies either 𝑔 ∈ 𝑒𝑎 being real the second requirement, if 𝑔 is discontinuous at (𝑥, 𝜽® ), according
analytic or 𝑔 ∈ 𝑒𝑏 being piece-wise constant. We prove by induction to Lemma 5 we have 𝑔 = 𝑎 + 𝑏 · 𝐻 (𝑐). Because both 𝑎, 𝑏 are real
and start with the base case 𝑔 ∈ 𝑒𝑎 . Because 𝑓 is discontinuous analytic, 𝜖 𝑓 doesn’t exist implies 𝑏 (𝑥, 𝜽® ) = 0, which is discontinuity
at (𝑥, 𝜽® ) wrt 𝑥, ∃𝜖 ′ > 0 such that discontinuity can be sampled degeneracy and is excluded from 𝑒¯𝑐 , therefore raising a contradiction.
∀𝜖 ∈ (0, 𝜖 ′ ]: 𝐻 (𝑔(𝑥 + 𝛼𝜖, 𝜽® )) ≠ 𝐻 (𝑔(𝑥, 𝜽® )) or 𝐻 (𝑔(𝑥 − 𝛼𝜖, 𝜽® )) ≠ If 𝑔(𝑥, 𝜽® ) is continuous, then local expansion reduces to 𝑔 = 𝑎 where
𝐻 (𝑔(𝑥, 𝜽® )). Therefore 𝑔 is not locally zero (wrt 𝑥). Because real 𝑎 ∈ 𝑒𝑎 . If 𝜕𝑥𝜕𝑎 ≠ 0, because 𝜕𝑎 is real analytic and locally Lipschitz
𝜕𝑥
analytic functions that are not locally zero have isolated zeros in continuous, ∃𝜖 > 0 such that 𝑎 is monotonic in the local region,
1D (sampling axis 𝑥), ∃𝜖 > 0 𝑠.𝑡 . (𝑥, 𝜽® ) is the only zero for 𝑔(𝑥 ′, 𝜽® ) therefore the requirement is satisfied. If 𝜕𝑥 𝜕𝑎 = 0, the requirement is

within the interval 𝑥 ′ ∈ [𝑥 −𝛼𝜖, 𝑥 +𝛽𝜖], so we have the desired result violated only when 𝑓 is symmetric along the sampling axis 𝑥, which
for the case 𝑔 ∈ 𝑒𝑎 . In subsequent discussion, we refer to “within the is excluded from C-simple points, therefore raising a contradiction.
interval" as meaning fixing 𝜽® and consider the 1D restriction of g The third requirement is valid because in our proof, 𝑂 (𝜖) is always
to the x axis. Now we have this is equivalent to 𝐻 (𝑔) is continuous used to express polynomials of 𝜖 with all other terms being locally
wrt x within the interval [𝑥 − 𝛼𝜖, 𝑥) ∪ (𝑥, 𝑥 + 𝛽𝜖]. We now prove bounded. Therefore if 𝜖 𝑓 does not exist, it the means 𝑂 (𝜖) involves
the induction step of 𝑔 ∈ 𝑒𝑏 and assume if 𝑔 is discontinuous wrt a multiplication of 𝜖 with an unbounded term and contradicts how
𝑥, ∃𝜖 > 0 such that 𝑔 is continuous wrt 𝑥 in the interval [𝑥 − 𝑂 (𝜖) is constructed in the proof.
𝛼𝜖, 𝑥) ∪ (𝑥, 𝑥 + 𝛽𝜖]. Because 𝑔 is piece-wise constant, discontinuities Note in the first requirement, the local neighborhood is larger
wrt 𝑥 for 𝐻 (𝑔+ ) ≠ 𝐻 (𝑔− ) can only be sampled when 𝑔 itself is than the kernel support when pre-filtering at (𝑥, 𝜽® ). This is because
discontinuous wrt 𝑥. Therefore 𝑔 is continuous wrt 𝑥 in the interval we state relatively first-order correctness in a dilated set 𝐷𝑖𝑟 , and
[𝑥 − 𝛼𝜖, 𝑥) ∪ (𝑥, 𝑥 + 𝛽𝜖] implies 𝐻 (𝑔) is continuous wrt 𝑥 in the when we pre-filter near the boundary of 𝐷𝑖𝑟 , we still need the kernel
interval [𝑥 − 𝛼𝜖, 𝑥) ∪ (𝑥, 𝑥 + 𝛽𝜖]. support to be within dom(𝑓 ) and include only one discontinuity ■.
Now we discuss the case 𝑓 is continuous wrt 𝑥 at (𝑥, 𝜽® ). If 𝑓 ∈ 𝑒𝑎 , Once 𝜖 𝑓 is defined, the existence of 𝜖𝑖𝑟 (𝑥 ′, 𝜽® ) can be easily justified
the conclusion is trivially true. We therefore still assume 𝑓 has using the remapping in Definition 11: 𝜖 𝑟 (𝑥 ′, 𝜽® ) = 𝑟𝜖 𝑓 (𝜏 (𝑥 ′, 𝜽® ), 𝜽® ).
𝑖
intermediate values of the form 𝐻 (𝑔). And we prove by induction
similarly. For the base case 𝑔 ∈ 𝑒𝑎 , 𝐻 (𝑔) is continuous along 𝑥 either A.3 First-order Correctness Proof by Induction
indicates 𝑔 is locally zero wrt 𝑥 or 𝑔(𝑥, 𝜽® ) ≠ 0. If 𝑔 is locally zero
In this subsection, we show that Theorem 1 can be proved by induc-
wrt 𝑥, 𝑥, 𝜖 can be chosen appropriately based on the size of the
tion on different operators. We first report our conclusions for the
locally zero interval such that 𝑔 stays zero. If 𝑔(𝑥, 𝜽® ) ≠ 0, because 𝑔 base and each induction step. We then give an example of proving
is real analytic wrt 𝑥, it is locally Lipschitz continuous wrt 𝑥, which one of the induction steps. The other induction proof steps can be
justifies a local interval 𝑥 ′ ∈ [𝑥 − 𝛼𝜖, 𝑥 + 𝛽𝜖] such that 𝑔(𝑥 ′, 𝜽® ) ≠ 0. carried out similarly: we have carried them out but in the interests
For the induction step where 𝑔 ∈ 𝑒𝑏 , because discontinuities wrt 𝑥 of space do not report all of the steps of them.
are isolated and 𝑔 is piece-wise constant, ∃𝜖 > 0 𝑠.𝑡 . both [𝑥 − 𝛼𝜖, 𝑥)
and (𝑥, 𝑥 + 𝛽𝜖] are piece-wise constant, therefore 𝑓 is continuous A.3.1 Conclusions from Proofs by Induction. We first report that
wrt 𝑥 in the interval ■. our approximation is absolutely first-order correct for three base

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
135:20 • Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein

cases in Lemma 8. Because the base cases are continuous, 𝐷𝑖𝑟 is a or continuous (𝑒𝑎 ). Therefore because 𝑔 is also discontinuous wrt
null set. We therefore do not discuss relatively first-order correct. 𝑥 at 𝜏 (𝑥, 𝜽® ), we know 𝑔 ∈ 𝑒𝑏 , and can be locally expanded as
𝑔 = 𝑎𝑔 + 𝑏𝑔 · 𝐻 (𝑐) with 𝑎𝑔 , 𝑏𝑔 being constant, 𝑐 ∈ 𝑒𝑎 . Accordingly,
Lemma 8. Given 𝑓 in the following form, ∀(𝑥, 𝜽® ) ∈ dom(𝑓 ) and
𝑓 can be expanded as 𝑓 = 𝐻 (𝑎𝑔 + 𝑏𝑔 · 𝐻 (𝑐)) = sign(𝑏𝑔 ) · 𝐻 (𝑐). The
∀𝜖 > 0, 𝑓 is absolutely first-order correct.
reference gradient is computed as follows. For simplicity, we denote
𝑓 (𝑥, 𝜽® ) = 𝐶, constant C ∈ R 𝑥𝑑 , the first component of 𝜏 (𝑥, 𝜽® ) as the discontinuity location.
𝑓 (𝑥, 𝜽® ) = 𝑥
𝜕 𝑓ˆ
∫ 𝑥+𝛽𝜖
𝜕 1
𝑓 (𝑥, 𝜽® ) = 𝜃𝑖 = sign(𝑏𝑔 ) · 𝐻 (𝑐)𝑑𝑥 ′
𝜕𝜃𝑖 𝜕𝜃𝑖 (𝛼 + 𝛽)𝜖 𝑥−𝛼𝜖
𝜕𝑐
𝜕𝑐 ′ sign(𝑏𝑔 ) 𝜕𝜃𝑖
∫ 𝑥+𝛽𝜖
We now report the induction results for unary operators 𝑈 ∈ 1
= sign(𝑏𝑔 )𝛿 (𝑐) 𝑑𝑥 = |
𝜕𝑐 | 𝑥𝑑
{𝐻, ℎ} where 𝐻 denotes the Heaviside step function, and ℎ denotes (𝛼 + 𝛽)𝜖 𝑥−𝛼𝜖 𝜕𝜃𝑖 (𝛼 + 𝛽)𝜖 | 𝜕𝑥
continuous atomic functions. Because in Theorem 1, absolutely and Denominator nonzero: discontinuities with roots of order 𝑛 ≥ 2
relatively first-order correct are claimed for different set of points, are excluded from 𝐷𝑖 (𝑛 = 2) or 𝑒¯𝑐 (𝑛 ≥ 3) respectively.
their induction steps are different. As shorthand, for a function 𝑓 and (7)
𝑟 ∈ (0, 1], we use first-order correct at (𝑥, 𝜃 ) ∈ 𝐷𝑖𝑟 for intermediate
Similarly, reference gradient of 𝑔 can be computed.
value 𝑔 of 𝑓 to mean if 𝑔 is discontinuous wrt 𝜃𝑖 at 𝜏 (𝑥, 𝜽® ), then it is
relatively first order correct, and if 𝑔 is continuous wrt 𝜃𝑖 at 𝜏 (𝑥, 𝜽® ), ∫ 𝑥+𝛽𝜖
then it is absolutely first-order correct. 𝜕𝑔ˆ 𝜕 1
= (𝑎𝑔 + 𝑏𝑔 · 𝐻 (𝑐))𝑑𝑥 ′
𝜕𝜃𝑖 𝜕𝜃𝑖 (𝛼 + 𝛽)𝜖 𝑥−𝛼𝜖
Lemma 9. Given a function 𝑓 = 𝑈 (𝑔) with 𝑓 , 𝑔 ∈ 𝑒¯𝑐 , ∀(𝑥, 𝜽® ) ∈ 𝐶𝑖 𝜕𝑐
1 𝑏𝑔 𝜕𝜃 (8)
that are C-simple and ∀𝜖 ∈ (0, 𝜖 𝑓 (𝑥, 𝜽® )], if 𝑔 is absolutely first-order = 𝑖
|
𝜕𝑐 | 𝑥𝑑
(𝛼 + 𝛽)𝜖 | 𝜕𝑥
correct, then 𝑓 is absolutely first-order correct.
𝑎𝑔 , 𝑏𝑔 are constants
Lemma 10. Given a function 𝑓 = 𝑈 (𝑔) with 𝑓 , 𝑔 ∈ 𝑒¯𝑐 , ∀𝑟 ∈ (0, 1],
∀(𝑥, 𝜽® ) ∈ 𝐷𝑖𝑟 that are D-simple, if 𝑔 is first-order correct for 𝜖 = We now apply our gradient approximation and start from the LHS
𝑟𝜖 𝑓 (𝜏 (𝑥, 𝜽® ), 𝜽® ), then 𝑓 is relatively first-order correct for the same 𝜖. of Equation 3 to show its RHS. For simplicity, we assume 𝐻 (𝑐 + ) = 1
and 𝐻 (𝑐 − ) = 0. The opposite case can be easily proved similarly.
We now report the induction results for two binary operators
{+, ·}: we use the symbol ⊕ in the next two lemmas to represent 𝜕𝑂 𝑓 𝜕𝑂 𝑔 + −
either addition or multiplication. 𝜕𝜃𝑖 𝜕𝜃𝑖 /|𝑔 − 𝑔 |
=
𝜕 𝑓ˆ 𝜕𝑐
sign(𝑏𝑔 ) 𝜕𝜃
Lemma 11. Given a function 𝑓 = 𝑔 ⊕ ℎ with 𝑓 , 𝑔, ℎ ∈ 𝑒¯𝑐 , ∀(𝑥, 𝜽® ) ∈ 𝜕𝜃𝑖 (𝛼+𝛽)𝜖 | | 𝑥𝑑
𝜕𝑐
𝑖
|
𝐶𝑖 that are C-simple and ∀𝜖 ∈ (0, 𝜖 𝑓 (𝑥, 𝜽® )], if 𝑔, ℎ are absolutely
𝜕𝑥
(Applying our rule to numerator and using Eq 7 for denominator.)
first-order correct, then 𝑓 is absolutely first-order correct.
𝜕𝑂 𝑔 𝜕𝑔ˆ + − 𝜕𝑔ˆ 𝜕𝑔ˆ
𝜕𝜃𝑖 𝜕𝜃𝑖 /(|𝑔 − 𝑔 | 𝜕𝜃𝑖 ) 𝜕𝜃𝑖 (1 + 𝑂 (𝜖)/|𝑔+ − 𝑔− |
Lemma 12. Given a function 𝑓 = 𝑔 ⊕ ℎ with 𝑓 , 𝑔, ℎ ∈ 𝑒¯𝑐 , ∀𝑟 ∈ = =
𝜕𝑐 𝜕𝑐
sign(𝑏𝑔 ) 𝜕𝜃 sign(𝑏𝑔 ) 𝜕𝜃
(0, 1], ∀(𝑥, 𝜽® ) ∈ 𝐷𝑖𝑟 that are D-simple, if 𝑔, ℎ are first-order correct for 𝑖
| 𝑖
|
(𝛼+𝛽)𝜖 | 𝜕𝑐 | 𝑥𝑑 𝜕𝑐
(𝛼+𝛽)𝜖 | 𝜕𝑥 | 𝑥𝑑
𝜖 = 𝑟𝜖 𝑓 (𝜏 (𝑥, 𝜽® ), 𝜽® ), then 𝑓 is relatively first-order correct for same 𝜖. 𝜕𝑥
Apply Equation 3 to 𝑔.
As discussed in Section 4.1, the program set 𝑒𝑎 is statically con- 𝜕𝑐
𝑏𝑔 𝜕𝜃
1 𝑖
|𝑥𝑑 (1 + 𝑂 (𝜖))/|𝑔+ − 𝑔− |
tinuous, therefore for any 𝑓 ∈ 𝑒𝑎 , 𝐶𝑖 = dom(𝑓 ) is always C-simple, (𝛼+𝛽)𝜖 | 𝜕𝑥
𝜕𝑐
|
and is therefore absolutely first-order correct. Because 𝑓 ∈ 𝑒¯𝑏 is = 𝜕𝑐
sign(𝑏𝑔 ) 𝜕𝜃
piece-wise constant, both our approximation and reference gradi- 𝑖
|
𝜕𝑐
(𝛼+𝛽)𝜖 | 𝜕𝑥 | 𝑥𝑑
ents are 0 for (𝑥, 𝜽® ) ∈ 𝐶𝑖 with 𝜖 ∈ (0, 𝜖 𝑓 ], and is therefore absolutely
Using Equation 8
first-order correct. The almost everywhere results of Lemma 2 and
Lemma 4 for the C-simple and D-simple properties in 𝐶𝑖 and 𝐷𝑖𝑟 , sign(𝑏𝑔 ) · 𝑏𝑔 sign(𝑏𝑔 ) · 𝑏𝑔
= + 𝑂 (𝜖) = + 𝑂 (𝜖)
respectively, combined with the induction proof on C-simple and D- |𝑔+ − 𝑔− | |𝑎𝑔 + 𝑏𝑔 − 𝑎𝑔 |
simple points likewise immediately lead to the almost everywhere Assuming 𝐻 (𝑐 + ) = 1. Denominator nonzero because
results in Theorem 1 for 𝑒¯𝑏 and 𝑒¯𝑐 . discontinuity degeneracy is excluded from 𝑒¯𝑐 .
A.3.2 Induction Proof Example. In this section, we give an example =1 + 𝑂 (𝜖)
proof for Lemma 10 with the unary operator 𝑈 = 𝐻 and assuming 𝑔 (9)
is also discontinuous wrt 𝑥 at 𝜏 (𝑥, 𝜽® ). Other cases or other lemmas
in Section A.3.1 can all be proved similarly. B MOTIVATION FOR FUNCTION COMPOSITION RULE
Because 𝑓 ∈ 𝑒¯𝑐 , based on the DSL construction, the input ar- In this section, we further motivate our choice of the function com-
guments to step functions can either be piece-wise constant (𝑒𝑏 ) position rule in Table 2. For function compositions ℎ ◦ 𝑔, this rule

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
A𝛿 : Autodiff for Discontinuous Programs – Applied to Shaders • 135:21

applies to atomic continuous functions ℎ. The compiler chooses statically identify they are 𝐶 0 continuous, their gradients through
between two equations to avoid numerical instability while differ- the branching always use the AD gradient rule.
entiating discontinuous functions. Similar to multiplication, directly
applying the AD rule to discontinuities leads to incorrect results. C.2 Boolean Conditions for Ternary Select
For example, the same function 𝑓 = 𝐻 (𝑥 +𝜃 ) · 𝐻 (𝑥 +𝜃 ) we discussed This section discusses specialized rules for branching conditions in
for multiplication can also be expressed as 𝑓 = 𝐻 (𝑥 +𝜃 ) 2 , which can 𝐹 = select(𝐵, 𝑙, 𝑟 ) such that 𝐵 can be decomposed into the Boolean
be viewed as applying a square function (·) 2 to 𝐻 (𝑥 +𝜃 ). Naively ap- expressions of 𝑛 inequality clauses: 𝐶𝑖 = 𝑐𝑖 > 0, 𝑖 = 1, ..., 𝑛.
plying the AD function composition rule to this function combined Similar to sampling floating point values at two ends of the fil-
with our Heaviside step gradient rule results in the following. tering kernel, we can also sample Boolean values such as 𝐵 +, 𝐵 − .
𝜕𝐴𝐷 𝑓 1 𝐻 (𝑥 + 𝜃 ) 𝜕 𝑓ˆ Disagreeing Boolean samples indicates the presence of disconti-
= 2𝐻 (𝑥 + 𝜃 ) = ≠ nuity, such as when 𝐵 + ⊕ 𝐵 − evaluates to true, where ⊕ indicates
𝜕𝜃 |2𝜖 | 𝜖 𝜕𝜃
XOR. When a discontinuity is sampled, our single discontinuity
The result from using the AD function composition is again incorrect
assumption allows us to infer that every discontinuous Boolean
and similarly to multiplication: this is due to AD always being biased
clause depends on the same underlying floating point valued func-
to one branch or the other. On the contrary, our composition rule
tion. Therefore, if we further sample every Boolean clause used in
samples on both branches and is robust at discontinuities.
𝐹 and identify two different clauses 𝐶𝑖 , 𝐶 𝑗 disagree simultaneously:
𝜕𝑂 𝑓 𝐻+ − 𝐻− 1 1 𝜕 𝑓ˆ (𝐶𝑖+ ⊕ 𝐶𝑖− ) ∧ (𝐶 +𝑗 ⊕ 𝐶 −𝑗 ) evaluates to true, then we assume 𝐶𝑖 and
= + = =
𝜕𝜃 𝐻 − 𝐻 − |2𝜖 | 2𝜖 𝜕𝜃 𝐶 𝑗 are equivalent at the current location. Our rule will traverse and
sample each Boolean clause 𝐶𝑖 in an arbitrary order, and replace
C DETAILS FOR TERNARY SELECT OPERATOR 𝜕𝐻 (𝑝) 𝜕𝐻 (𝑐𝑖 )
𝜕𝜃𝑖 in Equation 11 with 𝜕𝜃𝑖 for the first clause 𝐶𝑖 where a
C.1 General Ternary Select Operator discontinuity is sampled.
In this section, we discuss branching with inequality conditions that
can be written in the form D DETAILS FOR IMPLICITLY DEFINED GEOMETRY
𝐹 (𝑝, 𝑙, 𝑟 ) = select(𝑝 > 0, 𝑙, 𝑟 ) = 𝑟 + (𝑙 − 𝑟 ) · 𝐻 (𝑝) (10) In shader programs, a common way to define geometry is to encode
it as an implicit surface, or the zero set of some mathematical func-
Branching with complex Boolean expressions will be discussed next
tion, and iteratively estimate ray-geometry intersections through
in Section C.2. The rule developed in this section can also be used
methods like ray marching [Perlin and Hoffert 1989] or sphere
in all inequality and equality comparisons: 𝑝 > 𝑞 is equivalent to
tracing [Hart 1996]. While ray marching and sphere tracing loops
𝑝 − 𝑞 > 0, 𝑝 < 0 can be rewritten as −𝑝 > 0, and select(𝑝 ≥ 0, 𝑙, 𝑟 ) is
themselves are programs, and can be differentiated using rules in-
equivalent to select(−𝑝 > 0, 𝑟, 𝑙). The equality comparison 𝑝 == 𝑞
troduced in Section 2, this usually results in a long gradient tape
can be written as the Boolean expression 𝑝 ≥ 𝑞 ∧ 𝑝 ≤ 𝑞, which can
because the number of loop iterations can be arbitrarily large. As
be differentiated by the rules discussed next in Section C.2.
an alternative, we can bypass the root finding process and directly
Instead of directly applying the multiplication rule in Table 2, we
approximate the gradient using the implicit function theorem. Our
design a specialized rule for branching that uses a similar register
rule is motivated by [Yariv et al. 2020]: we extend their result for
space as if the gradient is approximated by AD. The specialized
differentiating points that lie on the zero set of the implicit geome-
rule utilizes the fact that our pre-filtering kernels 𝑈 [−Δ𝑥, 0] and
try to differentiating discontinuities caused by object silhouettes or
𝑈 [0, Δ𝑥] are always between the evaluation location and a neigh-
interior edges. Similarly, [Li et al. 2020] apply the implicit function
boring location on the sampling grid, therefore for an intermediate
theorem to differentiate discontinuities caused by 2D line strokes.
value 𝑔, 𝑔(𝑥) = 𝑔+ or 𝑔(𝑥) = 𝑔− always holds. For simplicity, we
Their result, however, is limited to a specific type of function in 2D
denote the samples at two ends of the prefiltering kernel as 𝑔, 𝑔𝑛 ,
(𝑛-th order polynomials). [Gargallo et al. 2007] also develop a gradi-
where 𝑔 = 𝑔(𝑥) and 𝑔𝑛 is evaluated at the neighoring location at the
ent rule for visibility change to the implicitly defined surface. Their
other end of the kernel support. The gradient rule for the efficient
derivation only applies to C1 continuous geometry (Appendix D.1)
ternary select operator is shown in Equation 11.
and does not handle discontinuities caused by the intersection of
𝜕𝑂 𝐹 𝜕 𝐻 (𝑝) 𝜕 𝑙 𝜕 𝑟 surfaces (Appendix D.2). Unlike previous methods, our rule can
= (𝑙𝑛 − 𝑟𝑛 ) 𝑘 + select(𝑝 > 0, 𝑘 , 𝑘 ) (11) differentiate the discontinuities generated by implicit geometries
𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝜃𝑖
Note Equation 11 is equivalent to differentiating 𝐹 in its right hand represented as arbitrary signed distance functions.
side form in Equation 10, but applying a simplified multiplication We assume the geometry is implicitly defined by the signed dis-
rule as in Equation 12. This results in an identical first-order correct- tance function that depends on 3D locations 𝒑® and scene parameters
ness property as differentiating 𝐹 using the original multiplication 𝜽® : 𝑓 ( 𝒑® (𝑥, 𝑦), 𝜽® ) = 0. The 3D locations further depends on image
rule, which can be proved using a similar induction step. coordinates (𝑥, 𝑦), as they are on the rays casting from the camera to
the geometry: 𝒑® = 𝒐® (𝑥, 𝑦) +𝑡 · 𝒅® (𝑥, 𝑦) where 𝑜, 𝑑, 𝑡 are camera origin,
𝜕(𝑔 · ℎ) 𝜕𝑔 𝜕ℎ ray direction, and distance from camera to geometry respectively.
=ℎ + 𝑔𝑛 (12)
𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝜃𝑖 Because our implementation uses 𝑥, 𝑦 as sampling axes, we will
For special operators such as min() and max(), although they discuss differentiating the geometry discontinuities by pre-filtering
are also expanded using a ternary select, because the compiler can along the 𝑥 axis and asuming 𝑦 is a fixed constant. Given arbitrary

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
135:22 • Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein

𝜽® , the discontinuity location 𝑥𝑑 (𝜽® ) along 𝑥 axis can be defined as Our derivation starts with assuming the ray is at the zero set for
for a local neighborhood around 𝑥𝑑 , 𝑥 < 𝑥𝑑 and 𝑥 > 𝑥𝑑 evaluates both 𝑓0 and 𝑓1 .
to different branches of the geometry. For silhouettes, this indicates
the Boolean on whether the ray has hit the geometry is evaluated 𝑓0 ( 𝒑® , 𝜽® ) = 0
differently; and for interior edges, this corresponds to 𝑓 evaluates (17)
to different branches at different sides of 𝑥𝑑 . Therefore, the discon- 𝑓1 ( 𝒑® , 𝜽® ) = 0
tinuities can be represented as 𝐻 (𝑥 − 𝑥𝑑 ). In the forward pass, the We now differentiate wrt 𝜃𝑖 to both equations in Equation 17 .
user specifies the SDF using a RaymarchingLoop primitive and the
compiler automatically expands a ray marching loop to approach
𝜕𝑓0 𝜕𝑓0 𝜕®𝒐 𝜕®𝒐 𝜕𝑥𝑑 𝜕 𝒅® 𝜕 𝒅® 𝜕𝑥𝑑 ® 𝜕𝑡
the zero set of the SDF. The value of 𝑥𝑑 is never explicitly com- +< , + +𝑡 +𝑡 +𝒅 >=0
puted in the program. In the backward pass, the discontinuity is
𝜕𝜃𝑖 𝜕𝒑® 𝜕𝜃𝑖 𝜕𝑥𝑑 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝑥𝑑 𝜕𝜃𝑖 𝜕𝜃𝑖
(18)
sampled by evaluating Boolean conditions described above, and 𝜕𝑓1 𝜕𝑓1 𝜕®𝒐 𝜕®𝒐 𝜕𝑥𝑑 𝜕 𝒅® 𝜕 𝒅® 𝜕𝑥𝑑 ® 𝜕𝑡
+< , + +𝑡 +𝑡 +𝒅 >=0
back-propagation is carried out by computing 𝜕𝑥 𝜕𝜃𝑖 . The compiler
𝑑
𝜕𝜃𝑖 𝜕𝒑® 𝜕𝜃𝑖 𝜕𝑥𝑑 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝑥𝑑 𝜕𝜃𝑖 𝜕𝜃𝑖
will classify the cause of the discontinuity based on whether the 𝜕𝑡 .
Next we can rearrange Equation 18 to separate out 𝜕𝜃
camera ray intersects a locally C1 smooth geometry or not, and 𝑖

applies the implicit function theorem to each case described in Ap- 𝜕𝑓0 𝜕𝑓0 𝜕 𝒐® 𝜕 𝒐® 𝜕𝑥𝑑 𝜕 𝒅® 𝜕 𝒅® 𝜕𝑥𝑑
pendix D.1 and D.2. 𝜕𝑡 𝜕𝜃𝑖 + < 𝜕𝒑® , 𝜕𝜃𝑖 + 𝜕𝑥𝑑 𝜕𝜃𝑖 + 𝑡 𝜕𝜃𝑖 + 𝑡 𝜕𝑥𝑑 𝜕𝜃𝑖 >
=−
< 𝜕𝒑®0 , 𝒅® >
𝜕𝜃𝑖 𝜕𝑓
D.1 Camera ray intersecting locally C1 smooth geometry.
(19)
In this case, the discontinuity is implicitly defined by the ray hitting 𝜕𝑓1 𝜕𝑓1 𝜕 𝒐® 𝜕 𝒐® 𝜕𝑥𝑑 𝜕 𝒅® 𝜕 𝒅® 𝜕𝑥𝑑
𝜕𝑡 𝜕𝜃𝑖 + < 𝜕𝒑® , 𝜕𝜃𝑖 + 𝜕𝑥𝑑 𝜕𝜃𝑖 + 𝑡 𝜕𝜃𝑖 + 𝑡 𝜕𝑥𝑑 𝜕𝜃𝑖 >
the zero set of the function 𝑓 while perpendicular to the normal =−
< 𝜕𝒑®1 , 𝒅® >
𝜕𝜃𝑖 𝜕𝑓
direction of 𝑓 at the intersection.
Because the left hand sides of the two lines in Equation 19 are
𝑓 ( 𝒑® , 𝜽® ) =0 (13a) identical, their right hand side terms should also be identical. We
𝜕𝑓 ® can therefore derive 𝜕𝑥𝜕𝜃 as in Equation 20.
𝑑
𝑖
< , 𝒅 >=0 (13b)
𝜕𝒑®
𝜕𝑥𝑑 𝑎 −𝑏
=−
We can now differentiate wrt an arbitrary parameter 𝜃𝑖 on both 𝜕𝜃𝑖 𝑐 −𝑑
sides of Equation 13a. 𝜕𝑓0 𝜕𝑓0 𝜕®𝒐 𝜕 𝒅® 𝜕𝑓1 ®
𝑎 =( +< , +𝑡 >) < ,𝒅 >
𝜕𝜃𝑖 𝜕𝒑® 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝒑®
𝜕𝑓 𝜕𝑓 𝜕®𝒐 𝜕®𝒐 𝜕𝑥𝑑 𝜕 𝒅® 𝜕 𝒅® 𝜕𝑥𝑑 ® 𝜕𝑡 𝜕𝑓1 𝜕𝑓1 𝜕®𝒐 𝜕 𝒅® 𝜕𝑓0 ®
+< , + +𝑡 +𝑡 +𝒅 >= 0 (14) 𝑏 =( +< , +𝑡 >) < ,𝒅 > (20)
𝜕𝜃𝑖 𝜕𝒑® 𝜕𝜃𝑖 𝜕𝑥𝑑 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝑥𝑑 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝒑® 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝒑®
Equation 14 can be simplified by inserting Equation 13b. 𝜕𝑓0 𝜕®𝒐 𝜕 𝒅® 𝜕𝑓1 ®
𝑐=< , +𝑡 >< ,𝒅 >
𝜕𝒑® 𝜕𝑥𝑑 𝜕𝑥𝑑 𝜕𝒑®
𝜕𝑓 𝜕𝑓 𝜕®𝒐 𝜕 𝒅® 𝜕𝑓 𝜕®𝒐 𝜕 𝒅® 𝜕𝑥 𝜕𝑓1 𝜕®𝒐 𝜕 𝒅® 𝜕𝑓0 ®
+< , +𝑡 >+< , +𝑡 > 𝑑 = 0 (15) 𝑑=< , +𝑡 >< ,𝒅 >
𝜕𝜃𝑖 𝜕𝒑® 𝜕𝜃𝑖 𝜕𝜃𝑖 𝜕𝒑® 𝜕𝑥𝑑 𝜕𝑥𝑑 𝜕𝜃𝑖 𝜕𝒑® 𝜕𝑥𝑑 𝜕𝑥𝑑 𝜕𝒑®
Rearranging Equation 15 results in Equation 16. D.3 Efficient back-propagation algorithm
® One major challenge for efficiently implementing gradients dis-
𝜕𝑓 𝜕 𝒐®
𝜕𝑓 𝜕𝒅
𝜕𝑥𝑑 𝜕𝜃𝑖 + < 𝜕𝒑® , 𝜕𝜃𝑖 + 𝑡 𝜕𝜃𝑖 > cussed in Section D.1 and D.2 is that at compile time, it is unknown
=− (16) whether the silhouette is caused by a smooth surface, or the inter-
®
𝜕𝜃𝑖 < , 𝜕 𝒐® + 𝑡 𝜕𝒅 >
𝜕𝑓
𝜕𝒑® 𝜕𝑥𝑑 𝜕𝑥𝑑 section of two arbitrary sub-surfaces. For example, if an implicit
function 𝑓 is defined by applying CSG operations to 𝑛 C1 smooth
D.2 Camera ray intersecting C1 discontinuous geometry. sub-functions and we assume the discontinuity is caused by an in-

Many implicit functions are only C0 continuous. For example, con- tersection, at compile time we have 𝑛2 possible combinations 𝑓0
structive solid geometry (CSG) operators such as union or intersec- and 𝑓1 in Equation 20 as the decision of which two surfaces are
tion use max or min to combine different implicit functions, causing intersecting is not determined until run time. Because our backends
the resulting function to be C0 continuous. These operations can always take all branches, the naive implementation would have
generate silhouette or interior edges whenever two smooth surfaces 𝑂 (𝑛 2 ) complexity.
intersect. For example, a box can be viewed as the intersection of In this section, we provide a specialized algorithm based on our
multiple implicitly defined half-spaces. In this section, we assume observations of implicit scene representations. Note that we make
the intersection is generated by two C1 continuous surfaces 𝑓0 and heuristic assumptions to implicit functions that may be violated by
𝑓1 . Efficiently determining 𝑓0, 𝑓1 is discussed in Appendix D.3. a carefully designed counter-example. The algorithm therefore only

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
A𝛿 : Autodiff for Discontinuous Programs – Applied to Shaders • 135:23

serves as a heuristic optimization that improves the efficiency in dif- such that the new conditions evaluates to a different branch that
ferentiating most implicit scenes. Any implicit function that violates represents 𝑓1 . This is achieved based on another observation: a
our heuristic assumption can still be back-propagated through their ray hitting the intersection of 𝑓0 and 𝑓1 indicates 𝑓 makes a “close
root finding process using the gradient rules described in Section 4.2. decision". This means for some branch min(𝑎, 𝑏) (or max), the values
Our compiler makes the assumption that the branching decision of 𝑎 and 𝑏 must be almost identical: 𝑓 should still evaluate to 0 if we
in the implicit functions are represented as min or max, and all values flip the condition and chooses the other branch and hit 𝑓1 instead.
being compared with are scaled similarly (e.g. they all represent the To correctly modify the branching, our compiler iterates through
Euclidean distance to some primitives). This assumption holds for every condition and finds the one c∗ with minimum absolute differ-
most implicit scene representations, such as all the signed distance ence on both sides of the comparison (with 𝑂 (𝑛) complexity, 𝑛 being
fields and the CSG operators described in [Inigo Quilez 2021]. the number of min/max operators). If the condition c∗ evaluates
Our compiler first statically analyzes whether a certain branch to branch 𝑎 instead of branch 𝑏, intuitively we can simply set 𝑎 to
depends on an implicitly defined geometry discontinuity and back- infinity (for min) or negative infinity (for max) so that branch 𝑏 is
propagates accordingly. For geometry discontinuities, the compiler forced to be taken. However, this introduces artifacts if 𝑎 is used
further classifies whether the discontinuity is due to the camera elsewhere. We instead invert every condition where 𝑎 gets chosen.
ray is intersecting a C1 smooth geometry (Section D.1) or not (Sec-
tion D.2). For intersecting surfaces, we dynamically modify the D.4 Caveat when combined with Random Variables
branching condition in the the signed distance function to obtain 𝑓0 The random variables introduced in Section 7.2 cannot be combined
and 𝑓1 . The rest of this section discusses each component in detail. with the RaymarchingLoop primitive because key assumptions made
for its gradient derivation will be violated by the introduction of
Sampling geometry discontinuities. To identify implicitly defined the random variables. On one hand, the specialized rule makes the
geometries, our DSL introduces a new primitive: RaymarchingLoop. assumption that at the discontinuity of the geometry, the ray is either
It is defined using camera orientation 𝒐® , ray direction 𝒅® and the user- tangent to the surface or it is on the zero set of two sub-surfaces.
defined signed distance function (SDF) 𝑓 and outputs a Boolean These will approximately hold as long as the image resolution is
condition 𝐵 on whether the loop has converged, as well as the high enough. On the other hand, the random variables generate
distance 𝑡 from the camera to geometry. Optionally, the SDF can be great variation in the scene parameters, therefore neighboring pixels
defined to output the surface normal and sub-surface labelling using on different sides of the geometry discontinuity may correspond to
identical branching conditions as the original SDF. In the forward 3D points that are actually very far from the silhouette, violating
pass, the SDF will be evaluated at 𝒑® = 𝒐® + 𝑡 · 𝒅® for these values. the assumptions made in Appendix D.1 and D.2. For these reasons
In the forward pass, the compiler automatically expands the prim- we always disable the random variables for Dirac parameters that
itive into a loop that finds the zero set to the implicit function using RaymarchingLoop primitive depends upon.
ray marching. In the backward pass, we sample the silhouette discon- It is possible to combine random variable with implicitly defined
tinuity by evaluating 𝐵, and interior edge discontinuity is found by geometry by resorting to the general gradient rules introduced in
sampling all branch conditions defined in 𝑓 evaluated at 𝒑® = 𝒐® +𝑡 · 𝒅.® Sections 4.2 and 6.2. Because ray marching loop can have arbitrarily
Because 𝑡 is the output of the black-box primitive RaymarchingLoop, many iterations, the gradient program can be inefficient due to the
it is viewed as a discontinuous function 𝑡 = 𝑠𝑒𝑙𝑒𝑐𝑡 (𝑥 > 𝑥𝑑 , 𝑡 −, 𝑡 + ) long tape. Note that for simple geometries, it is also possible to
whenever a silhouette or interior edge discontinuity is sampled at analytically compute ray object intersection.
𝑥𝑑 . However, if the interior edge is caused by the camera ray being
tangent to a foreground geometry without changing any Boolean E DETAILS FOR QUANTITATIVE METRIC
conditions in 𝑓 , our method will fail to sample such a discontinuity. Because different methods make different prefiltering assumptions,
the kernel 𝐾 for the LHS ∇(𝐾 ∗ 𝑓 ) = 𝐾 ∗ (∇𝑓 ) of Equation 4 is chosen
Classifying geometry discontinuities. For discontinuities that de- differently for different methods so all methods give the same desired
pend on RaymarchingLoop, the compiler applies Section D.1 if the kernel 𝐾 ∗ when considering both the prefiltering internal to the
camera ray is tangent to the geometry and Section D.2 otherwise. method and the prefiltering associated with 𝐾.
We use a simple classification that works well in all our experiments: Ours: as stated in Section 5, our gradient is approximating that of
𝜕𝑓 ®
a ray is tangent to the geometry if ⟨ 𝜕𝒑® , 𝒅⟩ ≤ 0.1 a pre-filtered function using a 1D box kernel, whose size is identical
to either dimension of 𝐾. Because 𝐾 = 𝐾𝑥 ∗ 𝐾𝑦 ∗ 𝐾𝜽® is a separa-
Identifying intersecting sub-surfaces. If the geometry discontinuity ble kernel, assuming our gradient is pre-filtered on the 𝑦 axis, the
is classified as caused by intersection of sub-surfaces 𝑓0 and 𝑓1 , our integrand can be estimated as follows.
algorithm modifies the branching condition in 𝑓 so that evaluating
(𝐾𝑥 ∗ 𝐾𝑦 ∗ 𝐾𝜽® ) ∗ (∇𝑓 ) = (𝐾𝑥 ∗ (∇𝑓 )) ∗ 𝐾𝑦 ∗ 𝐾𝜽® ≈ ∇𝑂 𝑓 ∗ 𝐾𝑦 ∗ 𝐾𝜽®
Equation 20 scales linearly to the number of sub-surfaces.
We start by observing that at least one of the sub-surfaces can be Here ≈ indicates only the error made by our approximation. We
easily differentiated by directly applying AD to the original function obtain a similar result 𝐾𝑥 ∗∇𝑂 𝑓 ∗𝐾𝜽® when ours is prefiltered along y.
𝑓 . If we denote 𝑓0 as the sub-surface chosen by the current branching In practice, because our approximation adaptively chooses between
configuration, applying AD to 𝑓 is equivalent to differentiating 𝑓0 . 𝑥 and 𝑦 as sampling axes (Section 6.1), for a given pixel, the integrand
Differentiating the other intersecting sub-surface 𝑓1 is more tricky. is computed by first deciding which axis to prefilter, then sampling
The compiler modifies the branching conditions computed in 𝑓 , along the orthogonal axis to compute the integrand.

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.
135:24 • Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein

Target Ours DVG

(a) Rendering (b) Ours (c) DVG (d) FD

Fig. 15. For a single channel rendering of the ring shader (a), we evaluate
Circle the gradient of pixel color wrt the radius parameter and generate per pixel
gradient maps for ours (b), DVG (c) and finite difference (FD) (d). Red
indicates the gradient is positive and blue indicates negative. Our gradient
agrees with FD, while DVG’s gradient for the inner circle has opposite
direction as ours and FD.

But the optimization is still carried out automatically to find the


Ring parametric representation of the rope.
We manually pick 6 key frames for the rope shown as Knot A
Figure 12 and 8 key frames for the second rope. Because these
Fig. 14. Comparison of the optimized result for ours and DVG. Because
each task restarts with 100 random initializations, here we show for each animations are expressed as a combination of filled shapes / strokes
task, the result whose error corresponds to the median of 100 restarts. The in html, we further increase the stroke width so that the dark rope
target images are rendered with slightly elliptical shapes to avoid either edge is more salient. This greatly helps optimizing depth when a
ours or DVG forming a perfect reconstruction. For the circle shader, both rope overlaps with itself, because the filled color is identical for both
Ours and DVG converge close enough to the target image. But for the ring overlapping pieces, and the edge stroke is the only cue for resolving
shader, DVG is unable to converge because of a bug discussed in Figure 15. the depth correctly.
To ensure semantic continuity, our framework optimizes key
frames in sequential order, and always initializes the new optimiza-
FD and variants: because FD is approximated based on the
tion based on parameters from the previous key frame. For every
original function value, the integrand is directly sampled using 𝐾.
key frame, our framework first classifies whether a new rope seg-
TEG: their compiler handles the pre-filtering of image space as
ment should be introduced: at the first key frame or when a new
integration, and similar to ours, they resort to sampling for any
color (representing a different rope) appears in the current frame. If
extra integration after a single integral of Dirac delta functions. To
new rope is not needed, our framework further decides whether a
avoid incurring too many samples, pre-filtering in parameter space
new segment should be added as appending to the tail of the rope,
is handled by Monte-Carlo sampling implemented outside their DSL
or subdividing the existing last segment. This is done by randomly
similar to ours and FD.
sample new spline segments and evaluate their L2 loss with the
Our compiler will randomly sample a unit ray in the parameter
target image. If the loss is always larger than without adding the
space and sample 𝜽® 0, 𝜽® 1 by extending the random ray in both direc-
new segment, we initialize by subdividing the existing rope, other-
tions to a fixed distance around a center parameter, then integrate
wise, we choose 5 sampled configurations with minimum loss as
along a straight line segment between 𝜽® 0, 𝜽® 1 . We always use 105 initialization. If a new rope is added, we manually click on the two
samples for the RHS, 104 samples for the quadrature of the LHS, and ends of the new rope and use the coordinates for initialization. This
1 sample for pre-filtering the integrand of the LHS. We find the noisy helps both to increase the convergence rate compared to random
integrand estimate is smoothed out because the outer integration is initialization, as well as indicating the direction of the rope, so that
sampled so densely. However, for TEG, we use 10 samples for the new segments are initialized from the correct end of the rope for
inner integrand of the LHS because TEG constructs the pre-filtering the next key frames. At the last key frame, an additional optimiza-
in image space as a two-dimensional integration, and the symbolic tion process is applied to search for depth-related parameters only.
integration of the Dirac delta only eliminates the inner integral. Because our spline representation is not pixel-wise perfectly recon-
Therefore, the outer integral is sampled by additional quadrature structing the target animation frames, the overlapping regions may
using their implementation of the trapezoid rule, to avoid aliasing not correspond exactly. This occasionally lead to the problem that
and high error that occur with smaller sample counts. the optimal depth parameters in the L2 sense do not correspond to
human intuition, as they may encourage intersection of ropes to
F ADDITIONAL FIGURES COMPARING WITH DVG
lower the loss objective. Therefore, human effort may be involved
Here we include Figure 14 and 15 comparing ours and DVG, as to reject these optimization results. Finally, the optimized depth
discussed in Section 8.1.2. parameters will be passed back to all previous optimized key frames
to ensure correct layering throughout the animation.
G DETAILS FOR ROPE EXPERIMENT IN SECTION 8.3.1
We imagine the rope application can be useful in some interactive
applications, therefore it involves several simple manual decisions.

ACM Trans. Graph., Vol. 41, No. 4, Article 135. Publication date: July 2022.

You might also like