
Native shader compilation with LLVM

Mark Leone

Why compile shaders?


RenderMan's SIMD interpreter is hard to beat.

Amortizes interpretive overhead over batches of points.
Shading is dominated by floating-point calculations.

SIMD interpreter
For each instruction in shader:
    Decode and dispatch instruction.
    For each point in batch:
        If runflag is on:
            Load operands. Compute. Store result.

SIMD interpreter: example inner loop


void add(int numPoints, bool* runflags,
         float* dest, float* src1, float* src2)
{
    for (int i = 0; i < numPoints; ++i) {
        if (runflags[i])
            dest[i] = src1[i] + src2[i];
    }
}

SIMD interpreter: benefits


Interpretive overhead is amortized (if batch is large).
Uniform operations can be executed once per batch.
Derivatives are easy: neighboring values are always ready.

SIMD interpreter: drawbacks


Low compute density, poor instruction-level parallelism:
    load, compute, store, repeat.
Poor locality, high memory traffic:
    intermediate results are stored in memory, not registers.
High overhead for small batches.
Difficult to vectorize (pointers and conditionals).

Compiled shader execution


For each point in batch:
    Load inputs.
    For each instruction in shader:
        Compute.
    Store outputs.
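To make the contrast concrete, here is a minimal C++ sketch of compiled batch execution for a hypothetical two-instruction shader (the function and variable names are illustrative, not RenderMan's):

// The whole shader body is inlined into one loop over the points.
// The intermediate t lives in a register and is never written to memory.
void shadePoint(int numPoints, float* dest,
                const float* a, const float* b, const float* c)
{
    for (int i = 0; i < numPoints; ++i) {
        float t = a[i] + b[i];   // instruction 1: result stays in a register
        dest[i] = t * c[i];      // instruction 2: one store per point
    }
}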

Benefits of native compilation


Eliminates interpretive overhead:
    good for small batches.
Good locality and register utilization:
    intermediate results are stored in registers, not memory.
Good instruction-level parallelism:
    instruction scheduling avoids pipeline stalls.
Vectorizes easily.

Issues: batch shading


Use vectorized shaders on small batches.
Uniform operations: once per grid, not once per point.
    Some are very expensive (e.g. plugin calls).
Derivatives: need "previously" computed values from neighboring points.
    RSL permits derivatives of arbitrary expressions.

Why vectorize?
Consider batch execution of a compiled shader:

For each point in batch:
    Load inputs.
    For each instruction in shader:
        Compute.
    Store outputs.

Why vectorize?
Consider batch execution of a vectorized shader:

For each block of 4 or 8 points in batch:
    Load inputs.
    For each instruction in shader:
        Compute on vector registers (with mask).
    Store outputs.
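As a rough sketch (assuming SSE intrinsics, with the runflags pre-expanded into per-lane masks; the name add4 is illustrative), the interpreter's add loop widened to four points per iteration might look like this:

#include <xmmintrin.h>

void add4(int numBlocks, const __m128* mask,
          __m128* dest, const __m128* src1, const __m128* src2)
{
    for (int i = 0; i < numBlocks; ++i) {
        // Add four points at once.
        __m128 sum = _mm_add_ps(src1[i], src2[i]);
        // Masked update: keep the old dest value in lanes whose runflag is off.
        dest[i] = _mm_or_ps(_mm_and_ps(mask[i], sum),
                            _mm_andnot_ps(mask[i], dest[i]));
    }
}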

Simple vector code generation


Consider using SSE instructions only for vectors and matrices:

float dot(vector v1, vector v2) {
    vector v0 = v1 * v2;
    return v0.x + v0.y + v0.z;
}

load4 r1, [v1]
load4 r2, [v2]
mult4 r3, r1, r2
move  r0, r3.x
add   r0, r3.y
add   r0, r3.z

Vector utilization

[Figure: vector utilization. In this scheme only three of the four SSE lanes carry useful data, and the final adds are scalar.]

Shader vectorization
To vectorize, first scalarize:

float dot(vector v1, vector v2) {
    vector v0 = v1 * v2;
    return v0.x + v0.y + v0.z;
}

becomes

float dot(vector v1, vector v2) {
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}

Scalar code generation


Next, generate ordinary scalar code:

float dot(vector v1, vector v2) {
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}

load r1, [v1.x]
load r2, [v2.x]
mult r0, r1, r2
load r1, [v1.y]
load r2, [v2.y]
mult r3, r1, r2
add  r0, r0, r3
load r1, [v1.z]
load r2, [v2.z]
mult r3, r1, r2
add  r0, r0, r3

Vectorize for batch of four


Finally, widen each instruction for a batch size of four:

float dot(vector v1, vector v2) {
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}

load4 r1, [v1.x]
load4 r2, [v2.x]
mult4 r0, r1, r2
load4 r1, [v1.y]
load4 r2, [v2.y]
mult4 r3, r1, r2
add4  r0, r0, r3
load4 r1, [v1.z]
load4 r2, [v2.z]
mult4 r3, r1, r2
add4  r0, r0, r3

Vector utilization is now full: every lane carries one point's data.

Struct of arrays (SOA)


Normally a batch of vectors is an array of structs (AOS):

    x y z x y z x y z x y z ...

Vector load instructions (in SSE) require contiguous data.
Store a batch of vectors as a struct of arrays (SOA):

    x x x x ... y y y y ... z z z z ...
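In C++ terms, the two layouts might be declared as follows (a sketch; the batch size and type names are illustrative):

enum { BATCH = 1024 };

// AOS: a batch is Vec3AOS points[BATCH]; consecutive x values
// are 12 bytes apart, so they cannot be fetched with one vector load.
struct Vec3AOS { float x, y, z; };

// SOA: one struct holds the whole batch; x[i..i+3] are contiguous,
// so a single aligned SSE load (movaps) fetches four points' x components.
struct Vec3BatchSOA {
    float x[BATCH];
    float y[BATCH];
    float z[BATCH];
};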

Masking / blending
Use a mask to avoid clobbering components of registers used by the other branch.

No masking in SSE.

Use variable blend in SSE4:

blend(a, b, mask) { return (a & mask) | (b & ~mask); }

No need to blend after each instruction:
    blend at basic block boundaries (at phi nodes in SSA).
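For example, a minimal sketch using SSE4.1 intrinsics, with a pre-SSE4 fallback built from the and/or identity above:

#include <smmintrin.h>  // SSE4.1

// Select thenVal in lanes where mask is set, elseVal elsewhere.
inline __m128 blend4(__m128 elseVal, __m128 thenVal, __m128 mask) {
    return _mm_blendv_ps(elseVal, thenVal, mask);
}

// Pre-SSE4 equivalent: (then & mask) | (else & ~mask).
inline __m128 blend4_sse2(__m128 elseVal, __m128 thenVal, __m128 mask) {
    return _mm_or_ps(_mm_and_ps(mask, thenVal),
                     _mm_andnot_ps(mask, elseVal));
}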

Vectorization: recent work


ispc: Intel SPMD Program Compiler (Matt Pharr)
    Beyond Programmable Shading course, SIGGRAPH 2011
    Open source: ispc.github.com
Whole-function vectorization in AnySL (Karrenberg et al.)
    Code Generation and Optimization 2011

Film shading on GPUs


Previous work:
    LightSpeed (Ragan-Kelley et al., SIGGRAPH 2007)
    RenderAnts (Zhou et al., SIGGRAPH Asia 2009)
Code generation is easier now (thanks to CUDA and OpenCL):
    PTX
    AMD IL
    LLVM and Clang

GPU code generation with LLVM


NVIDIA's LLVM-to-PTX code generator (Grover)
    Not to be confused with the PTX-to-LLVM front end (PLANG)
Incomplete PTX support in llvm-trunk (Chiou)
    Google Summer of Code project (Holewinski)
Experimental PTX back end for AnySL (Rhodin)
LLVM to AMD IL (Villmow)

Issues: GPU code generation


Film shaders interoperate with the renderer:
    File I/O: textures, point clouds, etc. (out of core)
    Shader plugins (DSOs)
    Sampling, ray tracing
Answer: multi-pass partitioning (Riffel et al., Graphics Hardware 2004)

Partitioning
Nf = faceforward(normalize(N), I);
Ci = Os * Cs * (Ka*ambient() + Kd*diffuse(Nf));

[Dataflow graph: normalize feeds faceforward, which feeds diffuse; the ambient and diffuse results are scaled by Ka and Kd, added, and scaled by Os and Cs to produce Ci.]

Multi-pass partitioning for CPU


Synchronize for GPU calls, uniform operations, derivatives.
Does not require hardware threads or locks:
    a thread yields by returning (to a scheduler).
Intermediate data is stored in a cactus stack (Cilk) or in continuation closures (CPS).

Data management and scheduling is a key problem (Budge et al., Eurographics 2009).
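A minimal C++ sketch of the idea (all names illustrative, not from the talk): each pass is a shader segment that runs to the next synchronization point and then returns, and the scheduler runs each pass over every batch before moving on, so synchronized work can be issued once for all batches.

#include <vector>

struct Batch { /* intermediate shader state, e.g. a cactus-stack frame */ };
typedef void (*Pass)(Batch&);  // one shader segment between sync points

void RunPasses(std::vector<Batch>& batches, const std::vector<Pass>& passes)
{
    // Pass boundaries mark GPU calls, uniform operations, or derivatives;
    // "yielding" is simply the pass function returning to this loop.
    for (size_t p = 0; p < passes.size(); ++p)
        for (size_t b = 0; b < batches.size(); ++b)
            passes[p](batches[b]);
}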

Issues: summary
CPU code generation (perhaps JIT)
Vectorization
GPU code generation
Multi-pass partitioning

Introduction to LLVM
Mid-level intermediate representation (IR).
High-level types: structs, arrays, vectors, functions.
Control-flow graph: basic blocks with branches.
Many modular analysis and optimization passes.
Code generation for x86, x64, ARM, ...
Just-in-time (JIT) compiler too.

Example: from C to LLVM IR


float sqrt(float f) {
    return (f > 0.0f) ? sqrtf(f) : 0.0f;
}

define float @sqrt(float %f) {
entry:
    %0 = fcmp ogt float %f, 0.0
    br i1 %0, label %bb1, label %bb2
bb1:
    %1 = call float @sqrtf(float %f)
    ret float %1
bb2:
    ret float 0.0
}

Example: from C to LLVM IR


void foo(int x, int y) {
    int z = x + y;
    ...
}

define void @foo(i32 %x, i32 %y) {
    %z = alloca i32
    %1 = add i32 %y, %x
    store i32 %1, i32* %z
    ...
}

Writing a simple code generator


class BinaryExp : public Exp {
    char m_operator;
    Exp* m_left;
    Exp* m_right;

    virtual llvm::Value* Codegen(llvm::IRBuilder* builder);
};

Writing a simple code generator


llvm::Value* BinaryExp::Codegen(llvm::IRBuilder* builder) {
    llvm::Value* L = m_left->Codegen(builder);
    llvm::Value* R = m_right->Codegen(builder);
    switch (m_operator) {
        case '+': return builder->CreateFAdd(L, R);
        case '-': return builder->CreateFSub(L, R);
        case '*': return builder->CreateFMul(L, R);
        ...
    }
}
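For context, a hedged sketch of the setup around such a Codegen call, using LLVM's C++ API (header paths vary by LLVM version; the real builder class is the template llvm::IRBuilder<>, which the slides abbreviate, and root is an assumed expression-tree root):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

llvm::LLVMContext context;
llvm::Module module("shader", context);
llvm::IRBuilder<> builder(context);

// Create a function "float shade()" and an entry block to emit into.
llvm::FunctionType* fnType =
    llvm::FunctionType::get(builder.getFloatTy(), /*isVarArg=*/false);
llvm::Function* fn = llvm::Function::Create(
    fnType, llvm::Function::ExternalLinkage, "shade", &module);
builder.SetInsertPoint(llvm::BasicBlock::Create(context, "entry", fn));

// Generate code for the expression tree and return its value.
llvm::Value* result = root->Codegen(&builder);
builder.CreateRet(result);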

Advantages of LLVM
Well-designed intermediate representation (IR).
Wide range of optimizations (configurable).
JIT code generation.
Interoperability.

Interoperability
Shaders can call out to the renderer via the C ABI.
We can inline library code into compiled shaders:
    compile C++ to LLVM IR with Clang.
This greatly simplifies code generation.

Weaknesses of LLVM
No automatic vectorization.
Poor support for vector-oriented code generation:
    no predication;
    few vector instructions (must resort to SSE/AVX intrinsics).

LLVM resources
www.llvm.org/docs
    Language Reference Manual
    Getting Started Guide
    LLVM Tutorial (section 3)
Relevant open source projects:
    ispc.github.com
    github.com/MarkLeone/PostHaste

Questions?
Mark Leone mleone@wetafx.co.nz
