This is where we left off: http://c0de517e.blogspot.com/2011/09/rendering-101.html


In the middle, Nvidia Fermi GPU Die



In the background image and in the scheme, the Pentium 4 CPU. Compare its complicated design, made of many different logic blocks, with the Fermi GPU in slide 5. See e.g. http://www.azillionmonkeys.com/qed/cpujihad.shtml


The code is written in a “dumb” way here. We could have written out[i] = (in[i]*10+5)*in[i], but we wanted to write it in a way that’s closer to how things are translated into assembly and executed.
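As a rough illustration of the two styles, here is a sketch in Python (not the code from the slide). The function names and the temporaries r0 and r1 are made up for the example; the stepwise version spells out one operation per line, roughly as a load/multiply/add/multiply/store sequence.

```python
def kernel_compact(inp):
    # One-expression form: out[i] = (in[i]*10+5)*in[i]
    return [(x * 10.0 + 5.0) * x for x in inp]


def kernel_stepwise(inp):
    # Same computation, one operation per line, mimicking how it
    # might be broken down into individual instructions.
    out = [0.0] * len(inp)
    for i in range(len(inp)):
        r0 = inp[i]      # load the input element
        r1 = r0 * 10.0   # multiply
        r1 = r1 + 5.0    # add
        r1 = r1 * r0     # multiply by the input again
        out[i] = r1      # store the result
    return out
```

Both functions return the same values; only the stepwise one makes the sequence of operations visible.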



It’s interesting to notice the similarities between the manual unrolling and prefetching used in traditional CPU optimization, the fibers, continuations and async I/O used in modern servers, and GPGPU architecture. Note about the “fixed vectors”: SIMD instructions in the language (e.g. math on float4s) do not translate into SIMD execution in HW (nor are they needed for HW SIMD); the HW SIMD width may be very different from the language SIMD width!
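To make the last point concrete, here is a small Python sketch of how a float4 multiply-add could run on hardware that scalarizes vector code. This is an assumption for illustration: not every GPU works this way, and GROUP_WIDTH and the function name are hypothetical. The language-level vector is 4 wide, while the hardware runs each scalar operation across a whole group of threads in lock-step.

```python
GROUP_WIDTH = 8  # hypothetical hardware SIMD width (threads in lock-step)


def madd_float4(a, b, c):
    """a, b, c hold one float4 (4-tuple) per thread of the group.

    Returns (results, issued), where issued counts the hardware-wide
    instructions needed for the whole group.
    """
    assert len(a) == len(b) == len(c) == GROUP_WIDTH
    out = [[0.0] * 4 for _ in range(GROUP_WIDTH)]
    issued = 0
    for comp in range(4):                 # x, y, z, w: one instruction each
        issued += 1
        for lane in range(GROUP_WIDTH):   # same operation on every lane
            out[lane][comp] = a[lane][comp] * b[lane][comp] + c[lane][comp]
    return out, issued
```

In this model the float4 costs four instructions regardless of the group width, and each instruction does GROUP_WIDTH multiply-adds.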


Siggraph 2008 - From Shader Code to a Teraflop: How Shader Cores Work – Kayvon Fatahalian, Stanford University http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf



Throw away what you learned in university. Algorithm complexity nowadays should be judged on memory accesses and coherency, and on how well the algorithm parallelizes. Binary trees are boring. [- dramatization] For a good introduction to Reyes, see http://www.amazon.com/Production-RenderingIan-Stephenson/dp/1852338210/ref=sr_1_1?ie=UTF8&qid=1318968111&sr=8-1




Moore’s law: it was originally about transistor count, and processors roughly managed to respect it. But CPUs are also merely respecting it in performance, which is odd: performance should increase faster, due to both the transistor count AND the CPU frequency. GPUs are following Moore’s law in transistor count but beating it (as should be expected) when it comes to performance, though only on heavily data-parallel tasks where all the code runs in parallel (Amdahl’s law is the limiting factor there).
What’s on the die (PC processors...)
8086...386 -- Mostly processing power: logic units.
486...Pentium2 -- Processing power and caches: a bit of cache, FPUs. Multiple pipelines.
Pentium3...Pentium4 -- Caches and scheduling logic: heavy instruction decode/reorder units, branch prediction, cache prediction. Longer pipelines.
Core2...i7 -- Multicore + big caches.
Future -- Back to “pure” processing power, ALUs on most of the die (and cache). Manycore, small decode stages (in-order, shared between units) and caches (shared between units), wide hardware and logical SIMD, lower power/flops ratio (GPUs, Cell...). Manycore (“GPU”) integrated with multicore (“CPU”), sharing a cache level or direct bus interconnection (single die or fast paths between units: Xenon/Xenos, PS3 PPU/SPU...).
Past:
http://www.tayloredge.com/museum/processor/processorhistory.html
http://www.cpu-world.com/CPUs/index.html
http://www.chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
http://www.thg.ru/cpu/19990809/onepage.html
http://www.cs.clemson.edu/~mark/330/p6.html
Future:
www.gpgpu.org/static/s2007/slides/02-gpu-architecture-overview-s07.pdf
s09.idav.ucdavis.edu/talks/02_kayvonf_gpuArchTalk09.pdf
bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
http://www.research.ibm.com/cell/
http://en.wikipedia.org/wiki/Cell_(microprocessor)
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i5-2600k-i5-2500k-and-core-i3-2100-tested
http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx
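Amdahl’s law, named above as the limiting factor for data-parallel speedups, is easy to play with numerically. The following is a minimal Python sketch; the function name and the sample numbers are only illustrative.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Upper bound on the speedup when only part of the work parallelizes.

    parallel_fraction: share of the run time that can be spread over units.
    n_units: number of parallel execution units.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)
```

With 512 units, code that is 95% parallel stays below a 20x speedup, because the serial 5% dominates; only fully parallel code gets close to 512x.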



How much latency? On the Xbox 360 GPU, from the start of a shader (task) to the end (write into the framebuffer) there are roughly 1000 GPU cycles of latency.


Just a few examples! There are many fast sequential sorts (e.g. Radix and the other “distribution” sorts). Many are even faster if the sequence to sort has certain properties (e.g. Uniform: Flash, Almost sorted: Smooth) or if some given behaviour is desirable (e.g. Cache efficient: Funnel, Few writes: Cycle, Extract LIS: Patience, Online: Library), and most of them can be parallelized (not only the MergeSort). Hybrids are often useful too (e.g. Radix sort and parallel merge).
www.cse.ohio-state.edu/~kerwin/MPIGpu.pdf
theory.csail.mit.edu/classes/6.895/fall03/projects/final/youn.ppt
http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/algoen.htm
http://elliottback.com/wp/sorting-in-linear-time/
http://en.wikipedia.org/wiki/Sorting_algorithm
http://scholar.google.ca/scholar?hl=en&lr=&q=related:W94guGpq0ZIJ:scholar.google.com/&um=1&ie=UTF8&ei=hiWvTeuCMamw0QGStaCiCw&sa=X&oi=science_links&ct=slrelated&resnum=1&ved=0CCQQzwIwAA
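As one example of a “distribution” sort, here is a minimal sequential LSD radix sort in Python. It is only a sketch: it assumes non-negative integer keys below 2**key_bits, and the parameter names are made up. Each pass is a histogram, a prefix sum and a stable scatter, which are the steps that parallel versions typically distribute.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    n_buckets = 1 << bits_per_pass
    mask = n_buckets - 1
    for shift in range(0, key_bits, bits_per_pass):
        # 1. Histogram: count how many keys fall into each bucket.
        counts = [0] * n_buckets
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix sum: first output slot of each bucket.
        offsets = [0] * n_buckets
        total = 0
        for d in range(n_buckets):
            offsets[d] = total
            total += counts[d]
        # 3. Stable scatter: keys keep their relative order per bucket.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

Note that the run time depends on the key width and the number of keys, not on comparisons between keys.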


http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1319062601&sr=8-2 http://www.amazon.com/GPU-Computing-Gems-EmeraldApplications/dp/0123849888/ref=pd_bxgy_b_img_c http://developer.nvidia.com/category/zone/cuda-zone http://gpgpu.org/


Also, you might want to start experimenting with some code. SlimDX is a good C# (.NET) wrapper for DirectX 9/10/11 and comes with some simple examples: http://slimdx.org/. Note that the various APIs carry different amounts of legacy. It’s probably not a bad idea to start with DX11: it’s more complicated than, for example, old immediate mode OpenGL, but it’s closer to reality. Immediate mode refers to a way of drawing primitives that places the vertex data directly in the command buffer (instead of it being a resource that is created upfront and then bound). It’s slow and mostly deprecated.
States (switches for the fixed function parts of the pipeline), resources and shader constants are very different concepts in DX9, where you can set/get a single render state or a single shader constant. This is expensive: it leads to lots of commands being sent and requires careful practices to avoid generating too many redundant state sets. DX10-11 manage everything as resources: buffers that the CPU can modify, transfer to the GPU and then set, and that contain data, textures, groups of states and shader constants.
Some GPU implementations are very different from what I will sketch from here on. For example the PowerVR GPU, which is used in many mobile platforms, uses tile based deferred rendering, which is pretty different from what I will discuss in terms of pipelines and rasterization, and pretty distant from the logical stages that the DirectX API exposes (even if you might find a DirectX, or equivalently OpenGL, implementation for such platforms): http://en.wikipedia.org/wiki/PowerVR
Remember that these pipeline stages are only a logical view of the API, with some logical view of a typical implementation, not a low-level physical view. Also, we won’t delve deep; for further reference see: http://c0de517e.blogspot.com/2008/04/gpu-part-1.html http://c0de517e.blogspot.com/2008/04/how-gpu-works-part-2.html
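One common practice for avoiding redundant state sets in a DX9-style API is to shadow the current state on the CPU and skip the calls that would not change anything. The Python sketch below only illustrates the idea: StateCache and set_state are hypothetical names, the state names in the usage are placeholders, and the submit callback stands in for the real API call.

```python
class StateCache:
    """Filters out state sets that would not change the current value."""

    def __init__(self, submit):
        self._submit = submit   # callable(name, value): stand-in for the API call
        self._current = {}      # CPU-side shadow copy of the states set so far

    def set_state(self, name, value):
        # Skip the command entirely when the state already has this value.
        if name in self._current and self._current[name] == value:
            return False
        self._current[name] = value
        self._submit(name, value)
        return True
```

A possible use, collecting the commands that would actually be sent:

```python
sent = []
cache = StateCache(lambda name, value: sent.append((name, value)))
cache.set_state("ZENABLE", True)
cache.set_state("ZENABLE", True)   # redundant, filtered out
cache.set_state("ZENABLE", False)
```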


Dump of GPU commands, captured with Windows DirectX PIX (comes with the DirectX SDK). There are a few other API capture applications; the best ones for DirectX 9 are PIX and Intel GPA (http://software.intel.com/en-us/articles/intel-gpa/). For DirectX 11 you can use PIX, but also Nvidia’s Parallel Nsight (http://developer.nvidia.com/nvidia-parallel-nsight) and AMD GPU PerfStudio (http://developer.amd.com/tools/PerfStudio/Pages/default.aspx). To peek into games, the old DXExplorer (http://www.sandboxsoftware.net/dxexplorer/) and 3D Ripper (http://www.deep-shadows.com/hax/3DRipperDX.htm) are fun too. Similar tools exist for OpenGL, for example gDEBugger: http://www.gremedy.com/


Data fetching usually has a simple linear cache associated with it (pre-transform cache). Indexed reading is useful as triangles often share the same vertices, so we don’t want to replicate the same data over and over in the source buffers; by using indexing we only replicate indices. Moreover, GPUs store a few vertices out of the vertex shader, and if you ask to shade the same vertex again they can reuse the output without re-executing the shader (post-transform cache). Usually indexed triangle lists are the best way to render objects on modern GPUs, and there are ways to optimize the order of the triangles in an object in order to maximize post-transform cache use.
Usually using interleaved inputs (a single buffer, an array of structures) is the best choice, but we might want to split the data into different buffers, for example if we need to use the same data in multiple draws but each draw needs only some attributes, or if part of the data needs to be modified by the CPU, etc.
As for all the units, don’t confuse the logical representation with the physical one. For example, on the Xbox 360 Xenos GPU the vertex data fetching is done by the vertex shader: vertex fetching instructions are added at the beginning of a vertex shader, and the fetching unit is only a kind of memory interface, like the texture fetching unit, available both to vertex and pixel shaders (as the shader engine is unified on that GPU). The only unit that “generates vertices” is the unit that handles indexing and post-transform caching, and vertices are only indices that are an input to the vertex shader.
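The effect of the post-transform cache can be estimated with a tiny simulation. The Python sketch below assumes a FIFO cache of 16 entries; both the replacement policy and the size are assumptions, as real hardware varies. It counts how many times the vertex shader would actually run for a given index buffer.

```python
from collections import deque


def vertex_shader_runs(indices, cache_size=16):
    """Count vertex shader executions for an indexed draw, FIFO cache model."""
    cache = deque(maxlen=cache_size)  # the oldest entry drops out when full
    runs = 0
    for index in indices:
        if index not in cache:
            runs += 1                 # cache miss: the vertex must be shaded
            cache.append(index)
    return runs
```

For a quad drawn as two triangles, [0, 1, 2, 2, 1, 3], the model shades 4 vertices instead of the 6 a non-indexed list would need. Dividing the result by the triangle count gives the average cache miss ratio that triangle order optimizers try to minimize.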



Nvidia FX Composer is a good tool to experiment with shaders without having to care about how to draw primitives and bind resources: http://developer.nvidia.com/fxcomposer The older 1.8 version of FX Composer is actually a better starting point than the newer .NET based one. The DirectX shader language is called “HLSL”. It’s very close to Nvidia’s Cg, which can be used both from OpenGL and DirectX (and works on non-Nvidia cards too). OpenGL uses GLSL, which is a bit more confusing and in general messier. HLSL and Cg support an effect framework (FX) that allows specifying not only shader code but also some GPU states, which makes it easier to data-drive most of what is needed to render a given object.


Note: in this slide we are really collapsing multiple related fixed units. Clipping/culling, primitive setup/assembly, interpolation and so on are different stages, each with its own limits (throughput) and buffers. There is a lot of documentation on how to write rasterizers and on rasterization algorithms. I’ll link here two software based approaches, a simple and mostly illustrative one: http://www.devmaster.net/codespotlight/show.php?id=17 and a more advanced one that was devised for the (now defunct) Larrabee GPU, which had a programmable rasterizer: http://software.intel.com/en-us/articles/rasterization-on-larrabee/ Often rasterization happens in more than a single step. There can be, for example, a coarse rasterizer that generates bigger tiles (e.g. 8x8 pixels) and does some early rejection (see next slides), then passes the results to a fine rasterizer that generates 2x2 quads.
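As a rough idea of what the fine rasterization step does, here is a minimal edge-function rasterizer in Python that emits 2x2 quads with a coverage mask. It is a sketch under simplifying assumptions: no coarse tile step, no fill rule for pixels exactly on a shared edge, no clipping beyond the screen bounds, and sampling at pixel centers. The function names are made up.

```python
import math


def edge(a, b, p):
    # Twice the signed area of triangle (a, b, p); its sign tells on
    # which side of the edge a->b the point p lies.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_quads(v0, v1, v2, width, height):
    """Return [((qx, qy), mask)] for every 2x2 quad the triangle touches.

    mask lists coverage for pixels (0,0), (1,0), (0,1), (1,1) of the quad.
    """
    area = edge(v0, v1, v2)
    if area == 0:
        return []                      # degenerate triangle
    if area < 0:
        v1, v2 = v2, v1                # force a consistent winding
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    # Bounding box clamped to the screen, start snapped to even coordinates
    # so that quads are always aligned to a 2x2 grid.
    min_x = max(math.floor(min(xs)), 0) & ~1
    min_y = max(math.floor(min(ys)), 0) & ~1
    max_x = min(math.ceil(max(xs)), width - 1)
    max_y = min(math.ceil(max(ys)), height - 1)
    quads = []
    for qy in range(min_y, max_y + 1, 2):
        for qx in range(min_x, max_x + 1, 2):
            mask = []
            for dy in (0, 1):
                for dx in (0, 1):
                    x, y = qx + dx, qy + dy
                    p = (x + 0.5, y + 0.5)   # sample at the pixel center
                    mask.append(x < width and y < height
                                and edge(v0, v1, p) >= 0
                                and edge(v1, v2, p) >= 0
                                and edge(v2, v0, p) >= 0)
            if any(mask):
                quads.append(((qx, qy), mask))
    return quads
```

Quads are emitted whole even when only some of their pixels are covered, so the number of shaded lanes (four per quad) can exceed the number of covered pixels.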



Quad waste is a big problem for dense meshes, tessellation and displacement. Mesh level of detail is often more useful to avoid generating too much waste in the shaded quads than to reduce the number of vertices and their associated load in the vertex shading stages of the pipeline. It’s a problem that needs to be solved for the future of GPU graphics. http://graphics.stanford.edu/papers/fragmerging/shade_sig10.pdf It’s a long discussion, and in practice very few shaders use differentials to integrate discontinuities (we should though, or we should try to avoid discontinuities and high frequencies in shaders), but the differentials are used by default when fetching textures to pre-filter them (mipmaps, anisotropic filtering), and this alone makes a HUGE difference. Some pointers for further reading: http://en.wikipedia.org/wiki/Spatial_anti-aliasing http://en.wikipedia.org/wiki/Multisample_anti-aliasing http://www.amazon.com/Texturing-Modeling-Second-ProceduralApproach/dp/0122287304 http://www.amazon.com/Real-Time-Rendering-Third-Tomas-AkenineMoller/dp/1568814240/ref=sr_1_1?s=books&ie=UTF8&qid=1318980645&sr=1-1
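The way differentials drive texture pre-filtering can be sketched in a few lines of Python. The code below takes the texture coordinates of the four pixels of a 2x2 quad, builds finite differences from them and derives an isotropic mip level. It follows the commonly described textbook formula; actual hardware may approximate it differently, and the function name and quad ordering are assumptions.

```python
import math


def mip_level(uv_quad, tex_width, tex_height):
    """Mip level for a 2x2 quad from finite differences of its UVs.

    uv_quad: (u, v) for pixels (x, y), (x+1, y), (x, y+1), (x+1, y+1).
    """
    (u00, v00), (u10, v10), (u01, v01), _ = uv_quad
    # Differences between neighbouring pixels, measured in texels.
    dudx = (u10 - u00) * tex_width
    dvdx = (v10 - v00) * tex_height
    dudy = (u01 - u00) * tex_width
    dvdy = (v01 - v00) * tex_height
    # Length of the longest footprint axis, in texels per pixel.
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    if rho <= 0.0:
        return 0.0
    return max(0.0, math.log2(rho))    # never go below the base level
```

When neighbouring pixels are two texels apart the function picks level 1, the half-resolution mip, so that each pixel samples roughly one pre-filtered texel. This is also why pixels are shaded in quads: the differences need neighbours to exist.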



I won’t go into any details of what a stencil buffer is and how early-stencil can be used, or the many ways a depth buffer can be used and configured... But early rejections are fundamental for GPU performance and they can be used for many different optimizations. It’s fundamental to understand how they work on a specific GPU architecture, as they often work only in some given state configurations and are easily invalidated. Some data here: http://www.gpgpu.org/w/index.php/Code_Examples#Early-z Also, each GPU vendor calls these optimizations in a different way: Hierarchical Z (HiZ), HyperZ, Early-Z, Zcull and so on... The actual representation that permits this early test varies with the hardware, and there can even be multiple levels of rejection. Most of them are inspired by Hierarchical Occlusion Maps: http://www.cs.unc.edu/~zhangh/hom.html but the literature is rich, e.g. http://citeseerx.ist.psu.edu/showciting;jsessionid=921596071E227DDD76FAFE435BCBFC89?cid=78472
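The general idea behind these hierarchical schemes can be illustrated with a toy model in Python. It assumes a “less” depth test and keeps the farthest depth stored in each 8x8 tile; a primitive whose nearest depth in a tile is not closer than that value cannot pass anywhere in the tile. The class name, the tile size and the brute-force update are all assumptions made for the sketch, real hardware representations differ.

```python
class HiZBuffer:
    """Depth buffer with a coarse per-tile 'farthest depth' for early rejection."""

    def __init__(self, width, height, tile=8, far=1.0):
        self.width, self.height, self.tile = width, height, tile
        self.depth = [[far] * width for _ in range(height)]
        tiles_x = (width + tile - 1) // tile
        tiles_y = (height + tile - 1) // tile
        self.tile_max = [[far] * tiles_x for _ in range(tiles_y)]

    def tile_rejected(self, tx, ty, min_z):
        # Every stored depth in the tile is <= tile_max, so if the nearest
        # incoming depth is not smaller, no pixel can pass a "less" test.
        return min_z >= self.tile_max[ty][tx]

    def write(self, x, y, z):
        # Fine-grained "less" depth test for a single pixel.
        if z >= self.depth[y][x]:
            return False
        self.depth[y][x] = z
        # Brute-force refresh of the tile's farthest depth (sketch only).
        tx, ty = x // self.tile, y // self.tile
        x0, y0 = tx * self.tile, ty * self.tile
        self.tile_max[ty][tx] = max(
            self.depth[py][px]
            for py in range(y0, min(y0 + self.tile, self.height))
            for px in range(x0, min(x0 + self.tile, self.width)))
        return True
```

Once a tile is fully covered by near geometry, anything farther can be discarded per tile, before any pixel is shaded. The model also hints at why the optimization is fragile: it relies on the depth test direction staying fixed and on depth not being changed by the shader.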




Example image and z-buffer from a PIX capture of a retail version of Battlefield Bad Company 2.

