You are on page 1of 58


So I found this batman image on the net. I had to put it somewhere in the presenta7on. Here it is. Nvidia has a page on its cuda zone showing applica7ons and speedups, its a bit silly and its not so adver7zed anymore nowadays, but s7ll useful as an introduc7on hCp:// Intels Knights Corner news hCp:// 2010/20100531comp.htm hCp://

A wild and not even a par7cularly crea7ve example... More computa7on could enable en7rely novel scenarios... The past couple of Unreal engine demos were architected to show some of what it could be possible

I like the idea of mo7ongraphs hCp:// Also, imperfect shadowmaps are cool hCp:// The two images top right are for megameshes (Lionhead) and megatextures (ID) Fight night round 4 and Champion are the rst games using EAs Real AI, which is trained from examples

Naughy dog famously hand op7mizes SPU kernels. How many games do pack so much computa7on, even today, on SPUs? And s7ll, preCy uniform workloads, on a processor that can handle non-uniform loads beCer than compute (lower SIMD width). Heart bokeh image taken from hCp://, I like how the woman is pushed out of the frame to show the bokeh, as a metaphor of how we did some eects early on 360 and ps3

* Data parallel is harder for more high level tasks, but its far from impossible. In fact, we s7ll rely on lots of data, just we let this data be combed through with mostly handmade serial logic Use learning, classica7on, probabilis7c, op7miza7on methods for games... I.E. in anima7on, we have data, go from blend trees to anima7on clips all down to work with individual poses, use anima7on data itself and its metrics to drive anima7on, not logic! hCp:// hCp:// Another avenue: simula7on, modeling, physics The last image is from hCp:// poetry-and-sculpture/ think about what can you do with all the data we already have

Usually also superscalar, mul7ple pipes, SIMD but for now we dont care

Ok, ok, the last batman

...but ooen is not, its surprising (but consider that ooen the compiler does not know if a given loop is executed ooen or not) how much manual unrolling (and prefetching on plaporms without strong cache predic7ors) in 7ght loops s7ll pays, allowing the compiler to then compute a good schedule without dependencies of the generated assembly.




First insight we learned


Again, same reasoning



Fibers and Corou7nes are i.e. used to hide I/O and network access in web servers (a popular example is node.js)


For a sequen7al CPU loop, all this work done *in the specied order*. The compiler may reschedule instruc7ons without breaking the logical order of opera7ons, and the OOO CPU (i.e. PC, but not Ps3 or 360) will re-order instruc7ons (well, microinstruc7ons) to get rid of dependencies, but the processing order is essen7ally preserved. By contrast, on the GPU, all you're saying is you want that whole array processed, you don't care about the order.


Of course there is other stu too, and never mistake a logical model for a physical one, but this is enough for everything well need


Vectors are usually even larger than the number of ALUs. Repeat the same instruc7on a number of 7mes to process an en7re vector (allows ALUs to be pipelined, i.e. a vector of 16 oat4 elements might be processed by four oat4 ALUs issuing 4 at a 7me, each ALU pipeline stage can compute a single oat per cycle and with four stages we achieve the oat4 processing with a throughput of one per cycle and a latency of four). We say threads here but they are more similar to lightweight bers. Also, confusingly, dierent GPU languages call the same concepts in dierent ways The more computa7onal threads we can pack in the registers, the more independent work well have to hide latencies! This lightweight threading mechanism is fundamental for performance with manycores, see i.e. hCp:// newsleCer_detail.php?num=4747&year=2011&tag=3 which seems a deriva7ve of the Tilera Tile64 hCp://


Illustra7on stole from Running Code at a Teraop: Overview of GPU Architecture, Kayvon Fatahalian, Siggraph 2009 Some GPUs do not support variable length loops, everything gets unrolled and inlined...


I.E. Pixars Reyes rendering algorithm had interpreted, very parallel shaders. The cost of the interpreter is zero when you repeat the same instruc7on over thousands of elements. Reyes shaders are essen7ally the same as the hardware shaders we have on modern GPUs Note that GPU shaders have some explicit SIMD instruc7ons in HLSL and similar languages, suppor7ng vectors of elements up to 4-wide. In prac7ce, many modern GPUs process only scalar code, turning vector instruc7on into scalars. This is because the SIMD parallelism that GPUs implement is much wider than the 4-wide instruc7ons of the language, coupling many (i.e. 16 or more) execu7on elements together in a big execu7on vector. This regardless of if each element is scalar or 4-way, some GPUs even have a 4+1 congura7on (one 4-way and one scalar instruc7on in parallel per each execu7on element). In general when we talk about GPU vectors we dont mean the actual vector instruc7ons in the shader code but the aggrega7on of many ALUs following the same code.




Skinning images from hCp:// If youve done rendering on the GPU, you might recognize that the pseudo-code looks nothing like a vertex shader, where you dont fetch data but its passed as the parameters of the main func7on of a kernel. The data is laid out in GPU memory and bound via streams, constants and other means. I didnt include a real vertex shader code example, even if it would not have been harder to read, because it might even be misleading. The way DirectX or OpenGL expose programmable func7onality to the user might be close to the hardware, but most ooen nowadays is not. I.E. shaders might indeed compile to include explicit fetches... Pixel shaders can kill their output, so to be pedan7c they can output one or zero elements. Also, pixel processing might be culled early, this is very important. The independence that I assert is not en7rely true. Pixel shaders need deriva7ve informa7on for all the variables, these are computed using nite dierences from the neighboring pixels which not only constrains them to always be processed in 2x2 pixel units (quads) but also establishes through the deriva7ves (ddx, ddy), a mean of communica7on (i.e. see Eric Penners Shader Amor7za7on using Pixel Quad Message Passing GPU Pro 2)


Well see later some more accurate pseudo-code for now lets s7ll pretend to work with a generic in-parallel



* At least in our representa7on of the GPU... Really compilers can decide to spill some of the local state onto the shared memory (well see later) and allow for a stack and so on, but that is quite a dierent thing... I.E. Nvidia Fermi has a limit of 64 temporary registers in kernel code (number of registers per core is higher, but it will be used for mul7ple contexts)





hCp:// hCp:// hCp:// hCp:// Tangen7ally related, but interes7ng hCp://



Rendering is a non-uniform workload, fully automated scheduling has its benets early-z early-stencil rejec7on, which performs an hardware, automated stream compac7on! Fundamental for rendering...


At rst its easy to complain about bank rules and cohalescing issues when trying to access shared and global memory... But we have to consider that the GPU oers the ability of accessing dierent memory loca7ons per SIMD element, on many of them on many cores... This is a rather impressive feat, on standard CPUs to load data into SIMD vectors you have to do it serially, one element at a 7me. Memory has addressing limita7ons, its a limit of the technology. SIMD communica7on via vo7ng: all, any, ballot Atomics: Add, Sub, Inc, Dec, And, Or, Xor, Exch, Min, Max, CompareAndSwap. Append (or) consume structured buer views (Dx11)


If we dont have control over scheduling, we could not be able to access core local memory, obviously, as we would not know which elements are scheduled where.


For convenience, in compute, indices can be generated on 1d, 2d or 3d grids ResetCoreMemory memory is not really- reset, but we cant rely on its contents, the ini7aliza7on is to be done in the Kernel and any output has to go into GLOBAL_MEM, so I created this pseudo-func7on to illustrate this in a C-like way.




In the picture, the rst Connec7on Machine, a 65536-processor supercomputer of the 80ies


Shared memory is not exposed to pixel and vertex shaders as it requires manual scheduling of elements on cores. Its s7ll used internally by the graphic card to store results of rasteriza7on and shared informa7on



Note how widening the number of outputs per element looks like unrolling the computa7on... Such scheduling decisions have impacts on latency hiding (switching contextes), the ability of using registers as cache (i.e. in this computa7on, as we average columns, going wide on columns will enable reusing fetches from registers), the ability of syncronizing dierent tasks in a single kernel (persistent threads) and so on...


Also, many libraries worth checking out. Thrust, Nvidia performance primi7ves, CuDPP, libCL etc Even more for tasks that are inherently parallel, linear algebra, imaging and so on





All the arrows here represent sums to the exis7ng values, the self part of the addi7on is not shown to not confuse with too many arrows hCp://


All the black arrows here represent sums to the exis7ng values, the self part of the addi7on is not shown, as for the slide before The red arrows mark element moves hCp://


The moderngpu website has lots of code, and it take a scan centric approach to parallel computa7on hCp:// hCp://


Blelloch paper references many applica7on domains: Ble93.pdf hCp:// RadixSortTR.pdf



hCp:// %20on%20Wide%20SIMD%20Many-Core%20Architectures.pdf



Images from the original paper: On-the-y Point Clouds through Histogram Pyramids. Gernot Ziegler, Art Tevs, Chris7an Theobalt, Hans-Peter Seidel hCp:// hCp:// hCp:// Marching cubes using HP hCp:// v27i8pp2028-2039.pdf.abstract.pdf


Good rendering hCps:// ac7on=ACachFile&do=get&target=08-GPUArchII.pdf


Some of the compromises, many dierences in the implementa7on See Debunking the 100x GPU vs CPU Myth for more details: h@p:// 2010/06/p451-lee.pdf


There are many languages that are trying to be both GPU and CPU targets, OpenCL has CPU backends, without considering CPU and GPUs on the same chip like AMD Fusion (see i.e. and languages that target the CPU for wide SIMD compu7ng like Intels opensource SPMD ISPC hCp:// hCp:// Throughput compu7ng is everywhere, not only for high performance calcula7ons but also servers, low power and so on. Amdhal for mul7cores: ieeecomputer08_amdahl_mul7core.pdf



Some bonus reads BASICS Nvidia CUDA programming guide hCp:// DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf N-Body Simula7on with CUDA from GPU Gems 3 hCp:// GPUGems3/gpugems3_ch31.html ADVANCED Ecient Primi7ves and Algorithms for Many-core architectures hCp:// Persistent threads programming hCp:// GTC2012/Presenta7onPDF/S0157-GTC2012-Persistent-Threads-Compu7ng.pdf More on PT and Task Parallelism hCp:// GTC2012/Presenta7onPDF/S0138-GTC2012-Parallelism-Primi7ves-Applica7ons.pdf Registers or shared memory? hCp:// SC08_Volkov_GPU.pdf Op7miza7on 7ps hCp:// cuda_tuning_bof_sc09_nal.pdf DATA STRUCTURES TONS of spa7al ones, very few general purpose NOTE: aCributes of parallel data structures: