
Copperhead: A Python-like Data Parallel Language & Compiler
Bryan Catanzaro, UC Berkeley
Michael Garland, NVIDIA Research
Kurt Keutzer, UC Berkeley

Universal Parallel Computing Research Center
University of California, Berkeley

Intro to CUDA

 Overview
 Multicore/Manycore
 SIMD
 Programming with millions of threads


The CUDA Programming Model
 CUDA is a recent programming model, designed for
   Manycore architectures
   Wide SIMD parallelism
   Scalability
 CUDA provides:
   A thread abstraction to deal with SIMD
   Synchronization & data sharing between small groups of threads
 CUDA programs are written in C + extensions
 OpenCL is inspired by CUDA, but HW & SW vendor neutral
   Programming model essentially identical

Multicore and Manycore
 Multicore: yoke of oxen
   Each core optimized for executing a single thread
 Manycore: flock of chickens
   Cores optimized for aggregate throughput, deemphasizing individual performance

Multicore & Manycore, cont.

Specifications           Core i7 960                       GTX285
Processing Elements      4 cores, 4-way SIMD @ 3.2 GHz     30 cores, 8-way SIMD @ 1.5 GHz
Resident Threads (max)   4 cores, 2 threads, 4-width       30 cores, 32 SIMD vectors, 32-width
                         SIMD: 32 strands                  SIMD: 30720 strands
SP GFLOP/s               102                               1080
Memory Bandwidth         25.6 GB/s                         159 GB/s
Register File            -                                 1.875 MB
Local Store              -                                 480 kB

SIMD: Neglected Parallelism
 It is difficult for a compiler to exploit SIMD
   How do you deal with sparse data & branches?
   Many languages (like C) are difficult to vectorize
   Fortran is somewhat better
 Most common solution:
   Either forget about SIMD
    ▪ Pray the autovectorizer likes you
   Or instantiate intrinsics (assembly language)
    ▪ Requires a new code version for every SIMD extension

What to do with SIMD?
[figure: 4-way SIMD vs. 16-way SIMD]
 Neglecting SIMD in the future will be more expensive
   AVX: 8-way SIMD, Larrabee: 16-way SIMD
 This problem composes with thread level parallelism

CUDA
 CUDA addresses this problem by abstracting both SIMD and task parallelism into threads
 The programmer writes a serial, scalar thread with the intention of launching thousands of threads
 Being able to launch 1 Million threads changes the parallelism problem
   It's often easier to find 1 Million threads than 32: just look at your data & launch a thread per element
 CUDA is designed for Data Parallelism
   Not coincidentally, data parallelism is the only way for most applications to scale to 1000(+) way parallelism

Hello World

CUDA Summary
 CUDA is a programming model for manycore processors
 It abstracts SIMD, making it easy to use wide SIMD vectors
 It provides good performance on today's GPUs
 In the near future, CUDA-like approaches will map well to many processors & GPUs
 CUDA encourages SIMD-friendly, highly scalable algorithm design and implementation

A Parallel Scripting Language
 What is a scripting language?
   Lots of opinions on this
   I'm using an informal definition:
    ▪ A language where performance is happily traded for productivity
   Weak performance requirement of scalability
    ▪ "My code should run faster tomorrow"
 What is the analog of today's scripting languages for manycore?

Data Parallelism
 Assertion: Scaling to 1000 cores requires data parallelism
 Accordingly, manycore scripting languages will be data parallel
   They should allow the programmer to express data parallelism naturally
   They should compose and transform the parallelism to fit target platforms

Warning: Evolving Project
 Copperhead is still in embryo
   We can compile a few small programs
   Lots more work to be done in both language definition and code generation
 Feedback is encouraged

Copperhead = Cu + Python
 Copperhead is a subset of Python, designed for data parallelism
 Why Python?
   Extant, well accepted high level scripting language
    ▪ Free simulator(!!)
   Already understands things like map and reduce (see the sketch below)
   Comes with a parser & lexer
 The current Copperhead compiler takes a subset of Python and produces CUDA code
 Copperhead is not CUDA specific, but the current compiler is
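As a concrete reading of the "free simulator" point: the data-parallel patterns Copperhead cares about can be written and checked with nothing but stock Python built-ins, run serially by the interpreter. A minimal illustration (plain Python, not Copperhead; the sample values are invented):

from functools import reduce

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]
a = 2.0

# Elementwise combination: the shape of Copperhead's data-parallel map
zs = list(map(lambda xi, yi: a * xi + yi, xs, ys))   # [12.0, 24.0, 36.0, 48.0]

# Reduction: the shape of Copperhead's reduce primitive
total = reduce(lambda u, v: u + v, zs, 0.0)          # 120.0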

Copperhead is not Pure Python
 Copperhead is not for arbitrary Python code
   Most features of Python are unsupported
 Copperhead is compiled, not interpreted
 Connecting Python code & Copperhead code will require binding the programs together, similar to Python-C interaction
 Copperhead is statically typed
[figure: Copperhead shown as a subset of Python]

Saxpy: Hello world

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

 Some things to notice:
   Types are implicit
    ▪ The Copperhead compiler uses a Hindley-Milner type system with typeclasses, similar to Haskell
    ▪ Typeclasses are fully resolved in CUDA via C++ templates
   Functional programming:
    ▪ map, lambda (or equivalent in list comprehensions)
    ▪ you can pass functions around to other functions
    ▪ Closure: the variable 'a' is free in the lambda function, but bound to the 'a' in its enclosing scope (checked in the snippet below)
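Because the definition above is ordinary Python, the closure behaviour can be checked directly in the stock interpreter. A small, hedged check (plain Python, not the Copperhead runtime; the sample inputs are invented):

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

# 'a' is free inside the lambda but bound to the 'a' of this particular call.
print(list(saxpy(2.0, [1.0, 2.0], [10.0, 20.0])))   # [12.0, 24.0]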

Type Inference, cont.

c = a + b        + : (Num0, Num0) -> Num0
fresh types A145, A207, A52 resolve by unification to Num52, Num52, Num52

 Copperhead includes function templates for intrinsics like add, subtract, map, scan, gather
 Expressions are mapped against templates
 Every variable starts out with a unique generic type, then types are resolved by union find on the abstract syntax tree (toy sketch below)
 Tuple and function types are also inferred
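To make the union-find step concrete, here is a deliberately tiny sketch of unifying the types in c = a + b. The helper names and the string encoding of type variables are invented for illustration; this is not the Copperhead compiler's code:

# Toy union-find over type names: 'A...' are fresh generic types,
# 'Num...' are the concrete types introduced by the + template.
parent = {}

def find(t):
    # Follow parent links to the representative type.
    while parent.get(t, t) != t:
        t = parent[t]
    return t

def union(t1, t2):
    r1, r2 = find(t1), find(t2)
    if r1 != r2:
        # Prefer the more specific (non-generic) representative.
        if r1.startswith('A'):
            parent[r1] = r2
        else:
            parent[r2] = r1

# Fresh generic types for a, b, c:
types = {'a': 'A145', 'b': 'A207', 'c': 'A52'}

# The template for + is (Num0, Num0) -> Num0: operands and result unify.
union(types['a'], 'Num52')
union(types['b'], 'Num52')
union(types['c'], 'Num52')

print({v: find(t) for v, t in types.items()})   # all resolve to 'Num52'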

or gather  No side effects allowed 18/28 . rotate.Data parallelism  Copperhead computations are organized around data parallel arrays  map performs a “forall” for each element in an array  Accesses must be local  Accessing non-local elements is done explicitly  shift.

rsegscan  exscan. rotate. exrsegscan  Shuffles:  shift. gather. scatter 19/28 . segscan. rscan. exsegscan.Copperhead primitives  map  reduce  Scans:  scan. exrscan.

Implementing Copperhead

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

Module(None,
  Stmt(
    Function(None, 'saxpy', ['a', 'x', 'y'], [], 0, None,
      Stmt(
        Return(
          CallFunc(Name('map'),
            [Lambda(['xi', 'yi'], [], 0,
               Add(Mul(Name('a'), Name('xi')), Name('yi'))),
             Name('x'), Name('y')],
            None, None))))))

 The Copperhead compiler is written in Python
 Python provides its own Abstract Syntax Tree
 Type inference, code generation, etc. is done by walking the AST (example below)
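The dump above comes from Python 2's compiler package; the same walk-the-tree approach works with the standard ast module in current interpreters. A small, hedged example (node class names differ from the slide, and the visitor is only an illustration of a compiler pass, not Copperhead's):

import ast

src = "def saxpy(a, x, y):\n    return map(lambda xi, yi: a*xi + yi, x, y)\n"
tree = ast.parse(src)
print(ast.dump(tree))   # the modern equivalent of the dump shown above

# A compiler pass is then a visitor over this tree, e.g. collecting
# every named function call it encounters:
class CallCollector(ast.NodeVisitor):
    def __init__(self):
        self.calls = []
    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            self.calls.append(node.func.id)
        self.generic_visit(node)

c = CallCollector()
c.visit(tree)
print(c.calls)   # ['map']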

Compiling Copperhead to CUDA
 Every Copperhead function creates at least one CUDA device function
 Top level Copperhead functions create a CUDA global function, which orchestrates the device function calls
   The global function takes care of allocating shared memory and returning data (storing it to DRAM)
 Global synchronizations are implemented through multiple phases
 All intermediate arrays & plumbing handled by Copperhead compiler

Saxpy Revisited

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

template<typename Num>
__device__ Num lambda0(Num xi, Num yi, Num a) {
  return ((a * xi) + yi);
}

template<typename Num>
__device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a,
                          uint _globalIndex, Num& _returnValueReg) {
  Num _xReg, _yReg;
  if (_globalIndex < x.length) _xReg = x[_globalIndex];
  if (_globalIndex < y.length) _yReg = y[_globalIndex];
  if (_globalIndex < x.length) _returnValueReg = lambda0<Num>(_xReg, _yReg, a);
}

template<typename Num>
__global__ void saxpy0(Array<Num> x, Array<Num> y, Num a,
                       Array<Num> _returnValue) {
  uint _blockMin = IMUL(blockDim.x, blockIdx.x);
  uint _blockMax = _blockMin + blockDim.x;
  uint _globalIndex = _blockMin + threadIdx.x;
  Num _returnValueReg;
  saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
  if (_globalIndex < _returnValue.length) _returnValue[_globalIndex] = _returnValueReg;
}

Phases  Reduction phase 0 phase 1  Scan phase 0 phase 1 phase 2 23/28 .

Copperhead to CUDA, cont.

B = reduce(map(A))
D = reduce(map(C))
[figure: the two computations scheduled into phase 0 and phase 1]

 Compiler schedules computations into phases
   Right now, this composition is done greedily (toy sketch below)
 Compiler tracks global and local availability of all variables and creates a phase boundary when necessary
 Fusing work into phases is important for performance
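The greedy idea can be caricatured in a few lines: elementwise operations keep accumulating into the current phase, and an operation whose result must become globally available closes the phase. This toy deliberately ignores that a reduce or scan itself spans multiple phases; it illustrates the scheduling decision only and is not the Copperhead scheduler:

# Operation categories are assumptions for this sketch.
ELEMENTWISE = {'map', 'gather', 'scatter'}
GLOBAL = {'reduce', 'scan', 'exscan'}

def schedule(ops):
    phases, current = [], []
    for op in ops:
        current.append(op)
        if op in GLOBAL:
            # Result only becomes globally available here: close the phase.
            phases.append(current)
            current = []
    if current:
        phases.append(current)
    return phases

# B = reduce(map(A)); D = reduce(map(C)) fuses into two phases:
print(schedule(['map', 'reduce', 'map', 'reduce']))
# [['map', 'reduce'], ['map', 'reduce']]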

cont.  Shared memory used only for communicating between threads  Caching unpredictable accesses (gather)  Accessing elements with a uniform stride (shift & rotate)  Each device function returns its intermediate results through registers 25/28 .Copperhead to CUDA.

Split

def split(input, value):                                                # phases
    flags = map(lambda a: 1 if a <= value else 0, input)                # 0
    notFlags = map(lambda a: not a, flags)                              # 0
    leftPositions = exscan(lambda a, b: a + b, 0, flags)                # 0-2
    rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)           # 0-2
    positions = map(lambda a, b, flag: a if flag else len(input) - 1 - b,
                    leftPositions, rightPositions, flags)               # 2
    return scatter(input, positions)                                    # 2

 This code is decomposed into 3 phases (shown as per-line annotations above; a pure-Python reference run follows below)
 Copperhead compiler takes care of intermediate variables
 Copperhead compiler uses shared memory for temporaries used in scans here
 Everything else is in registers
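Because split is legal Python apart from the primitives, it can be exercised end to end with throwaway, list-based stand-ins for exscan, exrscan, and scatter. A hedged reference run (the stand-ins encode assumed semantics, not the Copperhead primitives themselves; the input is invented):

def exscan(f, ident, xs):
    # Exclusive left-to-right scan.
    out, acc = [], ident
    for x in xs:
        out.append(acc)
        acc = f(acc, x)
    return out

def exrscan(f, ident, xs):
    # Exclusive scan running right-to-left.
    return exscan(f, ident, xs[::-1])[::-1]

def scatter(xs, positions):
    # Write xs[i] to out[positions[i]].
    out = [None] * len(xs)
    for x, p in zip(xs, positions):
        out[p] = x
    return out

def split(input, value):
    # Same code as on the slide, with list() around map for Python 3.
    flags = list(map(lambda a: 1 if a <= value else 0, input))
    notFlags = list(map(lambda a: not a, flags))
    leftPositions = exscan(lambda a, b: a + b, 0, flags)
    rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)
    positions = list(map(lambda a, b, flag: a if flag else len(input) - 1 - b,
                         leftPositions, rightPositions, flags))
    return scatter(input, positions)

print(split([5, 9, 2, 7, 1], 4))   # [2, 1, 5, 9, 7]: <= 4 packed left, > 4 packed right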

Interpreting to Copperhead
 If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend (sketch below)
 Fun project – see what kinds of information could be gleaned from the Python interpreter at runtime to figure out what should be compiled via Copperhead to a manycore chip
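One hedged sketch of what "harvesting dynamic type information" could look like: a decorator that records the concrete argument types seen at call time, which is the sort of evidence an interpreter could hand to a Copperhead-style backend. The decorator and attribute names are invented for illustration and are not an existing Copperhead interface:

import functools

def record_types(fn):
    # Record the concrete types of the positional arguments at every call.
    @functools.wraps(fn)
    def wrapper(*args):
        wrapper.observed_types.append(tuple(type(a).__name__ for a in args))
        return fn(*args)
    wrapper.observed_types = []
    return wrapper

@record_types
def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

saxpy(2.0, [1.0, 2.0], [3.0, 4.0])
print(saxpy.observed_types)   # [('float', 'list', 'list')]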

Future Work
 Finish support for the basics
 Compiler transformations
   Nested data parallelism flattening
    ▪ segmented scans
 Retargetability
   Thread Building Blocks/OpenMP/OpenCL
 Bridge Python and Copperhead
 Implement real algorithms with Copperhead
   Vision/Machine Learning, etc.