6 Multi-core Experiments

P. Cockshott, Y. Gdura, P. Keir, A. Koliousis

4 Targets, 3 Translation Systems, 1 Problem
The Targets
• Intel SCC 48 core chip
• Intel Nehalem 8 core chip
• Intel Sandy Bridge 4 core chip
• IBM Cell 7 core chip

The Languages
• Lino – tiling language
• Glasgow Pascal Compiler (Vector Pascal)
• Glasgow Fortran Compiler (E#)

The Problem
• N-body gravitational.

4 CHIPS

The SCC

48 cores, each 32 bit, 533 MHz clock, crossbar network, no coherent cache

The Cell
64 bit PowerPC, 8 128 bit vector processors with private RAM, high speed ring comms, 3.2 GHz

Sandy Bridge
4 cores with 256 bit AVX vector instructions, 3.1 GHz, ring bus, coherent caches

Nehalem
8 cores in 2 chips, each core supports 128 bit vector operations (SSE2), 2.4 GHz, coherent caches

3 LANGUAGES

LINO
• Scripting language targeted at multi-cores *
• Topological process algebra
• Allows two dimensional composition of communicating processes
• Underlying processes can be any executable shell command
• Targets any shared memory Linux platform + SCC

* developed by Cockshott, Michaelson and Koliousis

Vector Pascal
• Invented by Turner * for vector super-computers
• Extends Pascal’s support for array operations to general parallel map operations over arrays
• Designed to make use of SIMD instruction sets and multi-core.
• Glasgow Pascal Compiler supports Vector Pascal extensions.

* T. R. Turner. Vector Pascal: a computer programming language for the FPS-164 array processor. 1987.

Fortran
• Invented by Backus
• Updated to include array programming features with High Performance Fortran and Fortran 90
• A clean subset, F, exists * in which the antiquated features are deleted but the array features are kept.
• The E# compiler (a pun on F), developed at Glasgow by Keir, implements this subset.

* M. Metcalf and J. Reid. The F Programming Language. Oxford University Press, 1996.

THE PROBLEM

The N body Problem
For 1024 bodies
Each time step
  For each body B in 1024
    Compute force on it from each other body
    From these derive partial acceleration
    Sum the partial accelerations
    Compute new velocity of B

  For each body B in 1024
    Compute new position
Complexity
Force calculation N planets, p cores

Inter core communication

The Reference C Version
for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    for (j = i + 1; j < nbodies; j++) {
        struct planet * b2 = &(bodies[j]);
        double dx = b->x - b2->x;
        double dy = b->y - b2->y;
        double dz = b->z - b2->z;
        double distance = sqrt(dx * dx + dy * dy + dz * dz);
        double mag = dt / (distance * distance * distance);
        /* each pair is visited once and both bodies are updated */
        b->vx -= dx * b2->mass * mag;
        b->vy -= dy * b2->mass * mag;
        b->vz -= dz * b2->mass * mag;
        b2->vx += dx * b->mass * mag;
        b2->vy += dy * b->mass * mag;
        b2->vz += dz * b->mass * mag;
    }
}

Note that this version has side effects, so successive iterations of the outer loop cannot run in parallel: the inner loop updates the velocities of both bodies in each pair.

Lino version for 4 cores
nwcorner = [./nbody >East <South];
swcorner = [./nbody >North <East];
scorner = [./starter4.sh >South <West];
corner = [cat >West <North];
passright = [./nbody >East <West];
passleft = [./nbody >West <East];
top = [nwcorner | passright | scorner];
bottom = [swcorner | passleft | corner];
main = top _ bottom;

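The four cells that run ./nbody (nwcorner, passright, swcorner, passleft) execute the nbody programme, a slightly modified version of the C nbody code. In each timestep, each of them computes new velocities and positions for ¼ of the planets, then broadcasts its updates round the loop.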
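The slides do not show how each nbody process picks its quarter of the planets or finds its neighbour in the loop. The sketch below shows one plausible scheme under stated assumptions: workers are numbered 0 to nworkers-1 in loop order, and the helper names owned_range and ring_next are hypothetical, not part of Lino or of the modified nbody code.

```c
#include <stddef.h>

/* Half-open range [lo, hi) of bodies owned by worker `rank`.
   Any remainder is spread over the lowest-numbered workers. */
void owned_range(size_t nbodies, size_t nworkers, size_t rank,
                 size_t *lo, size_t *hi)
{
    size_t base = nbodies / nworkers;
    size_t extra = nbodies % nworkers;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Worker that receives this worker's output on a one-way loop. */
size_t ring_next(size_t rank, size_t nworkers)
{
    return (rank + 1) % nworkers;
}
```

With 1024 bodies and 4 workers, worker 0 would own bodies 0 to 255 and worker 3 would own bodies 768 to 1023. On a one-way loop an update has to be forwarded nworkers-1 times before every other worker has seen it.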

Larger Lino versions

Pascal version – no explicit inner loop
pure function computevelocitychange(start:integer):coord;
{ declarations: M: pointer to mass vector, x: pointer to position matrix,
  di: displacement matrix, distance: vector of distances }
begin
  row := x^[iota[0],i];
  { Compute the displacement vector between each planet and planet i. }
  di := row[iota[0]] - x^;
  { Next compute the euclidean distances }
  xp := @di[1,1]; yp := @di[2,1]; zp := @di[3,1]; { point at the rows }
  distance := sqrt(xp^*xp^ + yp^*yp^ + zp^*zp^) + epsilon;
  mag := dt/(distance*distance*distance);
  { Row summation operator \+ builds x,y,z components of dv }
  changes.pos := \+ (M^*mag*di);
end

Pack this up in Pure Function Applied in Parallel

procedure radvance(dt:real);
var dv: array[1..n,1..1] of coord; { this is a column vector }
    i,j: integer;
  pure function computevelocitychange(i:integer; dt:real):coord;
  begin
    {--- do the computation on last slide}
    computevelocitychange := changes.pos;
  end;
begin
  { iota[0] is the 0th index vector. If the left hand side ... }
  dv := computevelocitychange(iota[0],dt); { can be evaluated in parallel }
  for i := 1 to N do { iterate on planets }
    for j := 1 to 3 do { iterate on dimensions }
      v^[j,i] := v^[j,i] + dv[i,1].pos[j]; { update velocities }
  x^ := x^ + v^*dt; { Finally update positions. }
end;

Equivalent Fortran kernel function
Elemental is equivalent to pure function in the Pascal
elemental function calc_accel_p(pchunk) result(accel)
  type(pchunk2d), intent(in) :: pchunk
  type(accel_chunk) :: accel
  real(kind=ki) :: dx, dy, dz, distSqr, distSixth, invDistCubed, s
  integer :: i, j
  accel%avec3 = vec3(0.0_ki, 0.0_ki, 0.0_ki)
  do i=1,size(pchunk%ivec4)
    do j=1,size(pchunk%jvec4)
      dx = pchunk%ivec4(i)%x - pchunk%jvec4(j)%x
      dy = pchunk%ivec4(i)%y - pchunk%jvec4(j)%y
      dz = pchunk%ivec4(i)%z - pchunk%jvec4(j)%z
      distSqr = dx*dx + dy*dy + dz*dz + EPS
      distSixth = distSqr * distSqr * distSqr
      invDistCubed = 1.0_ki / sqrt(distSixth)
      s = pchunk%jvec4(j)%w * invDistCubed
      accel%avec3(i)%x = accel%avec3(i)%x - dx * s
      accel%avec3(i)%y = accel%avec3(i)%y - dy * s
      accel%avec3(i)%z = accel%avec3(i)%z - dz * s
    end do
  end do
end function calc_accel_p

Invoked by writing
accels = calc_accel_p(pchunks)

Where
type(accel_chunk), dimension(size(pchunks)) :: accels

And
type, public :: accel_chunk
  type(vec3), dimension(CHUNK_SIZE) :: avec3
end type accel_chunk

Pascal compilation strategy

Virtual SIMD Machine (VSM) Model
• VSM Instructions
  • Register to Register Instructions
  • Operate on virtual SIMD registers (1KB - 16KB)
  • Support basic operations (+, -, /, *, sqrt, \+, rep ... etc)

E# compilation strategy

RESULTS

Lino on SCC versus Nehalem ( Xeon )

Why is SCC so much slower?

A least squares fit of the equation to the performance data from both machines gives α, the time to compute the partial velocity change for two bodies, and β, the time to transmit data about a body between two cores. Both parameters are in nanoseconds.
machine    α (ns)   β (ns)   Ratio α/β
SCC        677      94000    1/138
Nehalem    27       223      1/8

The SCC has a slower clock than the Xeon and uses an older core design, but the worst factor is its much slower inter-core communication.
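The ratio can be read the other way round: β/α is the number of pairwise velocity updates a core could have computed in the time it takes to move one body between cores. The short sketch below recomputes this from the fitted values quoted above (SCC: α = 677, β = 94000; Nehalem: α = 27, β = 223). The struct and function names are hypothetical and only serve the illustration.

```c
#include <stdio.h>

struct fit {
    const char *machine;
    double alpha_ns;   /* time for one pairwise partial velocity change */
    double beta_ns;    /* time to send one body's data between two cores */
};

/* Pairwise updates that cost as much as one body transfer. */
double pair_updates_per_transfer(double alpha_ns, double beta_ns)
{
    return beta_ns / alpha_ns;
}

/* Print one line per machine. */
void print_fits(const struct fit *fits, int count)
{
    for (int k = 0; k < count; k++)
        printf("%-8s one transfer costs about %.0f pair updates\n",
               fits[k].machine,
               pair_updates_per_transfer(fits[k].alpha_ns, fits[k].beta_ns));
}
```

Calling print_fits on the two rows gives roughly 139 for the SCC and 8 for the Nehalem, which matches the 1/138 and 1/8 ratios in the table.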

Compare all systems

Pascal Performance on Cell for Large Problems
Performance (seconds) per iteration, by N-body problem size

Language        Processor   1K      4K      8K       16K
Vector Pascal   PPE         0.381   4.852   20.355   100.250
Vector Pascal   1 SPE       0.105   1.387   5.715    22.278
Vector Pascal   2 SPEs      0.065   0.782   3.334    13.248
Vector Pascal   4 SPEs      0.048   0.470   2.056    8.086
C               PPE         0.045   0.771   3.232    16.524

machine    α (ns)   β (ns)   Ratio α/β
SCC        677      94000    1/138
Nehalem    27       223      1/8
Cell       82       42200    1/500

CONCLUSIONS

Hardware
Machines order as follows
1. Sandy Bridge
2. Nehalem
3. Cell
4. SCC

General conclusions
• Shared memory designs much faster and easier to programme
• Ring communications with hardware logic much better than the SCC message passing or the Cell DMA architecture
• Single instruction set much easier to target

Software

Array Languages
Pro
• These seem to outperform C whilst allowing code to remain at high level.
• Allow machine independence.
Con
• Cannot use existing legacy code that uses loops.

Lino
Pro
• Gives good performance on standard Linux.
• Competitive with Eden, C#, Go.
• Allows use of legacy code.
Con
• Not as fast as array languages.
• Performance on SCC disappointing.
