
By Xinggao Xia and Jong Chul Lee

Applications of Systems of Linear Equations

 Biology
 Physics
 Chemistry
n: the dimension of the matrix (the number of equations/unknowns)

Technique               Additions   Multiplications/Divisions
Gauss-Jordan            n^3/2       n^3/2
Gaussian Elimination    n^3/3       n^3/3
Cramer's Rule           n^4/3       n^4/3

Table 1: Computational Complexity of Various Solving Techniques

Comments:
 Computational cost grows rapidly as the matrix dimension increases.
 Gaussian Elimination has a clear complexity advantage as the matrix size increases.
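
To make the growth concrete, here is a rough order-of-magnitude check (our arithmetic, not from the slides) for n = 1000:

\[
\frac{n^3}{3} \approx 3.3 \times 10^{8}
\qquad \text{vs.} \qquad
\frac{n^4}{3} \approx 3.3 \times 10^{11},
\]

so Cramer's Rule needs roughly 1000 times as many multiplications as Gaussian Elimination at this size.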
Ax = B  →  augmented matrix [A | B]

Initial state (pivot = row 1; m21 = 1, m31 = 1):
[ 1  1  1 |  1 ]
[ 1  2  4 | -1 ]
[ 1  3  9 |  1 ]

Iteration No.1 (pivot = row 2; m32 = 2):
[ 1  1  1 |  1 ]
[ 0  1  3 | -2 ]
[ 0  2  8 |  0 ]

Iteration No.2:
[ 1  1  1 |  1 ]
[ 0  1  3 | -2 ]
[ 0  0  2 |  4 ]

Iteration No.3 (normalize the last pivot):
[ 1  1  1 |  1 ]
[ 0  1  3 | -2 ]
[ 0  0  1 |  2 ]

Normalization

Figure: progress of the elimination. After Iteration No.i, the first i diagonal entries are 1 with zeros below them; after Iteration No.N the matrix is upper triangular with a unit diagonal.
Inter-iteration parallelism
For iteration i:

A[j][] = A[j][] - m[j][i] * A[i][]   (A[i][] is the pivot row)

The multiplier array m must be determined before each iteration.

Figure: the multiplier column (m0i, m1i, m2i, m3i, ...) alongside column i after elimination, which holds 1 at the pivot and 0 elsewhere.

This pattern fits the CUDA architecture perfectly: within an iteration, every row update is independent.
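
As an illustration, a minimal CUDA kernel for this row update could look as follows. This is a sketch, not the authors' code: eliminate_kernel, d_A, and d_m are our names, and the system is assumed stored row-major as an N x (N+1) augmented matrix.

__global__ void eliminate_kernel(float *d_A, const float *d_m, int n, int i)
{
    // Thread (j, k) updates one element for iteration i.
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // column index (k == n is the RHS)
    int w = n + 1;                                  // width of the augmented matrix
    if (j < n && k <= n && j != i)                  // the pivot row i stays unchanged
        d_A[j * w + k] -= d_m[j] * d_A[i * w + k];
}

Since rows with a zero multiplier are no-ops, the same kernel covers both the traditional update (m[j] = 0 for j <= i) and the modified one introduced next.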


Modified Gaussian Elimination is chosen for the CUDA linear equations solver:
 More parallelism
 No back substitution
 Partial pivoting guarantees the accuracy of the solution
Traditional Gaussian Elimination

Initial state (pivot = row 1; m21 = 1, m31 = 1):
[ 1  1  1 |  1 ]
[ 1  2  4 | -1 ]
[ 1  3  9 |  1 ]

Iteration No.1 (pivot = row 2; m32 = 2):
[ 1  1  1 |  1 ]
[ 0  1  3 | -2 ]
[ 0  2  8 |  0 ]

Iteration No.2:
[ 1  1  1 |  1 ]
[ 0  1  3 | -2 ]
[ 0  0  2 |  4 ]

Modified Gaussian Elimination

Initial state (pivot = row 1; m21 = 1, m31 = 1):
[ 1  1  1 |  1 ]
[ 1  2  4 | -1 ]
[ 1  3  9 |  1 ]

Iteration No.1 (pivot = row 2; m32 = 2):
[ 1  1  1 |  1 ]
[ 0  1  3 | -2 ]
[ 0  2  8 |  0 ]

Iteration No.2 (also eliminates above the pivot):
[ 1  0 -2 |  3 ]
[ 0  1  3 | -2 ]
[ 0  0  2 |  4 ]

Iteration No.3:
[ 1  0  0 |  7 ]
[ 0  1  0 | -8 ]
[ 0  0  1 |  2 ]

The solution x = (7, -8, 2) is read directly from the right-hand side.
For iteration i:
Row j = Row j - mj * Row i   (for every row j ≠ i)

Figure: column i after iteration i. Traditional Gaussian Elimination zeroes only the entries below the pivot; modified Gaussian Elimination additionally zeroes the entries above the pivot (the added elimination), leaving 1 at row i and 0 elsewhere.

Traditional Gaussian Elimination + Back Substitution

Figure: traditional Gaussian Elimination (Iteration No.1 .. No.N-1) reduces the matrix to upper-triangular form with a unit diagonal; back substitution then recovers the solution. Together they form the traditional Gaussian linear solver.

Modified Gaussian Elimination

Figure: modified Gaussian Elimination (Iteration No.1 .. No.N) reduces the matrix directly to the identity, so the right-hand side becomes the solution and no back substitution is needed. This is the modified Gaussian linear solver.
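
For reference, a serial sketch of the modified scheme, following the slides' description (our code; names are illustrative):

// Modified Gaussian Elimination on an augmented n x (n+1) row-major matrix.
// Every iteration eliminates column i both below and above the pivot,
// so no back substitution is needed.
void modified_ge(float *A, int n)
{
    int w = n + 1;                            // augmented width
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (j == i) continue;             // skip the pivot row
            float m = A[j * w + i] / A[i * w + i];
            for (int k = i; k <= n; k++)
                A[j * w + k] -= m * A[i * w + k];
        }
    }
    for (int i = 0; i < n; i++)               // the matrix is now diagonal:
        A[i * w + n] /= A[i * w + i];         // x_i = b_i / a_ii (normalization)
}

Partial pivoting is omitted here for brevity; the full solver adds it, as the next slide shows.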
For (i = 0; i < N; i++)
{
    Partial pivoting:
    {
        Transfer the ith column back to the host;
        Search this column for its maximum element and return the index;  (Host)
        Swap rows if necessary;                                           (Device)
    }
    Determine the multiplier column;  (Device)
    Modified Gaussian elimination;    (Device)
}
Normalize the solution;  (Device)
Transfer the solution back to the host;
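
A CUDA sketch of that loop is below. All kernel and buffer names (pivot_column_kernel, swap_rows_kernel, multiplier_kernel, eliminate_kernel, normalize_kernel, d_A, d_col, h_col, d_m, d_x, h_x) are our assumptions, not the authors' identifiers:

#include <math.h>

void solve(float *d_A, float *d_col, float *h_col, float *d_m,
           float *d_x, float *h_x, int n,
           dim3 grid1d, dim3 block1d, dim3 grid2d, dim3 block2d)
{
    for (int i = 0; i < n; i++) {
        // Partial pivoting: copy column i to the host and find the pivot there.
        pivot_column_kernel<<<grid1d, block1d>>>(d_A, d_col, n, i);
        cudaMemcpy(h_col, d_col, n * sizeof(float), cudaMemcpyDeviceToHost);
        int p = i;
        for (int j = i + 1; j < n; j++)            // host-side search for the maximum
            if (fabsf(h_col[j]) > fabsf(h_col[p])) p = j;
        if (p != i)                                // switch rows on the device
            swap_rows_kernel<<<grid1d, block1d>>>(d_A, n, i, p);

        multiplier_kernel<<<grid1d, block1d>>>(d_A, d_m, n, i);  // m[j] = A[j][i] / A[i][i]
        eliminate_kernel<<<grid2d, block2d>>>(d_A, d_m, n, i);   // modified GE update
    }
    normalize_kernel<<<grid1d, block1d>>>(d_A, d_x, n);          // x[i] = b[i] / A[i][i]
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
}

Kernels launched in the same stream execute in order, and the default-stream cudaMemcpy synchronizes before the host search, so no explicit synchronization is needed.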

Threads Architecture
 For matrix-wide work such as the modified Gaussian Elimination kernel, each thread handles one element update A[j][k] = A[j][k] - m[j] * A[i][k] for iteration i; a two-dimensional grid and block are used, for a total of N*N threads in the kernel.
 For row or column work such as partial pivoting, each thread handles one element of the row or column; a one-dimensional grid and block are used, for a total of N threads in the kernel.
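
A possible launch configuration for the two layouts (BLOCK_SIZE = 16 is our assumption; kernel names match the sketches above):

#define BLOCK_SIZE 16

// 2D layout: one thread per element of the augmented matrix.
dim3 block2d(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid2d((n + BLOCK_SIZE) / BLOCK_SIZE,        // covers n+1 columns (incl. RHS)
            (n + BLOCK_SIZE - 1) / BLOCK_SIZE);   // covers n rows
eliminate_kernel<<<grid2d, block2d>>>(d_A, d_m, n, i);

// 1D layout: one thread per element of a single row or column.
dim3 block1d(BLOCK_SIZE * BLOCK_SIZE);
dim3 grid1d((n + block1d.x - 1) / block1d.x);
multiplier_kernel<<<grid1d, block1d>>>(d_A, d_m, n, i);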
cudaMemcpy: device to host (d_temp → h_temp)

Figure: partial-pivoting data flow. Kernel1 gathers column i into d_temp; a single cudaMemcpy copies d_temp to h_temp, where the host searches for the maximum element (here c); Kernel2, Kernel3, and Kernel4 then perform the row switch and subsequent steps on the device. Switching rows by kernel minimizes device-host transportation.
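
The row switch itself parallelizes naturally, one thread per column element. A possible kernel (a sketch; the slides do not show the kernel bodies):

__global__ void swap_rows_kernel(float *d_A, int n, int r1, int r2)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int w = n + 1;                    // augmented width, including the RHS
    if (k <= n) {
        float t = d_A[r1 * w + k];    // exchange one element of each row
        d_A[r1 * w + k] = d_A[r2 * w + k];
        d_A[r2 * w + k] = t;
    }
}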
For the ith iteration, each thread handles: A[j][k] = A[j][k] - m[j] * A[i][k]

Figure: data partitioning for iteration i. The N x N matrix is tiled into BLOCK_SIZE x BLOCK_SIZE sub-blocks handled by thread blocks B(i,j); within each block, threads T(0,0) .. T(0,M-1) each update one matrix element.
Figure: column i and the multiplier column m for iteration i. After the update, column i of A holds 1 at row i and 0 elsewhere; the multiplier column m (length N) supplies the per-row multipliers.
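
A one-thread-per-row sketch of the multiplier computation (our code; setting m to zero at the pivot row keeps that row unchanged during the update):

__global__ void multiplier_kernel(const float *d_A, float *d_m, int n, int i)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int w = n + 1;
    if (j < n)
        d_m[j] = (j == i) ? 0.0f : d_A[j * w + i] / d_A[i * w + i];
}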

For the ith iteration: A[j][k] = A[j][k] - m[j] * A[i][k]

Figure: shared-memory version. Each thread block B(i,j) loads the data it reuses, its slice of the pivot row (Row i) and its slice of the multiplier column, into shared memory before applying the update.
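
A sketch of how such staging could look (illustrative only; every thread in a block row reuses the same m[j], and every thread in a block column reuses the same pivot-row element, which is what makes shared memory pay off):

#define BLOCK_SIZE 16

__global__ void eliminate_shared_kernel(float *d_A, const float *d_m, int n, int i)
{
    __shared__ float pivot_row[BLOCK_SIZE];   // slice of Row i for this block
    __shared__ float mult[BLOCK_SIZE];        // slice of the multiplier column

    int w = n + 1;
    int j = blockIdx.y * BLOCK_SIZE + threadIdx.y;  // row
    int k = blockIdx.x * BLOCK_SIZE + threadIdx.x;  // column

    if (threadIdx.y == 0 && k <= n)
        pivot_row[threadIdx.x] = d_A[i * w + k];
    if (threadIdx.x == 0 && j < n)
        mult[threadIdx.y] = d_m[j];
    __syncthreads();

    if (j < n && k <= n && j != i)
        d_A[j * w + k] -= mult[threadIdx.y] * pivot_row[threadIdx.x];
}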
Platform Configuration:
GPU: GeForce 8400 GS (1 SM, 8 cores, clock rate 1.40 GHz)
CPU: Intel Core 2 Quad Q6600 (clock rate 2.39 GHz)

Comments:
 The GPU implementation (global or shared memory, 1 SM) is much slower than the CPU implementation.
 To mimic a Tesla (16 SM), GPU times are also reported scaled by 16.

Time (ms)                                   512    1024    2048    4096
Serial Traditional Gaussian Linear Solver    47     403    5214   46098
Serial Modified Gaussian Linear Solver       71     564    8412   69949
Global Memory (1 SM)                       1718   13488  108916  862580
Shared Memory (1 SM)                        662    4806   38923  312787
Global Memory (scaled by 16)                107     843    6807   53911
Shared Memory (scaled by 16)                 41     300    2433   19549
Figure: execution time (ms) vs. matrix size (512-4096) for the Traditional GE, Modified GE, Global (16 SM), and Shared (16 SM) implementations.

Comments:
 The CPU prefers the traditional GE solver to the modified GE solver.
 The GPU shared-memory implementation is always 2-3 times faster than the global-memory implementation.
 The GPU (16 SM) shared-memory implementation achieves around a 2x speedup over the traditional GE solver.
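
Both observations can be checked against the timing table (our arithmetic): at matrix size 4096,

\[
\frac{862580}{312787} \approx 2.8 \quad \text{(global vs. shared, 1 SM)},
\qquad
\frac{46098}{19549} \approx 2.4 \quad \text{(traditional GE vs. shared, 16 SM)}.
\]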
For the 1024 case (1 SM), the global memory implementation takes 13488 ms and the shared memory implementation takes 4806 ms.

Method             #Calls   GPU (usec)   % GPU time
Global GE_kernel     1024    1.3e+07        99.11
Shared GE_kernel     1024    4.8e+06        97.6

          gld uncoalesced   gld coalesced   % uncoalesced rate
Global          1048576          131072            89
Shared            61440           73728            45
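
The % uncoalesced rate is consistent with uncoalesced / (uncoalesced + coalesced) (our arithmetic):

\[
\frac{1048576}{1048576 + 131072} \approx 0.89,
\qquad
\frac{61440}{61440 + 73728} \approx 0.45 .
\]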
Conclusion:
 A linear equations solver based on modified Gaussian Elimination has been implemented on CUDA.
 The shared memory implementation is about 3 times faster than the global memory implementation.
 On a 16 SM GPU, the shared memory implementation is expected to be about 3 times faster than the serial traditional Gaussian Elimination solver.
 Partial pivoting guarantees stability and accuracy (error less than 0.001 compared to the serial code).

Problem found:
The increase in uncoalesced global memory accesses offsets the advantage gained from the additional parallelism of modified Gaussian Elimination.
