n: the number of elements in the matrix
Comments:
Computational complexity grows rapidly as the dimension of the matrix increases.
The Gaussian Elimination solver has a clear complexity advantage as the matrix size increases.
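Ax = B, written as the augmented matrix [A | B]
Pivot row 1 with m21 = 1, m31 = 1; then pivot row 2 with m32 = 2; then Normalization of row 3:

1 1 1 | 1      1 1 1 | 1      1 1 1 |  1      1 1 1 |  1
1 2 4 | 3  ->  0 1 3 | 2  ->  0 1 3 |  2  ->  0 1 3 |  2
1 3 9 | 1      0 2 8 | 0      0 0 2 | -4      0 0 1 | -2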
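The elimination and normalization steps above, followed by back substitution, can be sketched on the CPU as follows. This is a minimal illustration rather than the solver from the slides: the function names are hypothetical, pivots are assumed nonzero (no partial pivoting), each pivot row is normalized at the start of its iteration, and the right-hand side B = (1, 3, 1) is an assumption chosen so that the first elimination step produces the row 0 1 3 2.

```python
from fractions import Fraction


def forward_eliminate(aug):
    """Reduce an augmented matrix [A | B] to unit upper-triangular form."""
    n = len(aug)
    for i in range(n):
        pivot = aug[i][i]  # assumed nonzero: this sketch does no pivoting
        aug[i] = [v / pivot for v in aug[i]]  # normalize the pivot row
        for j in range(i + 1, n):
            m = aug[j][i]  # multiplier for row j in iteration i
            aug[j] = [a - m * p for a, p in zip(aug[j], aug[i])]
    return aug


def back_substitute(aug):
    """Solve a unit upper-triangular augmented system from the last row up."""
    n = len(aug)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        x[i] = aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))
    return x


def solve_traditional(a, b):
    """Traditional GE: forward elimination, then back substitution."""
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)]
           for row, rhs in zip(a, b)]
    return back_substitute(forward_eliminate(aug))


A = [[1, 1, 1], [1, 2, 4], [1, 3, 9]]
B = [1, 3, 1]  # assumed right-hand side, see the note above
x = solve_traditional(A, B)
```

Exact fractions are used so the intermediate rows can be compared by hand; a real solver would use floating point together with pivoting.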
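Iteration No.1, Iteration No.2, ..., Iteration No.N
[Figure: the matrix after each iteration. Each iteration leaves a 1 on the diagonal of one more column with 0s below it, so after Iteration No.N every diagonal entry is 1 and every entry below the diagonal is 0.]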
Inter-iteration parallelism
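For Iteration i
[Figure: the multipliers m0i, m1i, m2i, m3i, ... listed beside column i, which is reduced to a single 1 with 0s in the other rows.]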
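Traditional Gaussian Elimination (pivot row 1 with m21 = 1, m31 = 1; then pivot row 2 with m32 = 2):

1 1 1 | 1      1 1 1 | 1      1 1 1 |  1
1 2 4 | 3  ->  0 1 3 | 2  ->  0 1 3 |  2
1 3 9 | 1      0 2 8 | 0      0 0 2 | -4

Modified Gaussian Elimination (each pivot row also eliminates the rows above it):

1 1 1 | 1      1 1 1 | 1      1 0 -2 | -1      1 0 0 | -5
1 2 4 | 3  ->  0 1 3 | 2  ->  0 1  3 |  2  ->  0 1 0 |  8
1 3 9 | 1      0 2 8 | 0      0 0  2 | -4      0 0 1 | -2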
For iteration i:
Row j = Row j - mj * Row i
[Figure: column i after iteration i, with a 1 in row i and 0 in every other row. The 0s above row i are the elimination added in modified Gaussian Elimination.]
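The row update above, applied to every row j other than row i, can be sketched as follows. This is a minimal CPU illustration under stated assumptions: the function name is hypothetical, pivots are assumed nonzero, and the test system (including B = (1, 3, 1)) is chosen for illustration. Eliminating above as well as below the pivot is the scheme commonly known as Gauss-Jordan elimination; whether the slides' modified Gaussian Elimination matches it in every detail is an assumption.

```python
from fractions import Fraction


def solve_modified(a, b):
    """Modified GE sketch: clear column i in every row except row i."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)]
           for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = aug[i][i]  # assumed nonzero: this sketch does no pivoting
        aug[i] = [v / pivot for v in aug[i]]  # normalize the pivot row
        for j in range(n):
            if j == i:
                continue  # the pivot row itself is not updated
            m = aug[j][i]  # multiplier mj for this iteration
            # Row j = Row j - mj * Row i, for rows above and below row i
            aug[j] = [v - m * p for v, p in zip(aug[j], aug[i])]
    # A is now the identity, so the last column already holds the solution
    return [row[n] for row in aug]


A = [[1, 1, 1], [1, 2, 4], [1, 3, 9]]
B = [1, 3, 1]  # assumed right-hand side for illustration
x = solve_modified(A, B)
```

Because every row except row i is updated independently within an iteration, all of those updates could run in parallel, and no back substitution step is needed afterwards.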
[Figure: Traditional Gaussian Elimination consists of Gaussian Elimination followed by Back Substitution; Modified Gaussian Elimination is a single stage.]
Threads Architecture
Matrix handling, such as the modified Gaussian Elimination kernel: each thread performs one update A[j][k] = A[j][k] - mj*A[i][k] for iteration i. The kernel uses a two-dimensional grid and block, for a total of N*N threads.
Row or column handling, such as partial pivoting, among others: each thread handles one element of the row or column. The kernel uses a one-dimensional grid and block, for a total of N threads.
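The two thread mappings described above can be emulated on the CPU to make the indexing concrete. This is only a sketch: the function names are hypothetical, each function call stands in for one GPU thread, the multipliers are computed in a separate one-dimensional pass before the two-dimensional pass overwrites column i (an assumption), and the pivot row is left unnormalized until the end.

```python
from fractions import Fraction


def multiplier_thread(a, i, j):
    """1-D launch: hypothetical thread j computes the multiplier for row j."""
    return a[j][i] / a[i][i]


def update_thread(a, m, i, j, k):
    """2-D launch: hypothetical thread (j, k) updates the element A[j][k]."""
    if j == i:
        return a[i][k]  # the pivot row is left unchanged
    return a[j][k] - m[j] * a[i][k]


def run_iteration(a, i):
    """Emulate one iteration: a 1-D pass, then a 2-D pass over all elements."""
    rows, cols = len(a), len(a[0])
    m = [multiplier_thread(a, i, j) for j in range(rows)]
    # Results go to a new matrix, so every emulated thread reads old values.
    return [[update_thread(a, m, i, j, k) for k in range(cols)]
            for j in range(rows)]


def solve(a, b):
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)]
           for row, rhs in zip(a, b)]
    n = len(aug)
    for i in range(n):
        aug = run_iteration(aug, i)
    # A is now diagonal; divide each right-hand side by its diagonal entry.
    return [aug[j][n] / aug[j][j] for j in range(n)]


x = solve([[1, 1, 1], [1, 2, 4], [1, 3, 9]], [1, 3, 1])
```

In this emulation the two-dimensional pass covers the augmented matrix, so it runs N*(N+1) element updates rather than exactly N*N; how the right-hand side is handled in the actual kernel is not stated in the source.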
[Figure: cudaMemcpy, device to host, copying d_temp to h_temp.]
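[Figure: grid and block layout over a dimension of size N. Blocks are labeled B(0,1), ..., B(i,j), ..., B(0,N-1); each block contains BLOCK_SIZE threads labeled T(0,0), T(0,1), ..., T(0,M-1).]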
[Figure: data used in iteration i: Column i (a 1 in Row i and 0 elsewhere), Row i, and the Multiplier Column m of length N, handled in blocks of BLOCK_SIZE.]
[Figure: shared memory layout over a dimension of size N. Blocks are labeled B(0,1), ..., B(i,j), ..., B(0,N-1); labels: Shared Memory, Row i.]
[Figure: chart over matrix size (512, 1024, 2048, 4096) with four series: Traditional GE, Modified GE, Global(16SM), Shared(16SM). The vertical axis runs from 0 to 80000.]
Comments:
The CPU favors the traditional GE solver over the modified GE solver.
The GPU shared memory implementation is always 2-3 times faster than the global memory implementation.
The GPU (16SM) shared memory implementation is about 2 times faster than traditional GE.
For the 1024 case (1SM), the global memory implementation takes 13488 ms and the shared memory implementation takes 4806 ms.
Problem found:
The additional uncoalesced global memory accesses offset the advantage gained from the extra parallelism in modified Gaussian Elimination.