GPU Accelerated Linear Algebra

Optimising performance of Matrix-multiplication, LU- and QR-decomposition using Cuda.

Published by: Mikkel Bundgaard-Ovesen on Aug 30, 2011
Copyright: Attribution Non-commercial

Table of contents
List of figures
List of tables
Summary
1 Introduction
    1.1 Motivation
    1.2 Reading guide
    1.3 Problem definition
    1.4 Method
    1.5 Scope
        1.5.1 Algorithms
        1.5.2 Numerical stability
        1.5.3 IEEE 754 and double-precision
        1.5.4 BLAS
2 Background
    2.1 Linear algebra
    2.2 GPU computing
3 Parallel platforms
    3.1 Cuda
        3.1.1 History
        3.1.2 Version
        3.1.3 Cuda program
        3.1.4 Architecture
        3.1.5 Limitations
    3.2 GPU.NET
        3.2.1 Overview
        3.2.2 Development
        3.2.3 Execution
        3.2.4 Limitations and bugs
        3.2.5 Evaluation
4 Hardware platform
    4.1 Analysis
    4.2 Benchmarking
        4.2.1 Memory performance
        4.2.2 Arithmetic performance
5 Implementation
    5.1 Development environment
    5.2 Design decisions
    5.3 Optimisation
        5.3.1 Strategy
6 Matrix-multiplication
    6.1 Analysis
        6.1.1 The sequential algorithm
        6.1.2 Parallelism
    6.2 Simple algorithm
        6.2.1 The algorithm
        6.2.2 Test and results
    6.3 Optimisation
        6.3.1 Unroll loop with threads
        6.3.2 Tiling v1
        6.3.3 Tiling v2 with latency hiding
        6.3.4 Tiling v3 with prefetching
        6.3.5 Tiling v4 and v5 with more output per thread
        6.3.6 Cuda compute capability
    6.4 Evaluation
7 LU decomposition
    7.1 Analysis
 
