
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Thi Thuy Linh

HIGH-PERFORMANCE COMPUTING WITH THE GPU GRAPHICS PROCESSOR AND ITS APPLICATIONS

MASTER'S THESIS

Hanoi - 2009

Master's Thesis - Nguyen Thi Thuy Linh

DECLARATION
I carried out this thesis seriously and with complete honesty, for the purpose of study and research to improve my knowledge and professional skills. In the thesis I have used reference material from a number of authors, which I have listed in the references section at the end of the thesis. I hereby declare, and take responsibility for, the content and honesty of my Master's graduation thesis.

Hanoi, December 2009
Student

Nguyen Thi Thuy Linh

ACKNOWLEDGEMENTS
The fundamental knowledge in this thesis is the result of three years (2005-2008) during which I was fortunate to be taught, trained, and guided directly by the lecturers of the University of Engineering and Technology, Vietnam National University, Hanoi, and by lecturers from universities and research institutes in Vietnam and abroad.

I would like to express my sincere thanks to the lecturers of the Department of Information Systems, Faculty of Information Technology, and to the Office of Postgraduate Training of the University of Engineering and Technology, Vietnam National University, Hanoi, for creating favourable conditions for me during my studies.

I would like to express my sincere gratitude and deepest thanks to Dr. Nguyen Hai Chau, who directly supervised me and guided me in solving the problems addressed in this thesis.

I would also like to thank my colleagues at the Vietnam Joint Stock Commercial Bank for Industry and Trade, who supported and helped me while I was working on the thesis.

Finally, this thesis is something I wish to share with my family, colleagues, friends, and classmates of the graduate class K12T3.

Hanoi, December 2009
Student

Nguyen Thi Thuy Linh

TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
INTRODUCTION
GLOSSARY OF TERMS
LIST OF FIGURES AND TABLES
    List of figures
    List of tables
Chapter 1. OVERVIEW OF PARALLEL COMPUTING AND GPUs
    1.1. Overview of parallel computing
        1.1.1. Parallel computer models
        1.1.2. Parallel programming models
        1.1.3. The need for parallel application development tools
    1.2. Overview of the GPU
        1.2.1. Introduction to the GPU
        1.2.2. History of GPU development
        1.2.3. GPU architecture
        1.2.4. Computing on the GPU
        1.2.5. Software environment
        1.2.6. Techniques and applications
Chapter 2. THE GPU COMPILER SYSTEM AND PROGRAMMING LANGUAGE
    2.1. Introduction to the CUDA development environment
    2.2. Programming model
        2.2.1. A highly multithreaded coprocessor
        2.2.2. Thread batching
        2.2.3. Memory model
    2.3. Hardware implementation
        2.3.1. A set of SIMD multiprocessors with on-chip shared memory
        2.3.2. Execution model
        2.3.3. Compute capability
        2.3.4. Multiple devices
        2.3.5. Mode switches
    2.4. Application programming interface
        2.4.1. An extension to the C programming language
        2.4.2. Language extensions
        2.4.3. Common runtime component
        2.4.4. Device runtime component
    2.5. Performance guidelines
        2.5.1. Instruction performance
        2.5.2. Number of threads per block
        2.5.3. Data transfer between host and device
        2.5.4. Benefits of memory organization
Chapter 3. APPLYING THE GPU TO THE N-BODY PROBLEM AND EXPERIMENTS
    3.1. The N-body simulation problem
    3.2. Building the N-body problem on the CPU
        3.2.1. The Verlet time-integration algorithm
        3.2.2. Formulas for the basic force and the potential
        3.2.3. The N-body simulation algorithm
    3.3. Building the N-body problem on the GPU
    3.4. Experiments
        3.4.1. Experimental environment
        3.4.2. Experimental results
    3.5. Experimental conclusions
CONCLUSION
REFERENCES

INTRODUCTION
Graphics processing units (GPUs) have become an integral part of today's computer systems. The past six years have seen an impressive increase in GPU performance and capability. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor whose arithmetic throughput and memory bandwidth substantially exceed those of a comparable CPU. The GPU's rapid growth in both programmability and computing power has created a new research trend: a research community has successfully mapped a large number of complex, computationally demanding problems onto the GPU. This is part of a general effort to apply GPUs to the high-performance problems of modern computing. General-purpose computing on the GPU (GPGPU) is an attractive alternative to the CPU in modern computer systems. In the not-too-distant future we may see the GPU take over from the CPU tasks such as image and graphics processing and complex computations, rather than being confined to 3D games.

Given this practical significance, the thesis studies general-purpose computing on the GPU and experiments directly on a representative high-performance computing problem, the n-body problem. The thesis consists of three main chapters:

Chapter 1: Overview of parallel computing and GPUs. This chapter gives an overview of parallel computing and then introduces the basics of the GPU and how computation is carried out on it.

Chapter 2: The GPU compiler system and programming language. This chapter covers the programming environment, the programming language, how to set up programs, and performance guidelines for implementing computing applications on the GPU.

Chapter 3: Applying the GPU to the n-body problem and experiments. Building on the material of the preceding chapters, the author implements and tests an n-body simulation on both the CPU and the GPU. The results are used to compare the two and to comment on the superior computing power of the GPU relative to the traditional CPU. They also open up new directions for improving the performance of the n-body problem on the GPU.

GLOSSARY OF TERMS
No.  English term: Vietnamese term or meaning
1.  API: Application Program Interface; an API defines a standard interface for invoking a set of functions.
2.  coprocessor: bộ đồng xử lý (coprocessor).
3.  gpgpu: general-purpose computing on the GPU.
4.  GPU: graphics processing unit.
5.  kernel: hạt nhân (kernel).
6.  MIMD: Multiple Instruction Multiple Data.
7.  primary surface: the main surface, a concept used in texturing.
8.  processor: bộ xử lý (processor).
9.  Rasterization: the scanning of fragments onto the screen.
10. SIMD: Single Instruction Multiple Data.
11. stream: dòng (stream).
12. streaming processor: bộ xử lý dòng (stream processor).
13. texture: the structure of an object, regarded as a scaled-down model of the object.
14. texture fetches: texture read functions.
15. texture reference: tham chiếu kết cấu (texture reference).
16. warp: each block is split into SIMD groups of threads called warps.

LIST OF FIGURES AND TABLES

List of figures
Figure 1. Shared-memory parallel computer
Figure 2. Distributed-memory parallel computer
Figure 3. Operation of a SIMD system
Figure 4. Operation of a MIMD system
Figure 5. Multithreaded programming model
Figure 6. Message-passing model
Figure 7. Data-parallel model
Figure 8. SPMD model
Figure 9. MPMD model
Figure 10: Photo of a 3dfx Voodoo3
Figure 11: The NVIDIA and AMD GPU architectures have a large number of programmable units organized in a unified parallel fashion
Figure 12: Scan performance on the CPU, on a graphics-based GPU implementation (using OpenGL), and on a direct-compute GPU implementation (using CUDA). Results obtained on a GeForce 8800 GTX GPU and an Intel Core2Duo Extreme 2.93 GHz CPU. Figure taken from H. Nguyen (ed), GPU Gems 3, copyright (c) 2008 NVIDIA Corporation, published by Addison-Wesley Professional.
Figure 13: Architecture of the CUDA software stack
Figure 14: Gather and scatter memory operations
Figure 15: Shared memory brings data closer to the ALUs
Figure 16: Thread blocks
Figure 17: Memory model
Figure 18: Hardware model
Figure 19: Image of an N-body simulation [~8]
Figure 20: Chart comparing GPU and CPU execution time by number of particles in the n-body simulation
Figure 22: Chart of the CPU/GPU speedup ratio as the number of particles in the n-body simulation grows
Figure 21: Computational load on the CPU when running the n-body simulation with 256K particles. One CPU is always at 100%, and at times the load also takes up to 100% of the other CPUs
Figure 23: Chart of performance on the GeForce 8800 GTX GPU in the n-body simulation as the number of particles grows

List of tables
Table 1: Experimental results for the N-body problem on an Nvidia GeForce 8800 GTX GPU and an Intel(R) Core(TM)2 Quad 2.66GHz CPU
Table 2: Speedup ratio between CPU and GPU
Table 3: Processing speed on the 8800 GTX GPU as the number of particles grows

Chapter 1. OVERVIEW OF PARALLEL COMPUTING AND GPUs


1.1. Overview of parallel computing
As science and engineering advance, they pose many problems with very large computational workloads. For some of these problems the result is meaningful only if it is obtained within an allowed time span. Examples include real-time computation, simulation of quantum-level phenomena, computing the trajectories of objects in space, and weather forecasting. To solve such problems, researchers have sought to speed up computation by one of two methods, or by a combination of both:

Method 1: Improve the technology and raise the processing speed of a single computer. This requires a great deal of time, effort, and money, and the speed attainable is still bounded by some limit.

Method 2: Divide the problem into small pieces of work that can run in parallel on several processors.

Developing computing technology along the lines of Method 2 gave rise to parallel computing, which is the simultaneous use of multiple computing resources to solve one problem. The computing resources may be a single computer with several processors, a set of networked computers, or a combination of the two. Parallel computing reduces the execution time of a problem, by an amount that depends on how the problem is partitioned and on the number of processors running the program. The most important principle of parallel computing is concurrency, that is, processing several tasks at the same time.

There are two main technologies in parallel computing today.

The first is to use supercomputers with a very large number of integrated processors, designed as a coherent whole in both hardware and software. The technologies used in supercomputers are usually leading-edge, which makes supercomputer systems very expensive. Supercomputers are therefore typically used in fields where the computation is complex, sensitive, and subject to real-time requirements, such as simulating the operation of aircraft engines, defence, and aerospace.

The second is to connect computers together so that they work on the problem jointly. Such a system of connected computers is a cluster parallel computing system. Its advantages are a much lower cost than a supercomputer of the same power (because it uses commodity components) and flexibility (the number of nodes, processors, memory, and network devices are all highly customizable). The rapid development of computer networks and current network technologies has removed much of the communication bottleneck in cluster systems, which has led to their wide adoption. Fields that use cluster parallel computing systems usually have moderate computational demands and no real-time requirements, for example image processing, fingerprint recognition, structural engineering calculations, and simulation of experiments.

1.1.1. Parallel computer models

A parallel computer system is a computer with more than one processor that allows parallel processing. This definition covers supercomputers with hundreds of processors, networks of workstations, and embedded systems. In recent years, even computers whose microprocessors use multicore technology, which places several cores in one processor, have been regarded as parallel computer systems [8].

Parallel computer architectures differ in how the processors (or processing elements) are connected to one another and to memory, so there are many kinds. According to Flynn's taxonomy, however, the following two parallel architectures are the most common [8]:

SIMD - Single Instruction Multiple Data
MIMD - Multiple Instruction Multiple Data

Parallel computers are also classified by their memory architecture. Shared-memory parallel computers have several processors that all access one global, shared memory region. Every change that one processor makes to the memory contents is visible to the other processors.

Figure 1. Shared-memory parallel computer

This class of computers can be divided into two subclasses: UMA (Uniform Memory Access), in which the memory access time is the same for every processor, and NUMA (Non-Uniform Memory Access), in which the memory access time is not always the same.

By contrast, distributed-memory parallel computers also have several processors, but each processor can access only its own local memory; there is no memory region shared by all processors. The processors operate independently, and a change in one processor's local memory does not affect the memory of the other processors.

Figure 2. Distributed-memory parallel computer

1.1.1.1. The single instruction, multiple data model (SIMD)

Figure 3. Operation of a SIMD system

SIMD is a type of parallel computer in which all processors execute one and the same instruction at any given time. The instruction is, however, applied to a different data set on each processor. This model has the advantage of simplicity in both hardware and software, but it suits only fairly specific problems whose processing is highly regular, such as image processing. Algorithms written for multicomputers usually run inefficiently on SIMD machines.
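The SIMD idea described above can be illustrated with a minimal sketch. It only models the concept in plain Python and does not use real SIMD hardware; the names `simd_apply` and `brighten` are hypothetical and chosen for illustration. A single "instruction" is applied to every data element, one element per conceptual processing element.

```python
def simd_apply(instruction, lanes):
    """Apply one instruction to every lane's data element.

    On real SIMD hardware all lanes would execute in lockstep;
    here the list comprehension only models that behaviour.
    """
    return [instruction(x) for x in lanes]


def brighten(pixel):
    # The single instruction: add 10 to a pixel value, clamped to 255.
    return min(pixel + 10, 255)


pixels = [0, 100, 250, 255]           # one data element per processing element
brightened = simd_apply(brighten, pixels)
print(brightened)                     # [10, 110, 255, 255]
```

Image processing, the example named in the text, fits this pattern because every pixel undergoes the same operation independently of the others.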

1.1.1.2. The multiple instruction, multiple data model (MIMD)

MIMD is a parallel computer architecture model in common use today. In this model each processor can execute different instructions on its own, different data. Instruction execution may be synchronous or asynchronous, deterministic or nondeterministic. This makes the MIMD model very flexible for parallel processing.

Figure 4. Operation of a MIMD system

However, the flexibility of the MIMD model comes with a certain complexity. Programming parallel problems in this model demands considerable effort in studying and analysing the problem to find an optimal decomposition. To program in this model, the programmer needs a high level of expertise both in the problem domain and in parallel programming techniques.

1.1.2. Parallel programming models

Parallel programming covers the design and implementation of parallel computer programs so that they run on parallel computer systems. In other words, it means parallelizing sequential programs in order to solve a larger problem, to reduce execution time, or both. Parallel programming focuses on dividing the overall problem into smaller subtasks, assigning those subtasks to individual processors, and synchronizing them to obtain the final result. The most important principle here is concurrency, that is, processing several tasks at the same time. Before writing a parallel program, you therefore need to know whether the problem can be parallelized at all (based on either its data or its functions). There are two main approaches to parallel programming:

Implicit parallelism: the compiler or some other program automatically distributes the work to the processors.

Explicit parallelism: the programmer must partition the program by hand so that it can execute in parallel.

In addition, the parallel programmer must take load balancing into account. The processors should perform roughly the same amount of work; if one processor is overloaded, work must be moved to a processor with a lighter load.

Communication between processors is an indispensable part of parallel programming. There are two main communication techniques: shared memory and message passing.

A parallel programming model is a set of software techniques for expressing parallel algorithms and running applications on a parallel system. A model encompasses applications, languages, compilers, libraries, communication systems, and parallel I/O. In practice, no single parallel computer and no single way of distributing work among processors is efficient for every problem. The programmer must therefore choose the right parallel programming model, or a mix of models, to develop parallel applications on a particular system.

Many parallel programming models exist today: Threads, Message Passing, Data Parallel, and Hybrid [9].
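The load-balancing requirement mentioned in this section can be illustrated with a small sketch. The thesis does not prescribe a balancing algorithm; the greedy "largest task first, to the least-loaded processor" heuristic below is an assumption chosen for illustration, and the function name `balance` and the task costs are hypothetical.

```python
import heapq


def balance(task_costs, n_processors):
    """Assign tasks greedily: largest task first, to the least-loaded processor."""
    heap = [(0, p) for p in range(n_processors)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_processors)]
    for cost in sorted(task_costs, reverse=True):
        load, p = heapq.heappop(heap)              # processor with the lightest load
        assignment[p].append(cost)
        heapq.heappush(heap, (load + cost, p))
    return assignment


assignment = balance([7, 3, 5, 2, 8, 4], 3)
loads = [sum(tasks) for tasks in assignment]
print(assignment)   # [[8, 2], [7, 3], [5, 4]]
print(loads)        # [10, 10, 9]
```

With these six tasks the three processors end up with loads of 10, 10, and 9, so none of them sits idle for long while another is still working. This is a static assignment made before execution; moving work between processors at run time, as the text describes, would require a dynamic scheme.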
1.1.2.1. The threads model

In the threads model, a single process can have many threads of execution. For example, a main program a.out is loaded and run by the system. It performs some sequential work and then creates a number of child threads. Each thread has its own local data but can also access the shared resources of the program a.out. Each thread can be regarded as a subroutine of the main program and can run in parallel with the other threads.

Figure 5. Multithreaded programming model

From a programming perspective, implementations of the threads model comprise:
- a library of functions that are called from within the parallel source code;
- a set of compiler directives embedded in sequential or parallel source code.
Two widely used implementations of this model are POSIX Threads and OpenMP.
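The behaviour described for the main program and its child threads can be sketched as follows. This is a minimal illustration that uses Python's standard `threading` module rather than POSIX Threads or OpenMP; the names `main_program` and `child` are hypothetical. In CPython the global interpreter lock prevents these threads from running Python code truly in parallel, so the sketch shows the structure of the model, not a speedup.

```python
import threading


def main_program(data, n_threads):
    """Do some sequential setup, then spawn child threads that share one total."""
    shared = {"total": 0}            # resource shared by all threads of the program
    lock = threading.Lock()

    def child(chunk):
        local_sum = sum(chunk)       # data local to this thread
        with lock:                   # protect the shared resource from races
            shared["total"] += local_sum

    chunks = [data[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=child, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                     # wait for every child before continuing
    return shared["total"]


print(main_program(list(range(1, 101)), 4))   # 5050
```

Each child keeps its partial sum in a local variable and touches the shared total only briefly under the lock, which mirrors the split between thread-local data and shared program resources described above.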
1.1.2.2. The message-passing model

Message passing is a model widely used in parallel computing today. It is usually applied to distributed systems. The characteristics of the model are:
- A set of threads use their own local memory throughout the computation.
- Several threads may share the same physical resource.
- Threads exchange data by sending and receiving messages.
- Data transfer usually requires cooperative operations performed by each thread. For example, a send operation in one thread must be matched by a receive operation in another thread.

Figure 6. Message-passing model

From a programming perspective, the message-passing model is realized by calling the subroutines of a programming library from within the source code. The two most popular libraries today are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).
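The matching of send and receive operations can be sketched with two threads that communicate only through message queues. This is an illustration built on Python's standard `queue` and `threading` modules, not on MPI or PVM; the function names and the use of `None` as an end-of-stream marker are assumptions made for the example.

```python
import queue
import threading


def sender(outbox, values):
    for v in values:
        outbox.put(v)            # send one message
    outbox.put(None)             # assumed end-of-stream marker


def receiver(inbox, reply):
    total = 0                    # local memory of the receiving thread
    while True:
        message = inbox.get()    # matching receive; blocks until a message arrives
        if message is None:
            break
        total += message
    reply.put(total)             # the result is also returned as a message


def exchange(values):
    channel = queue.Queue()
    reply = queue.Queue()
    workers = [
        threading.Thread(target=sender, args=(channel, values)),
        threading.Thread(target=receiver, args=(channel, reply)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return reply.get()


print(exchange([1, 2, 3, 4]))    # 10
```

Neither thread reads the other's variables directly; all data, including the final result, travels through explicit send and receive calls, which is the defining property of the model.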
1.1.2.3. The data-parallel model

Figure 7. Data-parallel model

The data-parallel model emphasizes parallel operations on a data set. The threads work collectively on the same data structure, but each on a different part of it. On shared-memory architectures, all threads can access the common data structure through shared memory. On distributed-memory architectures, the common data structure is split into pieces that reside in the local memory of each thread. Programming with the data-parallel model is usually done by writing a program with data-parallel constructs. These can be provided through library calls or through the directives of a data-parallel compiler, as in Fortran 90 or HPF (High Performance Fortran).
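The partitioning of one data structure among several workers can be sketched as below. The example uses `ThreadPoolExecutor` from Python's standard library instead of Fortran 90 or HPF; `split`, `scale_chunk`, and `data_parallel_scale` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Split data into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def scale_chunk(chunk, factor):
    # The same operation is applied to every element of this worker's part.
    return [x * factor for x in chunk]


def data_parallel_scale(data, factor, n_workers):
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        parts = list(pool.map(lambda c: scale_chunk(c, factor), chunks))
    return [x for part in parts for x in part]


print(split(list(range(7)), 3))                      # [[0, 1, 2], [3, 4], [5, 6]]
print(data_parallel_scale([1, 2, 3, 4, 5], 2, 2))    # [2, 4, 6, 8, 10]
```

Every worker runs the same operation on its own slice, and the slices are concatenated in order at the end, so the result equals what a sequential loop over the whole array would produce.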
1.1.2.4. Other models

The hybrid model

The hybrid model combines two or more parallel programming models to make computation more convenient and efficient. The most common example is the use of the message-passing model (MPI) together with the threads model (POSIX Threads or OpenMP) to increase computing power on SMP (Symmetric Multiprocessor) machines.

The single program, multiple data model

The single program, multiple data model (SPMD) is a high-level programming model that can be built on any combination of the parallel programming models above. A single program is executed by all tasks simultaneously, and the tasks use different data. At any given moment, the tasks may be executing the same instruction or different instructions within the same program.
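The SPMD idea can be sketched with one function that every task runs, where each task selects its own data and may branch differently according to its rank. The sketch uses Python threads for brevity; `program`, `spmd_run`, and the rank-based branching are illustrative assumptions and not taken from the thesis.

```python
import threading


def program(rank, size, data, results):
    """The single program that every task executes."""
    my_part = data[rank::size]                    # each task works on different data
    if rank == 0:
        results[rank] = ("max", max(my_part))     # tasks may follow different branches
    else:
        results[rank] = ("sum", sum(my_part))


def spmd_run(size, data):
    results = [None] * size                       # one result slot per task
    tasks = [
        threading.Thread(target=program, args=(rank, size, data, results))
        for rank in range(size)
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results


print(spmd_run(3, list(range(1, 10))))   # [('max', 7), ('sum', 15), ('sum', 18)]
```

All three tasks execute the same `program`, but at a given moment task 0 may be computing a maximum while the others compute sums, which matches the statement that tasks can execute the same or different instructions of one program.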

Figure 8. SPMD model

The multiple program, multiple data model

Like SPMD, the multiple program, multiple data model (MPMD) is a high-level programming model that can be built on any combination of the parallel programming models above. An MPMD application typically has several programs that are executed by different tasks, and each task uses different data.

Figure 9. MPMD model

1.1.3. The need for parallel application development tools

Programming requires a large investment of effort and time. Integrated development environments were therefore developed very early on to make programming easier and to reduce programming time. Today, integrated development environments such as Microsoft Visual Studio, Borland Studio, Eclipse, KDevelop, and Anjuta make programming genuinely easier, even for people who are just starting to learn it.

Parallel programming, as discussed in Section 1.1.2, requires a specific parallel programming model. These models usually provide a programming library that supports parallel programming in one of the common programming languages, typically C/C++ or Fortran. To compile or run a program, however, one must use the tools that belong to the particular programming model rather than the ordinary compilers of the programming language. These tools are usually command-line (console) tools, such as mpicc and mpirun for the MPI message-passing model.

Parallel programming is difficult for the programmer, especially when large applications have to be developed. Moreover, bugs are unavoidable in programming, and bugs in parallel programs are more complex than those in sequential programs. Debugging software for parallel programs makes programming considerably easier. In addition, parallel computing systems usually have complex architectures, so modelling and programming problems on them requires professionalism and a deep understanding of parallel computing. Building parallel application development tools is therefore essential to lay the groundwork for applying parallel computing in science, engineering, and everyday life.

Recognizing this need, companies, organizations, and universities around the world have researched and built many parallel application development tools. Most of these tools are still at the experimental research stage and are not widely used. They include Sun HPC ClusterTools [10], PTP-Eclipse [~6], P-GRADE (Parallel Grid Run-time and Application Development Environment) [~29], and PADE (Parallel Applications Development Environment) [~4]. Each of these integrated development environments is usually designed for one specific parallel programming model and is deployed on the specific system that the company, organization, or university owns. No tool yet supports every parallel programming model and can be deployed on every system. Even so, these tools have made it much easier for programmers to solve parallel problems, and they simplify the steps of developing parallel applications.

1.2. Overview of the GPU


1.2.1. Introduction to the GPU

The graphics processing unit, or GPU for short, is a specialized processor that offloads 3D graphics rendering from the computer's microprocessor. It is used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Today's GPUs are very efficient at computer graphics operations, and their highly parallel structure makes them much more effective than general-purpose microprocessors for a range of complex algorithms. In a personal computer, a GPU is found on a video card or is integrated directly on the motherboard. More than 90% of modern desktop and notebook computers have integrated GPUs, which are usually far less powerful than the GPUs on dedicated video cards.

1.2.2. History of GPU development

A GPU [~32] is a processor attached to a graphics card and dedicated to floating-point calculations. The development of graphics cards has been closely tied to that of microprocessor chips. A graphics accelerator incorporates custom microchips that implement special operations commonly used in graphics rendering. The capability of these microchips therefore determines the capability of the graphics accelerator. They are used mainly for 3D games or for rendering 3D output. A GPU implements a number of graphics primitive operations in a way that makes them run much faster than drawing directly to the screen with the CPU.

The 1970s: The ANTIC and CTIA chips provided hardware control of mixed graphics and text modes, positioning and display (in a hardware-supported form), and other effects on Atari 8-bit computers. The ANTIC chip was a special-purpose processor for mapping (in a programmable fashion) text and graphics data to the video output. The designer of the ANTIC chip, Jay Miner, later designed the graphics chip for the Commodore Amiga.

The 1980s: The Commodore Amiga was the first mass-market computer to include a blitter (BLock Image Transfer, the movement of a large bitmap, as in 2D games) in its video hardware, and IBM's 8514 graphics system was one of the first PC video cards to implement 2D primitives in hardware. The Amiga was unique for its time in that it featured what would now be recognized as a full graphics accelerator, offloading practically all video generation functions to hardware, including line drawing, area fill, block image transfer, and a graphics coprocessor with its own primitive instruction set. Before this (and for quite some time afterwards on most systems), a general-purpose CPU had to handle every aspect of drawing the display.

The 1990s: In 1991, S3 Graphics introduced the first single-chip 2D accelerator, the S3 86C911 (which its designers named after the Porsche 911 to signal the performance increase it promised). The 86C911 spawned a host of imitators: by 1995, all major PC graphics chip makers had added 2D acceleration support to their chips. By this time, fixed-function Windows accelerators had surpassed expensive general-purpose graphics coprocessors in Windows performance, and these coprocessors faded from the PC market.

Throughout the 1990s, 2D GUI acceleration continued to evolve. As manufacturing capabilities improved, so did the level of integration of graphics chips. Additional application programming interfaces (APIs) arrived for a variety of tasks, such as Microsoft's WinG graphics library for Windows 3.x and its later DirectDraw interface for hardware acceleration of 2D games in Windows 95 and later.

In the early and mid-1990s, CPU-assisted real-time 3D graphics became increasingly common in computer and console games, which led to growing public demand for hardware-accelerated 3D graphics. Early examples of mass-market 3D graphics hardware can be found in fifth-generation video game consoles such as the PlayStation and Nintendo 64. In the PC world, notable failed first attempts at low-cost 3D graphics chips were the S3 ViRGE, ATI Rage, and Matrox Mystique. These chips were essentially previous-generation 2D accelerators with key 3D features added on. Many were designed to be compatible with the previous chip generation for ease of implementation and minimal cost. Initially, acceptable 3D graphics performance was possible only with discrete boards dedicated to accelerating 3D functions (and lacking 2D GUI acceleration), such as the 3dfx Voodoo. However, as manufacturing technology progressed again, video, 2D GUI acceleration, and 3D functionality were all integrated into one chip. Rendition's Verite chipsets were probably the first to do this well enough to be worthy of note.

Figure 10: Photo of a 3dfx Voodoo3

OpenGL appeared in the early 1990s as a professional graphics API, but it became a dominant force on the PC and a driving force for hardware development. Software implementations of OpenGL were common during this time, although the influence of OpenGL eventually led to widespread hardware support. Over time, a parity emerged between the features offered in hardware and those offered in OpenGL. DirectX became popular among Windows game developers during the late 1990s. Unlike OpenGL, Microsoft insisted on providing strict one-to-one support of hardware. This approach initially made DirectX less popular as a stand-alone graphics API, since many GPUs provided their own specific features, which existing OpenGL applications were already able to benefit from, leaving DirectX often one generation behind. Over time, Microsoft began to work more closely with hardware developers and started to align the releases of DirectX with those of the supporting graphics hardware. Direct3D 5.0 was the first version of the growing API to gain widespread adoption in the gaming market, and it competed directly with many more hardware-specific, often proprietary graphics libraries, while OpenGL maintained a strong following. Direct3D 7.0 introduced support for hardware-accelerated transform and lighting (T&L). 3D accelerators moved beyond being simple rasterizers and added another significant hardware stage to the 3D rendering pipeline. The NVIDIA GeForce 256 (also known as NV10) was the first product on the market with this capability. Hardware transform and lighting, both already features of OpenGL, came to hardware in the 1990s and set the precedent for the later pixel shader and vertex shader units, which were far more flexible and programmable.

From 2000 to the present: With the advent of the OpenGL API and similar functionality in DirectX, GPUs added programmable shading to their capabilities. Each pixel could now be processed by a short program that could include additional image textures as inputs, and each geometric vertex could likewise be processed by a short program before it was projected onto the screen. NVIDIA was the first to produce a chip capable of programmable shading, the GeForce 3 (code-named NV20). In October 2002, with the introduction of the ATI Radeon 9700 (also known as R300), the world's first Direct3D 9.0 accelerator, pixel and vertex shaders could implement looping and lengthy floating-point math. In general they were quickly becoming as flexible as CPUs, while being far faster for image-array operations. Pixel shading is often used for things like bump mapping, which adds texture to make an object look shiny, dull, rough, or even round or extruded.

As the processing power of GPUs has increased, so has their demand for electrical power. High-performance GPUs often consume more energy than current CPUs.

Today, parallel GPUs have begun to make computational inroads against the CPU, and a subfield of research called GPGPU (General Purpose Computing on GPU) has found its way into fields as diverse as oil exploration, scientific image processing, linear algebra, 3D reconstruction, and stock option pricing.
iu ny tng p lc ln cc nh sn xut GPU t "ngi dng GPGPU" ci tin thit k phn cng, thng tp trung vo vic thm tnh linh hot hn cho m hnh lp trnh. 1.2.3. Kin trc GPU GPU lun lun l mt b x l vi d tha ti nguyn tnh ton. Tuy nhin xu hng quan trng nht gn y l trng by kh nng tnh ton cho cc lp trnh vin. Nhng nm gn y, GPU pht trin t mt hm c nh, b x l chuyn

dng ti b x l lp trnh song song, y tnh nng c lp vi vic b sung thm cc chc nng c nh, v cc chc nng chuyn bit. Hn bao gi ht cc kha cnh v kh nng lp trnh ca b x l chim v tr trung tm. Ti bt u bng cch ghi chp li s tin trin ny, bt u t cu trc ca ng ng dn ha GPU v lm th no GPU tr thnh kin trc, cng c ginh cho cc mc ch thng dng, sau i xem xt k hn cc kin trc ca GPU hin i.
1.2.3.1. Graphics Pipeline

The input to the GPU is a list of geometric primitives, typically triangles, in a three-dimensional world coordinate system. Through many steps, these primitives are shaded and mapped onto the screen, where they are assembled into a final image. It is useful to first explain the specific steps of the canonical pipeline before showing how the pipeline became programmable [~3].

Vertex operations: The input primitives are formed from individual vertices. Each vertex must be transformed into screen space and shaded, typically by computing its interaction with the lights in the scene. Typical scenes contain tens to hundreds of thousands of vertices, and each vertex can be computed independently, so this stage is well suited to parallel hardware.

Primitive assembly: The vertices are assembled into triangles, the fundamental hardware-supported primitive in today's GPUs.

Rasterization: Rasterization is the process of determining which screen-space pixel locations are covered by each triangle. Each triangle generates a primitive called a "fragment" at each screen-space pixel location that it covers. Because many triangles may overlap at any pixel location, the color value of each pixel may be computed from several fragments.

Fragment operations: Using color information from the vertices, and possibly fetching additional data from global memory in the form of textures (images that are mapped onto surfaces), each fragment is shaded to determine its final color. As in the vertex stage, each fragment can be computed in parallel. This stage is typically the most computationally demanding stage of the graphics pipeline.

Composition: The fragments are assembled into a final image with one color per pixel, usually by keeping the fragment closest to the camera for each pixel location.

Historically, the operations available at the vertex and fragment stages were configurable but not programmable. For example, one of the key computations at the vertex stage is computing the color of each vertex as a function of the vertex properties and the lights in the scene. In the fixed-function pipeline, the programmer could control the position and color of the vertices and the lights, but not the lighting model that determined their interaction.
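The rasterization and composition stages described above can be illustrated with a small CPU-side sketch. It is only an illustration under simplifying assumptions: each triangle has a single constant depth and color, coverage is tested at pixel centers, and all names (`edge`, `rasterize`, `compose`, the dictionary keys) are hypothetical and not taken from any graphics API.

```python
# Illustrative sketch only: a tiny software rasterizer and depth-test compositor.
# All names and the triangle representation are assumptions made for this example.

def edge(ax, ay, bx, by, px, py):
    """Signed area of (A, B, P); its sign tells on which side of edge A->B the point P lies."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height):
    """Rasterization: yield one fragment (x, y, depth, color) per covered pixel center."""
    (x0, y0), (x1, y1), (x2, y2) = tri["verts"]
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
            all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
            if all_non_negative or all_non_positive:
                yield x, y, tri["depth"], tri["color"]


def compose(triangles, width, height, background="."):
    """Composition: keep, for every pixel, the fragment closest to the camera."""
    depth = [[float("inf")] * width for _ in range(height)]
    image = [[background] * width for _ in range(height)]
    for tri in triangles:
        for x, y, z, color in rasterize(tri, width, height):
            if z < depth[y][x]:  # depth test
                depth[y][x] = z
                image[y][x] = color
    return image


# Two overlapping triangles; "B" is nearer to the camera than "A".
scene = [
    {"verts": [(0, 0), (4.2, 0), (0, 4.2)], "depth": 2.0, "color": "A"},
    {"verts": [(0, 0), (2.2, 0), (0, 2.2)], "depth": 1.0, "color": "B"},
]
rows = ["".join(row) for row in compose(scene, 4, 4)]
```

Because the depth test decides the winner per pixel, the result does not depend on the order in which the triangles are submitted, which is one reason the fragments can be processed independently.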
1.2.3.2. Evolution of GPU Architecture

The fixed-function pipeline lacked the generality to express efficiently the more complicated shading and lighting operations that are prerequisites for complex effects. The key step was replacing the fixed-function per-vertex and per-fragment operations with user-specified programs that run on each vertex and each fragment. Over the past six years, these vertex programs and fragment programs have become increasingly capable, with larger limits on their size and resource consumption, with more fully featured instruction sets, and with more flexible control-flow operations.

After many years of separate instruction sets for vertex and fragment operations, current GPUs support the unified Shader Model 4.0 on both vertex and fragment shaders [~3]. The hardware must support shader programs of at least 65 thousand (65k) static instructions and an unlimited number of dynamic instructions. The instruction set, for the first time, supports both 32-bit integers and 32-bit floating-point numbers. The hardware must allow an arbitrary number of direct and indirect reads from global memory (texture). Finally, dynamic flow control in the form of loops and branches must be supported.

As the shader model has evolved and become more powerful, and as GPU applications of all kinds have increased the complexity of vertex and fragment programs, GPU architectures have increasingly focused on the programmable parts of the graphics pipeline. Indeed, while previous generations of GPUs could best be described as additions of programmability to a fixed-function pipeline, today's GPUs are better characterized as a programmable engine surrounded by supporting fixed-function units.

1.2.3.3. Architecture of the Modern GPU

In the introduction, we noted that the GPU is built for different application demands than the CPU: large, parallel computation requirements with an emphasis on throughput rather than latency. Consequently, the architecture of the GPU has developed in a different direction from that of the CPU.

Consider a pipeline of tasks, such as we see in most graphics APIs (and in many other applications), that must process a large number of input elements. In such a pipeline, the output of each successive task is fed into the input of the next task. The pipeline exposes the task parallelism of the application, since data in multiple pipeline stages can be computed at the same time; within each stage, computing more than one element at a time is data parallelism. To execute such a pipeline, a CPU would take a single element (or a group of elements) and process the first stage of the pipeline, then the next stage, and so on. The CPU divides the pipeline in time, applying all the resources of the processor to each stage in turn.

GPUs have historically taken a different approach. The GPU divides the resources of the processor among the different stages, so that the pipeline is divided in space, not in time. The part of the processor working on one stage feeds its output directly into a different part that works on the next stage.

This organization was highly successful in fixed-function GPUs for two reasons. First, the hardware in any given stage could exploit data parallelism within that stage, processing multiple elements at the same time, and because many task-parallel stages were running at any time, the GPU could meet the very large compute needs of the graphics pipeline. Second, the hardware of each stage could be customized with special-purpose hardware for its given task, allowing substantially greater compute and area efficiency than a general-purpose solution. For example, the rasterization stage, which computes pixel coverage information for each input triangle, is more efficient when implemented in special-purpose hardware. As programmable stages (such as the vertex and fragment programs) replaced fixed-function stages, the special-purpose fixed-function components were simply replaced by programmable components, but the task-parallel organization did not change.

The result was a long, feed-forward GPU pipeline with many stages, each typically accelerated by special-purpose parallel hardware. In a CPU, any given operation may take on the order of 20 cycles between entering and leaving the CPU pipeline. On a GPU, a graphics operation may take thousands of cycles from start to finish. The latency of any given operation is long. However, the task and data parallelism across and between stages delivers high throughput.

The main disadvantage of the task-parallel GPU pipeline is load balancing. As with any pipeline, the performance of the GPU pipeline depends on its slowest stage. If the vertex program is complex and the fragment program is simple, overall throughput depends on the performance of the vertex program. In the early days of programmable stages, the instruction sets of the vertex and fragment programs were quite different, so these stages were separate. However, as both the vertex and fragment programs became more fully featured and their instruction sets converged, GPU architects reconsidered the strict task-parallel pipeline in favor of a unified shader architecture, in which all programmable units in the pipeline share a single programmable hardware unit. While much of the pipeline is still task-parallel, the programmable units now divide their time among vertex work, fragment work, and geometry work (with the new DirectX 10 geometry shaders). These units can exploit both task and data parallelism. As the programmable parts of the pipeline take responsibility for more and more of the computation in the graphics pipeline, the architecture of the GPU is moving from a strict pipelined task-parallel architecture to one built around a single unified data-parallel programmable unit.

AMD introduced the first unified shader architecture in its Xenos GPU for the Xbox 360 (2005). Today, GPUs from both AMD and NVIDIA feature unified shaders (Figure 10). The benefit for GPU users is better load balancing, at the cost of more complex hardware. The benefit for GPGPU users is clear: with all the programmable power in a single hardware unit, GPGPU programmers can now target that programmable unit directly, rather than dividing work across multiple hardware units as in the previous approach.

Figure 11: The NVIDIA and AMD GPU architectures contain a large number of programmable units organized in a unified, parallel fashion
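The load-balancing argument for unified shaders can be made concrete with a deliberately idealized cost model. The sketch below is an assumption-laden illustration, not a description of any real GPU: the function names, the count of 16 units, the 4/12 split, and the workloads are all invented for the example, and scheduling overheads are ignored.

```python
# Idealized model (assumption): work is measured in unit-cycles and divides perfectly.

def split_pipeline_time(vertex_work, fragment_work, vertex_units, fragment_units):
    """Dedicated units per stage: the pipeline runs at the pace of its slowest stage."""
    return max(vertex_work / vertex_units, fragment_work / fragment_units)


def unified_time(vertex_work, fragment_work, total_units):
    """Unified units: every unit can take vertex or fragment work, so the load is shared."""
    return (vertex_work + fragment_work) / total_units


# Hypothetical workloads: (vertex work, fragment work)
workloads = {
    "vertex-heavy": (1200, 400),
    "fragment-heavy": (100, 1500),
    "matches the 4/12 split": (400, 1200),
}
report = {
    name: (split_pipeline_time(v, f, 4, 12), unified_time(v, f, 16))
    for name, (v, f) in workloads.items()
}
```

In this model the fixed split is only as fast as the unified design when the workload happens to match the split; for any other mix, one group of units idles while the other is the bottleneck.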

1.2.4. Computing on the GPU

Having looked at the hardware architecture of the GPU, we now turn to its programming model.

1.2.4.1. The GPU Programming Model

The programmable units of the GPU follow a single-program, multiple-data (SPMD) programming model. For efficiency, the GPU processes many elements (vertices or fragments) in parallel using the same program. Each element is independent of the other elements, and in the base programming model elements cannot communicate with each other. All GPU programs must be structured in this way: many parallel elements, each processed in parallel by a single program.

Each element can operate on 32-bit integer or floating-point data with a reasonably complete general-purpose instruction set. Elements can read data from a shared global memory (a "gather" operation) and, on the newest GPUs, can also write back to arbitrary locations in shared global memory (a "scatter" operation).

This programming model is well suited to straight-line programs, since many elements can be processed in lockstep, running exactly the same code. Code written in this way is called SIMD, for single instruction, multiple data. As shader programs have become more complex, programmers prefer to let different elements take different paths through the same program, which leads to the more general SPMD model. How is this supported on the GPU?

One of the benefits of the GPU is that a large fraction of its resources is devoted to computation. Allowing a different execution path for each element requires a substantial amount of control hardware. Instead, today's GPUs support arbitrary control flow per thread but impose a heavy penalty on incoherent branching. GPU vendors have largely adopted this approach. Elements are grouped into blocks, and blocks are processed in parallel. If elements within a block branch in different directions, the hardware computes both sides of the branch for all elements in the block. The block size has been decreasing with recent GPU generations; today it is on the order of 16 elements.

When writing GPU programs, then, branches are permitted but not free. Programmers who structure their code so that blocks branch coherently will make the best use of the hardware.
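The cost of incoherent branching can be sketched with a simple counting model. This is a sketch under stated assumptions rather than a measurement: the block size of 16 follows the text, while the per-path costs (10 and 30 cycles) and the function names are hypothetical.

```python
# Simplified model (assumption): a block pays for every branch side that at least
# one of its elements takes, as described in the text above.

def block_cost(takes_then, cost_then, cost_else):
    """Cost of one block given a list of booleans, one per element."""
    cost = 0
    if any(takes_then):
        cost += cost_then   # some element needs the "then" side
    if not all(takes_then):
        cost += cost_else   # some element needs the "else" side
    return cost


def total_cost(predicates, block_size, cost_then, cost_else):
    """Sum the block costs over consecutive blocks of block_size elements."""
    return sum(
        block_cost(predicates[i:i + block_size], cost_then, cost_else)
        for i in range(0, len(predicates), block_size)
    )


# 64 elements, half of which take each side of the branch.
incoherent = [i % 2 == 0 for i in range(64)]   # neighbours disagree in every block
coherent = sorted(incoherent, reverse=True)    # same elements, grouped by outcome

cost_incoherent = total_cost(incoherent, 16, cost_then=10, cost_else=30)
cost_coherent = total_cost(coherent, 16, cost_then=10, cost_else=30)
```

Both layouts perform the same useful work; only the grouping differs, which is why reordering data so that neighbouring elements branch the same way can pay off.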
1.2.4.2. General-Purpose Computing on the GPU (GPGPU)

GPGPU maps general-purpose computational problems onto the GPU, using the graphics hardware in much the same way as any standard graphics application. Because of this similarity, the process is both easier and harder to explain. On the one hand, the actual operations are the same and are easy to follow. On the other hand, the terminology differs between graphics and general-purpose use. Harris provides an excellent description of this mapping process [3]. We begin by describing GPU programming in graphics terms, then show how the same steps are used in a general-purpose way to create GPGPU applications, and finally use the same steps to show the simpler and more direct way in which today's GPU computing applications are written.

1) Programming a GPU for graphics: We begin with the same GPU pipeline described above and concentrate on its programmable aspects. The programmer specifies geometry that covers a region of the screen. The rasterizer generates a fragment at each pixel location covered by that geometry. Each fragment is shaded by the fragment program. The fragment program computes the value of the fragment by combining math operations with reads from a global "texture" memory. The resulting image can then be used as a texture on future passes through the graphics pipeline.

2) Programming a GPU for general-purpose programs (old):

Co-opting this pipeline to perform general-purpose computation involves exactly the same steps but different terminology. A motivating example is a fluid simulation computed over a grid: at each time step, we compute the next state of the fluid at each grid point from the current state at that grid point and at its neighboring grid points. The programmer specifies a geometric primitive that covers the computation domain of interest. The rasterizer generates a fragment at each pixel location covered by that primitive. (In our example, the primitive must cover a grid of fragments equal in size to the domain of the fluid simulation.) Each fragment is shaded by a general-purpose SPMD program. (Each grid point runs the same program to update the state of its fluid.) The fragment program computes the value of the fragment by combining math operations with "gather" accesses to global memory. (Each grid point can access the state of its neighbors from the previous time step while computing its current value.) The resulting buffer in global memory can then be used as an input on future passes. (The current state of the fluid will be used in the next time step.)
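The grid example above can be sketched on the CPU as follows. The update rule is a plain four-neighbor average chosen only to keep the example short; it is not a real fluid solver, and the names `fragment_program`, `render_pass`, and `simulate` are hypothetical. What the sketch does preserve is the structure described in the text: one program per grid point, gather-only reads from the previous state, and an output buffer that becomes the input of the next pass.

```python
# Illustrative sketch (assumptions: averaging rule, clamp-to-edge addressing, names).

def fragment_program(state, x, y):
    """Runs once per grid point; it only gathers from the previous state."""
    height, width = len(state), len(state[0])

    def fetch(i, j):
        # Clamp-to-edge addressing, similar in spirit to a clamped texture fetch.
        return state[min(max(j, 0), height - 1)][min(max(i, 0), width - 1)]

    return 0.25 * (fetch(x - 1, y) + fetch(x + 1, y) + fetch(x, y - 1) + fetch(x, y + 1))


def render_pass(src):
    """One pass: every output element is computed independently from src."""
    return [[fragment_program(src, x, y) for x in range(len(src[0]))]
            for y in range(len(src))]


def simulate(state, steps):
    """Ping-pong between buffers: the output of one pass is the input of the next."""
    for _ in range(steps):
        state = render_pass(state)
    return state


initial = [[0.0, 0.0, 0.0],
           [0.0, 16.0, 0.0],
           [0.0, 0.0, 0.0]]
after_one_step = simulate(initial, 1)
```

The input buffer is never modified during a pass, which mirrors the restriction of the graphics-based approach that a pass cannot read and write the same buffer.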

3) Programming a GPU for general-purpose programs (new): One of the historical difficulties in programming GPGPU applications was that, although their general-purpose tasks had nothing to do with graphics, the applications still had to be programmed through graphics APIs. In addition, the program had to be structured in terms of the graphics pipeline, with the programmable units accessible only as an intermediate step in that pipeline, when the programmer would almost certainly prefer to access the programmable units directly. The programming environments that we describe in detail in the Software Environments section resolve this difficulty by providing a more natural, more direct, non-graphics interface to the hardware and, in particular, to the programmable units. Today, GPU computing applications are structured as follows:

1) The programmer directly defines the computation domain of interest as a structured grid of threads.

2) A general-purpose SPMD program computes the value of each thread.

3) The value for each thread is computed by combining math operations with both "gather" (read) and "scatter" (write) accesses to global memory. Unlike in the two previous methods, the same buffer can be used for both reading and writing, which allows more flexible algorithms (for example, in-place algorithms that use less memory).

4) The resulting buffer in global memory can then be used as an input to later computation.

This programming model is powerful for several reasons. First, it allows the hardware to exploit fully the data parallelism of the application, because the parallelism is specified explicitly in the program. Next, it strikes a careful balance between generality (a fully programmable routine at each element) and restrictions that ensure good performance (the SPMD model, restrictions on branching for efficiency, restrictions on data communication between elements and between kernels or passes, and so on). Finally, direct access to the programmable units removes much of the complexity that earlier GPGPU programmers faced in co-opting the graphics interface for general-purpose programming. As a result, programs are more often expressed in a familiar programming language (such as NVIDIA's C-like syntax in its CUDA programming environment) and are simpler and easier to build and debug (increasingly so as the programming tools mature). The result is a programming model that lets its users take full advantage of the power of the GPU hardware while also permitting an increasingly high-level programming model that makes authoring complex applications productive.

1.2.5. Software Environments

In the past, the majority of GPGPU programs were written directly against graphics APIs. Although many researchers succeeded in getting applications to work through graphics APIs, there is a fundamental mismatch between the traditional programming models people were using and the goals of the graphics APIs. Originally, people used fixed-function, graphics-specific units (for example texture filters, blending, and stencil buffer operations) to perform GPGPU operations. This quickly improved with fully programmable fragment processors that provided pseudo-assembly languages, but the approach remained inaccessible to all but the most ardent researchers. With DirectX 9, higher-level shader programming became possible through the "high-level shading language" (HLSL), which presents a C-like interface for programming shaders. NVIDIA's Cg provides capabilities similar to HLSL but can compile to multiple targets, and it provided the first high-level language for OpenGL.

The OpenGL Shading Language (GLSL) is now the standard shading language for OpenGL. However, the main problem with Cg, HLSL, and GLSL for GPGPU is that they are inherently shading languages. Computation must still be expressed in graphics terms such as vertices, textures, fragments, and blending. So, although more general computation is possible with graphics APIs and shading languages, they remain largely inaccessible to the ordinary programmer.

What developers really wanted was a higher-level language designed explicitly for computation that abstracts away all the graphics mechanisms of the GPU. BrookGPU [~9] and Sh [~25] were two early academic research projects whose goal was to abstract the GPU as a streaming processor. The stream programming model structures programs to express parallelism and allows efficient communication and data transfer, matching the parallel processing resources and the memory system available on GPUs. A stream program consists of a set of streams, which are ordered sets of data, and kernels, which are functions applied to each element of a set of streams to produce one or more output streams.

Brook takes a pure streaming-computation approach, representing data as streams and computation as kernels. There is no notion of textures, vertices, fragments, or blending in Brook. Kernels are computations written in a restricted subset of C, notably without pointers and without scatter (arbitrary memory writes), and the input, output, and gather streams used by a kernel are declared as part of its definition. Brook provides stream access functionality such as repeat and stride, reductions over streams, and the ability to define domains (subsets of streams) to use as input and output. Kernels are run for each element in the domain of the output streams. The user's kernels are mapped to fragment shader code, and streams are mapped to textures. Uploading data to and downloading data from the GPU is done through explicit read/write calls, which translate into texture updates and framebuffer readbacks. Finally, computation is performed by rendering a quad that covers the pixels of the output domain.

The Microsoft Accelerator project [6] shares Brook's goal of being strongly compute-centric, but instead of offline compilation it relies on just-in-time compilation of data-parallel operators to fragment shaders. Unlike Brook and Sh, which are largely extensions of C, Accelerator is an array-based language built on C#, and all computation is expressed as operations on arrays. Unlike Brook, but like Sh, its delayed-evaluation model allows more aggressive online compilation, which can lead to more specialized and better optimized code for execution on the GPU.

In the past year, there have been large changes in the software environment that make GPGPU applications much easier to develop and that provide more robust, commercial-quality development systems. RapidMind [~24] commercialized Sh and now targets multiple platforms, including GPUs, the STI Cell Broadband Engine, and multicore CPUs; the new system is much more focused on computation than Sh, which included many graphics-centric operations.

Like Microsoft's Accelerator, RapidMind uses delayed evaluation and online compilation to capture and optimize the user's application code, together with operator and type extensions to C++ that provide direct support for arrays. PeakStream [8] is a new system, inspired by Brook, designed around operations on arrays. Like RapidMind and Accelerator, PeakStream uses just-in-time compilation, but it is much more aggressive in vectorizing the user's code to maximize performance on SIMD architectures. PeakStream is also the first platform to provide profiling and debugging support, the latter of which remains a serious problem in GPGPU development. Both of these efforts are examples of third-party vendors creating systems with support from the GPU vendors. As a demonstration of the excitement around GPGPU and of the success of these approaches to parallel computation, Google purchased PeakStream in 2007.

Both AMD and NVIDIA now also have their own GPGPU programming systems. AMD announced and released its system to researchers in late 2006. CTM, or "Close To The Metal", provides a low-level hardware abstraction layer (HAL) for the R5XX and R6XX series of ATI GPUs. CTM-HAL provides raw assembly-level access to the fragment engines (stream processors), together with an assembler and command buffers to control execution on the hardware. No graphics-specific features are exported through this interface. Computation is performed by binding memory as inputs and outputs of the stream processors, loading an ELF binary, and defining a domain over the outputs on which to execute the binary. AMD also offers the Compute Abstraction Layer (CAL). This layer adds higher-level constructs, similar to those in the Brook runtime system, and supports compilation to the GPU ISA for GLSL, HLSL, and pseudo-assembly such as Pixel Shader 3.0. For higher-level programming, AMD supports compiling Brook programs directly to R6XX hardware, which provides a higher level of programming abstraction than CAL or HAL.

NVIDIA's CUDA is a higher-level interface than AMD's HAL and CAL. Like Brook, CUDA provides a C-like syntax for code that executes on the GPU and compiles it offline. However, unlike Brook, which exposes only one dimension of parallelism (data parallelism through streaming), CUDA exposes two levels of parallelism: data parallelism and multithreading. CUDA also exposes much more of the hardware resources than Brook, including multiple levels of the memory hierarchy: per-thread registers, fast shared memory between the threads of a block, board memory, and host memory. Kernels in CUDA are also more flexible than those in Brook: they allow the use of pointers (although the data must reside on the board), general loads and stores to memory that let the user scatter data from within a kernel, and synchronization between the threads of a thread block. However, all of this flexibility and potential performance gain comes at the price of requiring the user to understand more of the low-level details of the hardware, in particular register usage, thread and thread-block scheduling, and the behavior of memory access patterns.

All of these systems allow developers to build large applications more easily. For example, the Folding@Home GPU client and a large fluid simulation application are written in BrookGPU, NAMD and VMD support execution on the GPU through CUDA, RapidMind has demonstrated ray tracing and crowd simulation, and PeakStream has shown oil-and-gas and computational finance applications. CUDA provides tuned and optimized BLAS and FFT libraries to use as building blocks for large applications. Low-level access to the hardware, as provided by CTM, or GPGPU-specific systems such as CUDA, allows developers to bypass the graphics drivers effectively and to maintain stable performance and correctness. The vendors' driver development and optimization for graphics APIs tend to be tested only on the latest or most popular games. Optimizations made for game performance can affect the stability and performance of GPGPU applications.

1.2.6. Techniques and Applications

We now survey some important computational primitives, algorithms, and applications of GPU computing. We first highlight four data-parallel operations that are central to GPU computing: performing scatter/gather memory operations, mapping a function onto many elements in parallel, reducing a collection of elements to a single element or value, and computing prefix reductions of an array in parallel. We examine these core computational primitives in some detail before turning to a higher-level overview of the algorithmic problems that researchers have studied on GPUs: scan, sort, search, data queries, differential equations, and linear algebra. These algorithms enable a wide range of applications, from databases and data mining to scientific simulations such as fluid dynamics and heat transfer (which we examine more closely in Parts VI and VII), rigid-body physics for games, and molecular dynamics.
1.2.6.1. Computational Primitives

The data-parallel architecture of GPUs requires programming idioms that have long been familiar to users of parallel supercomputers but are often new to today's programmers, who grew up with sequential machines or loosely coupled clusters. We briefly discuss four important idioms: scatter/gather, map, reduce, and scan. We describe these computational primitives in the context of both "old" (graphics-based) and "new" (direct-compute) GPU computing to emphasize the simplicity and flexibility of the direct-compute approach.

Scatter/gather: write to or read from a computed location in memory. Graphics-based GPU computing allows efficient gather through the texture subsystem, storing data as images (textures) and addressing data by computing the corresponding image coordinates and performing a texture fetch. However, texture limitations make this unwieldy: texture size restrictions require arrays of more than 4096 elements to be wrapped into multiple rows of a 2D texture, which adds extra addressing math, and a single texture fetch can retrieve only four 32-bit floating-point values, which limits per-element storage. Scatter in graphics-based GPU computing is difficult and requires rebinding the data for processing as vertices, using either vertex texture fetch or render-to-vertex-buffer. By contrast, direct-compute layers allow unrestricted reads and writes to arbitrary locations in memory. NVIDIA's CUDA lets the user access memory with standard C constructs (arrays, pointers, variables); AMD's CTM is nearly as flexible but uses 2D addressing.

Map: apply an operation to every element of a collection. This is typically expressed as a for loop in a sequential program (such as a thread on a single CPU core); a parallel implementation can reduce the time required by applying the operation to many elements in parallel. Graphics-based GPU computing performs a map as a fragment program invoked on a collection of pixels (one pixel per element). The fragment program of each pixel fetches the element data from a texture at a location corresponding to the location of the pixel in the rendered image, performs the operation, and then stores the result in the output pixel. Similarly, CTM and CUDA typically launch a thread program that performs the operation in many threads, with each thread loading an element, performing the computation, and storing the result. Note that because loops are supported, each thread may also loop over several elements.

Reduce: repeatedly apply a binary associative operation to reduce a collection of elements to a single element or value. Examples include finding the sum (or the average, minimum, maximum, variance, and so on) of a set of values. A sequential implementation on a traditional CPU would loop over an array, adding each element in turn to a running sum of the elements seen so far. By contrast, a parallel reduce-sum repeatedly performs sums in parallel on an ever-shrinking set of elements. Graphics-based GPU computing implements reduce by rendering progressively smaller sets of pixels. In each rendering pass, a fragment program reads several values from a texture (performing perhaps four or eight texture reads), computes their sum, and writes that value to the output pixel in another texture (four or eight times smaller), which is then bound as the input to the same fragment shader; the process is repeated until the output is a single pixel that contains the final result of the reduction. CTM and CUDA express the same process more directly, for example by launching a set of threads, each of which reads two elements and writes their sum to a single element. Half of the threads then repeat this process, then half of the remaining threads, and so on until a single surviving thread writes the final result to memory.
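The first three primitives can be sketched sequentially in a few lines. The code below is only a CPU-side illustration of their semantics: the function names are hypothetical, and each loop marked as parallel stands for work that a GPU would distribute over threads.

```python
# Sequential sketches of gather, scatter, map, and tree reduce (names are assumptions).
import operator


def gather(src, indices):
    """Read from computed locations: out[k] = src[indices[k]]."""
    return [src[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """Write to computed locations: out[indices[k]] = values[k]."""
    out = [fill] * size
    for value, i in zip(values, indices):
        out[i] = value
    return out


def parallel_map(op, xs):
    """Apply op to every element; each application is independent of the others."""
    return [op(x) for x in xs]


def tree_reduce(op, xs):
    """Pairwise reduction: the number of active 'threads' halves on every pass."""
    data = list(xs)
    if not data:
        raise ValueError("cannot reduce an empty collection")
    stride = 1
    while stride < len(data):
        # Iterations of this inner loop touch disjoint pairs, so they could run in parallel.
        for i in range(0, len(data) - stride, 2 * stride):
            data[i] = op(data[i], data[i + stride])
        stride *= 2
    return data[0]


total = tree_reduce(operator.add, range(1, 9))
largest = tree_reduce(max, [3, 9, 2, 7, 5])
```

For n elements the tree reduction needs about log2(n) passes instead of n - 1 sequential steps, which is the source of its parallel speedup. The result only matches the sequential loop in general when the operation is associative, as the text requires.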

Figure 12: Scan performance on the CPU, on a graphics-based GPU (using OpenGL), and on a direct-compute GPU (using CUDA). Results were obtained on a GeForce 8800 GTX GPU and an Intel Core2Duo Extreme 2.93 GHz CPU. Figure taken from H. Nguyen (ed.), GPU Gems 3, copyright (c) 2008 NVIDIA Corporation, published by Addison-Wesley Professional.

Scan: sometimes called parallel prefix sum, scan takes an array A of elements and returns an array B of the same length in which each element B[i] represents a reduction of the subarray A[1...i]. Scan is an extremely useful building block for data-parallel algorithms; Blelloch describes many potential applications of scan, from quicksort to sparse matrix operations [9]. Harris et al. [10] present an efficient implementation of scan using CUDA (Figure 12); their results illustrate the advantages of direct computing over graphics-based GPU computing. The CUDA implementation is faster than the CPU by a factor of up to 20 and faster than OpenGL by a factor of up to 7.
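The definition of scan can be illustrated with a short sketch. The code uses the simple step-efficient formulation usually attributed to Hillis and Steele, chosen here for brevity; it is not claimed to be the implementation evaluated in [10], and the function name is hypothetical. Python indices start at 0, so element i of the result reduces the first i + 1 inputs.

```python
# Inclusive scan, Hillis-Steele style (an assumption made for this illustration).

def inclusive_scan(op, xs):
    """Return B where B[i] is the reduction of xs[0..i] under the associative op."""
    data = list(xs)
    offset = 1
    while offset < len(data):
        prev = data
        # Every element of the new buffer depends only on prev, so on a GPU each
        # index i could be handled by its own thread within this pass.
        data = [
            prev[i] if i < offset else op(prev[i - offset], prev[i])
            for i in range(len(prev))
        ]
        offset *= 2
    return data


prefix_sums = inclusive_scan(lambda a, b: a + b, [3, 1, 7, 0, 4, 1, 6, 3])
```

This variant finishes in about log2(n) passes but performs on the order of n log n operations in total; work-efficient variants trade extra passes for fewer operations.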
1.2.6.2. Algorithms and Applications

Building largely on the primitives above, researchers have demonstrated many higher-level algorithms and applications that exploit the computational strengths of the GPU. A survey of GPU computing algorithms and their application domains can be found in [~13].

Sort: GPUs have improved markedly at sorting as the GPU computing community has rediscovered, adapted, and improved sorting algorithms, notably bitonic merge sort [~6]. This "sorting network" algorithm is intrinsically parallel and oblivious, meaning that the same steps are executed regardless of the input. Govindaraju et al. won the "PennySort" performance category of the 2005 "TeraSort" competition [~29] through careful system design and a combination of many algorithmic improvements.

Search and database queries: Researchers have also implemented several forms of search on the GPU, such as binary search (for example Horn [~4]) and nearest-neighbor search [~2], as well as database operations built on special-purpose graphics hardware (the depth and stencil buffers) and on the fast sorting algorithms mentioned above [~28], [~27].

Differential equations: The earliest attempts to use GPUs for non-graphics computation focused on solving large sets of differential equations. Particle tracing is a common GPU application for ordinary differential equations (ODEs), used heavily in scientific visualization (for example the flow exploration system of Krüger [~15]) and in visual effects for computer games. GPUs have been used heavily to solve problems in partial differential equations (PDEs), such as the Navier-Stokes equations for incompressible fluid flow. Particularly successful applications of GPU PDE solvers include fluid dynamics (for example Bolz [~12]) and level-set equations for volume segmentation [~1].

Linear algebra: Linear algebra routines are core building blocks for a huge class of numerical algorithms, including the PDE solvers mentioned above. Applications include the simulation of physical effects such as fluids, heat, and radiation, optical effects such as depth of field [~23], and so on; accordingly, linear algebra on GPUs has received a great deal of attention. A representative example is the work of Krüger and Westermann [~14], which addresses a broad class of linear algebra problems by focusing on the representation of matrices and vectors in graphics-based GPU computing (for example, packing dense and sparse vectors into textures, vertex buffers, and so on).

Other notable work includes the analysis of dense matrix multiplication by Fatahalian et al. [~19] and the solver for dense linear systems by Galoppo et al. [~26], which the authors show can outperform even highly optimized ATLAS implementations. Direct-compute layers such as CUDA and CTM both simplify linear algebra on the GPU and improve its performance. For example, NVIDIA provides CUBLAS, a dense linear algebra package implemented in CUDA that follows the popular BLAS conventions. Sparse linear algebra algorithms, which are more varied and more complicated than dense ones, are an open and active research area; researchers expect sparse codes to see similar or greater benefits from the new GPU computing layers.
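The bitonic merge sort mentioned under Sort above can be sketched as a sorting network. This is a sequential illustration with a hypothetical function name and the common restriction to power-of-two lengths; it is not the implementation of any of the cited works.

```python
# Bitonic sorting network, written sequentially (illustrative sketch).

def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic sorting network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared elements
            # The (i, partner) pairs of one pass are disjoint, so on a GPU every
            # index i could be handled by its own thread.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data


sorted_values = bitonic_sort([5, 1, 4, 8, 2, 7, 3, 6])
```

The sequence of compared index pairs depends only on n, never on the data, which is what "oblivious" means here and why the network maps well onto lockstep execution.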
1.2.6.3. Summary

Several recurring themes emerge across the algorithms and applications explored in GPU computing so far. Examining these themes lets us characterize what GPUs do well. Successful GPU computing applications have the following characteristics.

Emphasize parallelism: GPUs are fundamentally parallel machines, and using them efficiently depends on a high degree of parallelism in the workload. For example, NVIDIA's CUDA prefers to run thousands of threads at a time, maximizing the opportunities to hide memory latency through multithreading. Emphasizing parallelism requires choosing algorithms that divide the computational domain into as many independent pieces as possible. To maximize the number of simultaneously running threads, GPU programmers should also minimize each thread's use of shared resources (such as local registers and CUDA shared memory) and should use synchronization between threads sparingly.

Minimize SIMD divergence: As noted in section 2.3, GPUs provide an SPMD programming model: many threads run the same program but access different data and may therefore diverge in their execution. At some granularity, however, GPUs execute batches of threads in SIMD fashion (such as CUDA "warps", described in Chapter 2). If the threads within a batch diverge, the entire batch executes both code paths until the threads reconverge. High-performance GPU computing therefore requires structuring code to minimize divergence within batches.

Maximize arithmetic intensity: In today's computing landscape, actual computation is relatively cheap but bandwidth is precious. This is especially true of GPUs, with their abundant floating-point power. Making full use of that power requires structuring the algorithm to maximize its arithmetic intensity, that is, the number of numeric computations performed per memory transaction. Coherent data accesses by individual threads help, because these accesses can be coalesced into fewer memory transactions in total. Using CUDA shared memory on NVIDIA GPUs also helps: it reduces overfetch (because threads can communicate) and enables strategies for "blocking" the computation in on-chip memory.

Exploit streaming bandwidth: Despite the importance of arithmetic intensity, it is worth noting that GPUs have very high peak bandwidth to their onboard memory, on the order of ten times the CPU-to-memory bandwidth of typical PC platforms. This is why GPUs can outperform CPUs at tasks such as sorting, which have a low ratio of computation to bandwidth. Achieving high performance in such applications requires streaming memory access patterns, in which threads read from and write to large coherent blocks (maximizing bandwidth per transaction) located in separate regions of memory (avoiding data hazards).

Experience has shown that when algorithms and applications can follow these design principles for GPU computing, as do the PDE solvers, linear algebra packages, and database systems mentioned above, as well as game physics and molecular dynamics applications, they can achieve speedups of 10 to 100 times over mature, optimized CPU code.
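Arithmetic intensity, as defined above, can be estimated by counting operations and memory accesses. The figures below are idealized counts under stated assumptions (one memory access per word, perfect reuse in the blocked case, no caches otherwise); the function names are hypothetical and the numbers are not measurements.

```python
# Idealized operation and memory-access counts (assumptions, not measurements).

def saxpy_intensity(n):
    """y[i] = a * x[i] + y[i]: two operations and three memory accesses per element."""
    operations = 2 * n
    memory_accesses = 3 * n          # read x[i], read y[i], write y[i]
    return operations / memory_accesses


def matmul_intensity_naive(n):
    """n x n matrix product in which every multiply-add re-reads both operands."""
    operations = 2 * n ** 3
    memory_accesses = 2 * n ** 3 + n ** 2   # operand reads plus one write per result
    return operations / memory_accesses


def matmul_intensity_blocked_ideal(n):
    """Upper bound with perfect reuse: each matrix element is moved exactly once."""
    operations = 2 * n ** 3
    memory_accesses = 3 * n ** 2     # read A, read B, write C
    return operations / memory_accesses


streaming = saxpy_intensity(1000)
naive = matmul_intensity_naive(300)
blocked = matmul_intensity_blocked_ideal(300)
```

A streaming kernel such as SAXPY stays below one operation per access no matter how it is written, so it is limited by bandwidth. Matrix multiplication, in contrast, can move from roughly one operation per access toward hundreds when data is reused from on-chip memory, which is the point of the "blocking" strategies mentioned in the text.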

Chng 2. H THNG CHNG TRNH DCH V NGN NG LP TRNH GPU


2.1. Introduction to the CUDA development environment
CUDA, short for Compute Unified Device Architecture, is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device, without the need to map them to a graphics API. It is available on NVIDIA's GeForce 8 Series, Quadro FX 5600/4600, and Tesla solutions. The operating system's multitasking mechanism is responsible for managing access to the GPU by CUDA and graphics applications running concurrently. The CUDA software stack consists of the layers shown in Figure 13: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of commonly used functions, CUFFT and CUBLAS. The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance.

Figure 13: The CUDA software stack

The CUDA API comprises an extension to the C programming language. CUDA provides general DRAM memory addressing, as shown in Figure 14, for more programming flexibility: both gather and scatter memory operations. From a programming perspective, this translates into the ability to read and write data at any location in DRAM, just like on a CPU.

Figure 14: The gather and scatter memory operations
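As a minimal host-side sketch of what gather and scatter mean (plain C rather than device code; the function names `gather` and `scatter` and the index-array convention are assumptions made for illustration):

```c
#include <stddef.h>

/* Gather: each output element reads from an arbitrary input location. */
void gather(const float *src, const size_t *index, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[index[i]];
}

/* Scatter: each input element is written to an arbitrary output location.
   The caller is expected to pass distinct indices; with repeated indices
   the last write to a location wins in this sequential sketch. */
void scatter(const float *src, const size_t *index, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[index[i]] = src[i];
}
```

On the device, each loop iteration would correspond to one thread, and the arbitrary read or write address is what general DRAM addressing makes possible.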

CUDA features a parallel data cache, or on-chip shared memory, with very fast general read and write access, which threads use to share data with each other. As shown in Figure 15, applications can take advantage of it by minimizing overfetch and round trips to DRAM, and thereby become less dependent on DRAM memory bandwidth.

Figure 15: Shared memory brings data closer to the ALUs (panels: without shared memory; with shared memory)

2.2. Programming model


2.2.1. A highly multithreaded coprocessor
When programmed through CUDA, the GPU is viewed as a compute device capable of executing a very large number of threads in parallel. It operates as a coprocessor to the main CPU, or host. In other words, data-parallel, compute-intensive portions of applications running on the host are off-loaded onto the device. More precisely, a portion of an application that is executed many times, but independently on different data, can be isolated into a function that is executed on the device as many different threads. To that effect, such a function is compiled to the instruction set of the device, and the resulting program, called a kernel, is downloaded to the device. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. Data can be copied between the two through optimized API calls that use the device's high-performance direct memory access (DMA) engines.
2.2.2. Thread batching
The batch of threads that executes a kernel is organized as a grid of thread blocks, as described in Sections 2.2.2.1 and 2.2.2.2 below.
2.2.2.1. Thread block

A thread block is a batch of threads that can cooperate by efficiently sharing data through fast shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, synchronization points can be specified in the kernel, at which the threads of a block are suspended until they have all reached the synchronization point. Each thread is identified by its thread ID, which is the thread's number within the block. To help with complex addressing based on the thread ID, an application can also specify a block as a two- or three-dimensional array of arbitrary size and identify each thread using a 2- or 3-component index instead. For a two-dimensional block of size (Dx, Dy), the thread ID of the thread with index (x, y) is (x + y Dx), and for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of the thread with index (x, y, z) is (x + y Dx + z Dx Dy).
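The two thread ID formulas above can be written out directly. The following is a plain C sketch with hypothetical helper names (`thread_id_2d`, `thread_id_3d`); it only reproduces the arithmetic and is not part of the CUDA API.

```c
/* Thread ID of the thread with index (x, y) in a block of size (dx, dy).
   dy does not appear in the formula, so it is not a parameter. */
unsigned int thread_id_2d(unsigned int x, unsigned int y, unsigned int dx)
{
    return x + y * dx;
}

/* Thread ID of the thread with index (x, y, z) in a block of size
   (dx, dy, dz). dz does not appear in the formula. */
unsigned int thread_id_3d(unsigned int x, unsigned int y, unsigned int z,
                          unsigned int dx, unsigned int dy)
{
    return x + y * dx + z * dx * dy;
}
```

The x component varies fastest, so threads that are adjacent in x receive consecutive thread IDs.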
2.2.2.2. Grid of thread blocks

The maximum number of threads that a block can contain is limited. However, blocks of the same dimensionality and size that execute the same kernel can be batched together into a grid of blocks, so that the total number of threads that can be launched in a single kernel invocation is much larger. This comes at the expense of reduced thread cooperation, because threads in different blocks of the same grid cannot communicate or synchronize with each other. This model allows kernels to run efficiently, without recompilation, on devices with different parallel capabilities: a device may run all the blocks of a grid sequentially if it has very little parallel capability, in parallel if it has a lot, or using a combination of both. Each block is identified by its block ID, which is the block's number within the grid. To help with complex addressing based on the block ID, an application can also specify a grid as a two-dimensional array of arbitrary size and identify each block using a 2-component index instead. For a two-dimensional grid of size (Dx, Dy), the block ID of the block with index (x, y) is (x + y Dx).

Figure 16: Thread blocks

2.2.3. Memory model
A thread that executes on the device has access only to the device's DRAM and on-chip memory, through the following memory spaces, as shown in Figure 17:
- Read-write per-thread registers
- Read-write per-thread local memory
- Read-write per-block shared memory
- Read-write per-grid global memory
- Read-only per-grid constant memory
- Read-only per-grid texture memory

The global, constant, and texture memory spaces can be read from or written to by the host and are persistent across kernel launches by the same application. These three memory spaces are optimized for different memory usages. Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats.

Figure 17: Memory model

2.3. Hardware implementation


2.3.1. A set of SIMD multiprocessors with on-chip shared memory
The device is implemented as a set of multiprocessors, as shown in Figure 18. Each multiprocessor has a single instruction, multiple data (SIMD) architecture: at any given clock cycle, each processor of the multiprocessor executes the same instruction, but operates on different data. Each multiprocessor has on-chip memory of the following four types:

- One set of local 32-bit registers per processor.
- A parallel data cache, or shared memory, that is shared by all the processors and implements the shared memory space.
- A read-only constant cache that is shared by all the processors and speeds up reads from the constant memory space, which is implemented as a read-only region of device memory.
- A read-only texture cache that is shared by all the processors and speeds up reads from the texture memory space, which is implemented as a read-only region of device memory.

The local and global memory spaces are implemented as read-write regions of device memory and are not cached. Each multiprocessor accesses the texture cache through a texture unit, which implements the various addressing modes and data filtering mentioned in Section 2.2.3.

Figure 18: Hardware model

2.3.2. Execution model
A grid of thread blocks is executed on the device by executing one or more blocks on each multiprocessor using time slicing. Each block is split into SIMD groups of threads called warps; each warp contains the same number of threads, called the warp size, and is executed by the multiprocessor in SIMD fashion. A thread scheduler periodically switches from one warp to another to maximize the use of the multiprocessor's computational resources. A half-warp is either the first or the second half of a warp.

The way a block is split into warps is always the same: each warp contains threads of consecutive, increasing thread IDs, with the first warp containing thread 0. Section 2.2.2.1 describes how thread IDs relate to thread indices in the block. A block is processed by only one multiprocessor, so that the shared memory space resides in on-chip shared memory, leading to very fast memory accesses. The multiprocessor's registers are allocated among the threads of the block. If the number of registers used per thread multiplied by the number of threads in the block is greater than the total number of registers per multiprocessor, the block cannot be executed and the corresponding kernel fails to launch. Several blocks can be processed concurrently by the same multiprocessor by allocating the multiprocessor's registers and shared memory among the blocks. The issue order of the warps within a block is undefined, but their execution can be synchronized, as described in Section 2.2.2.1, to coordinate global or shared memory accesses. The issue order of the blocks within a grid is undefined and there is no synchronization mechanism between blocks, so threads from different blocks of the same grid cannot safely communicate with each other through global memory during the execution of the grid. If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur at that location and the order in which they occur are undefined, but one of the writes is guaranteed to succeed. If an atomic instruction (see Section 2.4.4.6) executed by a warp reads, modifies, and writes the same location in global memory for more than one of the threads of the warp, each read-modify-write to that location occurs and they are all serialized, but the order in which they occur is undefined.
2.3.3. Compute capability
The compute capability of a device is defined by a major revision number and a minor revision number. Devices with the same major revision number are of the same core architecture. The GeForce 8 Series, Quadro FX 5600/4600, and Tesla solutions are of compute capability 1.x (their major revision number is 1). The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features. The GeForce 8800 Series, Quadro FX 5600/4600, and Tesla solutions are of compute capability 1.0 (their minor revision number is 0), and the GeForce 8600 and 8500 Series are of compute capability 1.1.
The technical specifications of the various compute capabilities are given in Appendix A of [99].
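The warp-splitting and register-allocation rules of Section 2.3.2 reduce to simple integer arithmetic. The sketch below is plain C with hypothetical helper names; the warp size and register counts passed in are example inputs, not values taken from this text.

```c
/* Index of the warp containing a thread: warps hold consecutive thread IDs,
   and the first warp contains thread 0. */
unsigned int warp_index(unsigned int thread_id, unsigned int warp_size)
{
    return thread_id / warp_size;
}

/* Number of warps needed for a block (the last warp may be partly filled). */
unsigned int warps_per_block(unsigned int threads_per_block,
                             unsigned int warp_size)
{
    return (threads_per_block + warp_size - 1) / warp_size;
}

/* Returns 1 if a block satisfies the register rule, 0 otherwise:
   registers per thread times threads per block must not exceed the
   number of registers available on one multiprocessor. */
int block_fits_registers(unsigned int regs_per_thread,
                         unsigned int threads_per_block,
                         unsigned int regs_per_multiprocessor)
{
    return regs_per_thread * threads_per_block <= regs_per_multiprocessor;
}
```

For instance, with an assumed warp size of 32, threads 0 to 31 fall in warp 0 and thread 32 starts warp 1; a block whose register demand exceeds the multiprocessor's budget is the case in which the kernel fails to launch.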

2.3.4. Multiple devices
The use of multiple GPUs as CUDA devices by an application running on a multi-GPU system is only guaranteed to work if these GPUs are of the same type. If the system is in SLI mode, however, only one GPU can be used as a CUDA device, since all the GPUs are fused at the lowest levels of the driver stack. SLI mode needs to be turned off in the control panel for CUDA to be able to see each GPU as a separate device.
2.3.5. Mode switches
GPUs dedicate some DRAM memory to the so-called primary surface, which is used to refresh the display device whose output is viewed by the user. When users initiate a mode switch of the display by changing the resolution or bit depth of the display (using the NVIDIA control panel or the Display control panel on Windows), the amount of memory needed for the primary surface changes. For example, if the user changes the display resolution from 1280x1024x32-bit to 1600x1200x32-bit, the system must dedicate 7.68 MB to the primary surface rather than 5.24 MB. (Full-screen graphics applications running with anti-aliasing enabled may require much more display memory for the primary surface.) On Windows, other events that may initiate display mode switches include launching a full-screen DirectX application, pressing Alt+Tab to task-switch away from a full-screen DirectX application, or pressing Ctrl+Alt+Del to lock the computer. If a mode switch increases the amount of memory needed for the primary surface, the system may have to take memory away from allocations dedicated to CUDA applications, resulting in a crash of these applications.
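The two primary-surface sizes quoted in Section 2.3.5 follow from width times height times bytes per pixel. A minimal C sketch, with a hypothetical helper name and assuming that 1 MB means 10^6 bytes (which is what the quoted figures imply):

```c
/* Bytes needed for a primary surface of the given resolution and bit depth.
   Assumes the bit depth is a multiple of 8. */
unsigned long primary_surface_bytes(unsigned long width,
                                    unsigned long height,
                                    unsigned long bits_per_pixel)
{
    return width * height * (bits_per_pixel / 8);
}
```

Calling it with 1280, 1024, 32 gives 5,242,880 bytes (about 5.24 MB), and with 1600, 1200, 32 gives 7,680,000 bytes (7.68 MB), so this mode switch needs roughly 2.4 MB of additional display memory.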

2.4. Application programming interface


2.4.1. An extension to the C programming language
The goal of the CUDA programming interface is to provide a relatively simple path for users familiar with the C programming language to write programs for execution by the device. It consists of:
- A minimal set of extensions to the C language, described in Section 2.4.2, that allow the programmer to target portions of the source code for execution on the device.
- A runtime library split into:
+ A host component, described in Section 2.4.5, that runs on the host and provides functions to control and access one or more compute devices from the host.
+ A device component, described in Section 2.4.4, that runs on the device and provides device-specific functions.
+ A common component, described in Section 2.4.3, that provides built-in vector types and a subset of the C standard library supported in both host and device code.

It should be emphasized that the only functions from the C standard library that are supported to run on the device are those provided by the common runtime component.
2.4.2. Language extensions
The extensions to the C programming language are four-fold:
- Function type qualifiers, to specify whether a function executes on the host or on the device and whether it is callable from the host or from the device (Section 2.4.2.1);
- Variable type qualifiers, to specify the memory location of a variable on the device (Section 2.4.2.2);
- A new directive, to specify how a kernel is executed on the device from the host (Section 2.4.2.3);
- Four built-in variables that specify the grid and block dimensions and the block and thread indices (Section 2.4.2.4).

Each source file containing these extensions must be compiled with the CUDA compiler nvcc, as briefly described in Section 2.4.2.5. A detailed description of nvcc can be found in a separate document. Each of these extensions comes with some restrictions, described in the sections below. nvcc gives an error or a warning on some violations of these restrictions, but some violations cannot be detected.
2.4.2.1. Function type qualifiers

2.4.2.1.1. __device__
The __device__ qualifier declares a function that is:
- Executed on the device
- Callable from the device only

2.4.2.1.2. __global__
The __global__ qualifier declares a function as being a kernel. Such a function is:
- Executed on the device
- Callable from the host only

2.4.2.1.3. __host__
The __host__ qualifier declares a function that is:
- Executed on the host
- Callable from the host only

It is equivalent to declare a function with only the __host__ qualifier or to declare it without any of the __host__, __device__, or __global__ qualifiers; in either case the function is compiled for the host only. However, the __host__ qualifier can also be used in combination with the __device__ qualifier, in which case the function is compiled for both the host and the device.
2.4.2.1.4. Restrictions
- __device__ functions are always inlined.
- __device__ and __global__ functions do not support recursion.
- __device__ and __global__ functions cannot declare static variables inside their body.
- __device__ and __global__ functions cannot have a variable number of arguments.
- __device__ functions cannot have their address taken; function pointers to __global__ functions, on the other hand, are supported.
- The __global__ and __host__ qualifiers cannot be used together.
- __global__ functions must have a void return type.
- Any call to a __global__ function must specify its execution configuration, as described in Section 2.4.2.3.
- A call to a __global__ function is asynchronous, meaning that it returns before the device has completed its execution.
- __global__ function parameters are currently passed to the device via shared memory and are limited to 256 bytes.
2.4.2.2. Variable type qualifiers
2.4.2.2.1. __device__
The __device__ qualifier declares a variable that resides on the device. At most one of the other type qualifiers defined in the next two sections may be used together with __device__ to further specify which memory space the variable belongs to. If none of them is present, the variable:
- Resides in the global memory space
- Has the lifetime of an application
- Is accessible from all the threads within the grid and from the host through the runtime library

2.4.2.2.2. __constant__
The __constant__ qualifier, optionally used together with __device__, declares a variable that:
- Resides in the constant memory space
- Has the lifetime of an application
- Is accessible from all the threads within the grid and from the host through the runtime library

2.4.2.2.3. __shared__
The __shared__ qualifier, optionally used together with __device__, declares a variable that:
- Resides in the shared memory space of a thread block
- Has the lifetime of the block
- Is only accessible from the threads within the block

There is full sequential consistency of shared variables within a thread. Only after the execution of a __syncthreads() (Section 2.4.4.2) are the writes from other threads guaranteed to be visible. The compiler is free to optimize the reads and writes to shared memory as long as the previous statement is met. When declaring a variable in shared memory as an external array such as
extern __shared__ float shared[];

the size of the array is determined at launch time (see Section 2.4.2.3). All variables declared in this fashion start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets. For example, if one wants the equivalent of
short array0[128];
float array1[64];
int array2[256];

in dynamically allocated shared memory, one can declare and initialize the arrays in the following way:

extern __shared__ char array[];
__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2 = (int*)&array1[64];
}
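The byte offsets implied by this pointer arithmetic can be computed on the host. The following plain C sketch uses a hypothetical `shared_layout` structure and assumes the usual type sizes (2-byte short, 4-byte float and int); it is an illustration of the offset bookkeeping, not code from the CUDA toolkit.

```c
#include <stddef.h>

/* Byte offsets of the three sub-arrays inside one shared buffer. */
struct shared_layout {
    size_t array0_offset;  /* short array0[128] */
    size_t array1_offset;  /* float array1[64]  */
    size_t array2_offset;  /* int   array2[256] */
    size_t total_bytes;    /* size of the whole dynamically allocated buffer */
};

/* Places the arrays back to back, in the same order as the device code. */
struct shared_layout compute_shared_layout(void)
{
    struct shared_layout l;
    l.array0_offset = 0;
    l.array1_offset = l.array0_offset + 128 * sizeof(short);
    l.array2_offset = l.array1_offset + 64 * sizeof(float);
    l.total_bytes   = l.array2_offset + 256 * sizeof(int);
    return l;
}
```

Under those size assumptions the arrays start at bytes 0, 256, and 512, and the buffer needs 1536 bytes in total, which is the amount of dynamic shared memory the launch would have to request.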

2.4.2.2.4. Restrictions
- These qualifiers are not allowed on struct and union members, on formal parameters, or on local variables within a function that executes on the host.
- __shared__ and __constant__ cannot be used in combination with each other.
- __shared__ and __constant__ variables have implied static storage.
- __device__ and __constant__ variables are only allowed at file scope.
- __constant__ variables cannot be assigned to from the device, only from the host through host runtime functions.
- __shared__ variables cannot have an initialization as part of their declaration.

An automatic variable declared in device code without any of these qualifiers generally resides in a register. In some cases, however, the compiler may choose to place it in local memory. This is often the case for large structures or arrays that would consume too much register space, and for arrays that the compiler cannot determine to be indexed with constant quantities. Inspecting the ptx assembly code (obtained by compiling with the -ptx or -keep option) shows whether a variable has been placed in local memory during the first compilation phases: it is then declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. If it has not, subsequent compilation phases may still decide otherwise if they find that it consumes too much register space for the targeted architecture.

Pointers in code that is executed on the device are supported as long as the compiler is able to resolve whether they point to the shared memory space or to the global memory space; otherwise they are restricted to point only to memory allocated or declared in the global memory space. Dereferencing a pointer to global or shared memory in code that is executed on the host, or a pointer to host memory in code that is executed on the device, results in undefined behavior, most often a segmentation fault and application termination.
2.4.2.3. Execution configuration

Any call to a __global__ function must specify the execution configuration for that call. The execution configuration defines the dimensions of the grid and blocks that will be used to execute the function on the device. It is specified by inserting an expression of the form <<< Dg, Db, Ns >>> between the function name and the parenthesized argument list, where:
- Dg is of type dim3 (see Section 2.4.3.1.2) and specifies the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched;
- Db is of type dim3 (see Section 2.4.3.1.2) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;
- Ns is of type size_t and specifies the number of bytes of shared memory that are dynamically allocated per block for this call, in addition to the statically allocated memory. This dynamically allocated memory is used by any variable declared as an external array, as mentioned in Section 2.4.2.2.3. Ns is an optional argument that defaults to 0.

The arguments of the execution configuration are evaluated before the actual function arguments. As an example, a function declared as
__global__ void Func(float* parameter);

must be called like this:

Func<<< Dg, Db, Ns >>>(parameter);

2.4.2.4. Built-in variables
2.4.2.4.1. gridDim
This variable is of type dim3 (see Section 2.4.3.1.2) and contains the dimensions of the grid.
2.4.2.4.2. blockIdx
This variable is of type uint3 (see Section 2.4.3.1.1) and contains the block index within the grid.
2.4.2.4.3. blockDim
This variable is of type dim3 (see Section 2.4.3.1.2) and contains the dimensions of the block.
2.4.2.4.4. threadIdx
This variable is of type uint3 (see Section 2.4.3.1.1) and contains the thread index within the block.
2.4.2.4.5. Restrictions
- It is not allowed to take the address of any of the built-in variables.
- It is not allowed to assign values to any of the built-in variables.
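How the execution configuration and the built-in variables combine can be sketched on the host. The plain C code below emulates a one-dimensional launch with two nested loops standing in for blockIdx.x and threadIdx.x; the function names and the element-scaling task are hypothetical, and the index expression block_idx * block_dim + thread_idx is a common convention rather than something prescribed by the text above.

```c
/* Number of blocks needed so that blocks * threads_per_block >= n. */
unsigned int blocks_for(unsigned int n, unsigned int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Host-side emulation of a one-dimensional launch: visits every
   (block index, thread index) pair and scales one element per thread. */
void scale_emulated(float *data, unsigned int n, float factor,
                    unsigned int block_dim)
{
    unsigned int grid_dim = blocks_for(n, block_dim);
    for (unsigned int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (unsigned int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            /* Global element index handled by this emulated thread. */
            unsigned int i = block_idx * block_dim + thread_idx;
            if (i < n)  /* the last block may have idle threads */
                data[i] *= factor;
        }
    }
}
```

On the device the two loops disappear: every (block, thread) pair runs as its own thread, and the grid and block sizes computed here would be the Dg and Db arguments of the launch.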
2.4.2.5. Compilation with NVCC
nvcc is a compiler driver that simplifies the process of compiling CUDA code: it provides simple and familiar command-line options and executes them by invoking the collection of tools that implement the different compilation stages.
nvcc's basic workflow consists of separating device code from host code and compiling the device code into a binary form, or cubin object. The generated host code is output either as C code, left to be compiled using another tool, or as object code directly, by invoking the host compiler during the last compilation stage.

Applications can either ignore the generated host code, load the cubin object onto the device, and launch the device code using the CUDA driver API (see Section 2.4.5.3), or link to the generated host code, which includes the cubin object as a global initialized data array and contains a translation of the execution configuration syntax described in Section 2.4.2.3 into the CUDA runtime startup code needed to load and launch each compiled kernel (see Section 2.4.5.2). The front end of the compiler processes CUDA source files according to C++ syntax rules. However, only the C subset of C++ is supported. This means that C++-specific features such as classes, inheritance, or declaration of variables within basic blocks are not supported. As a consequence of the use of C++ syntax rules, void pointers (for example, those returned by malloc()) cannot be assigned to non-void pointers without a typecast. A detailed description of nvcc can be found in a separate document.
2.4.3. Common runtime component
The common runtime component can be used by both host and device functions.
2.4.3.1. Built-in vector types

2.4.3.1.1. char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4
These are vector types derived from the basic integer and floating-point types. They are structures with 1, 2, 3, or 4 components, accessible through the fields x, y, z, and w, respectively. They all come with a constructor function of the form make_<type name>; for example,
int2 make_int2(int x, int y);

creates a vector of type int2 with value (x, y).

2.4.3.1.2. dim3 type
This type is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
2.4.3.1.3. Mathematical functions
Table B-1 in [99] contains a comprehensive list of the C/C++ standard library mathematical functions that are currently supported, along with their respective error bounds when executed on the device. When executed in host code, a given function uses the C runtime implementation if available.
2.4.3.1.4. Time function
clock_t clock();

returns the value of a counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result per thread provides, for each thread, a measure of the number of clock cycles taken by the device to completely execute the thread, but not of the number of clock cycles the device actually spent executing the thread's instructions. The former number is greater than the latter because threads are time-sliced.
2.4.3.1.5. Texture type
CUDA supports a subset of the texturing hardware that the GPU uses for graphics to access texture memory. Reading data from texture memory instead of global memory can have several performance benefits, as described in Section 2.5.4. Texture memory is read from kernels using device functions called texture fetches, described in Section 2.4.4.5. The first parameter of a texture fetch specifies an object called a texture reference. A texture reference defines which part of texture memory is fetched. It must be bound through host runtime functions (Sections 2.4.5.2.4 and 2.4.5.3.7) to some region of memory, called a texture, before it can be used by a kernel. Several distinct texture references may be bound to the same texture or to textures that overlap in memory. A texture reference has several attributes. One of them is its dimensionality, which specifies whether the texture is addressed as a one-dimensional array using one texture coordinate, or as a two-dimensional array using two texture coordinates. Elements of the array are called texels, short for "texture elements". Other attributes define the input and output data types of the texture fetch, as well as how the input coordinates are interpreted and what processing should be done.
2.4.3.1.6. Texture reference declaration
Some of the attributes of a texture reference are immutable and must be known at compile time; they are specified when declaring the texture reference. A texture reference is declared at file scope as a variable of type texture:
texture<Type, Dim, ReadMode> texRef;

where:
- Type specifies the type of data that is returned when fetching the texture; Type is restricted to the basic integer and floating-point types and any of the 1-, 2-, and 4-component vector types defined in Section 2.4.3.1.1;
- Dim specifies the dimensionality of the texture reference and is equal to 1 or 2; Dim is an optional argument that defaults to 1;
- ReadMode is equal to cudaReadModeNormalizedFloat or cudaReadModeElementType. If it is cudaReadModeNormalizedFloat and Type is a 16-bit or 8-bit integer type, the value is actually returned as a floating-point type and the full range of the integer type is mapped to [0.0, 1.0]; for example, an unsigned 8-bit texture element with the value 0xFF reads as 1. If it is cudaReadModeElementType, no conversion is performed. ReadMode is an optional argument that defaults to cudaReadModeElementType.

2.4.3.1.7. Runtime texture reference attributes
The other attributes of a texture reference are mutable and can be changed at runtime through the host runtime (Section 2.4.5.2.4 for the runtime API and Section 2.4.5.3.7 for the driver API). They specify whether texture coordinates are normalized or not, the addressing mode, and texture filtering, as detailed below.

By default, textures are referenced using floating-point coordinates in the range [0, N), where N is the size of the texture in the dimension corresponding to the coordinate. For example, a texture that is 64x32 in size is referenced with coordinates in the ranges [0, 63] and [0, 31] for the x and y dimensions, respectively. Normalized texture coordinates cause the coordinates to be specified in the range [0.0, 1.0) instead of [0, N), so the same 64x32 texture would be addressed by normalized coordinates in the range [0, 1) in both the x and y dimensions. Normalized texture coordinates are a natural fit for some applications' requirements, when it is preferable for the texture coordinates to be independent of the texture size.

The addressing mode defines what happens when texture coordinates are out of range. When using unnormalized texture coordinates, texture coordinates outside the range [0, N) are clamped: values below 0 are set to 0 and values greater than or equal to N are set to N-1. Clamping is also the default addressing mode when using normalized texture coordinates: values below 0.0 or above 1.0 are clamped to the range [0.0, 1.0). For normalized coordinates, the "wrap" addressing mode may also be specified. Wrap addressing is usually used when the texture contains a periodic signal. It uses only the fractional part of the texture coordinate; for example, 1.25 is treated the same as 0.25 and -1.25 is treated the same as 0.75.

Linear texture filtering may be done only for textures that are configured to return floating-point data. It performs low-precision interpolation between neighboring texels. When it is enabled, the texels surrounding a texture fetch location are read and the return value of the texture fetch is interpolated based on where the texture coordinates fall between the texels. Simple linear interpolation is performed for one-dimensional textures and bilinear interpolation for two-dimensional textures.
2.4.3.1.8. Texturing from linear memory versus CUDA arrays
A texture can be any region of linear memory or a CUDA array (see Section 2.4.5.1.2). Textures allocated in linear memory:

- Can only be of dimensionality equal to 1;
- Do not support texture filtering;
- Can only be addressed using a non-normalized integer texture coordinate;
- Do not support the various addressing modes: out-of-range texture accesses return zero.

The hardware enforces an alignment requirement on texture base addresses. To abstract this alignment requirement from developers, the functions that bind texture references onto device memory pass back a byte offset that must be applied to texture fetches in order to read from the desired memory. The base pointers returned by CUDA's allocation routines conform to this alignment constraint, so applications can avoid offsets altogether by passing allocated pointers to cudaBindTexture()/cuTexRefSetAddress().
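The clamp and wrap addressing modes and the normalized-float read mode described in Sections 2.4.3.1.6 and 2.4.3.1.7 can be mimicked on the host. The plain C sketch below uses hypothetical function names and only reproduces the arithmetic stated in the text; it says nothing about how the texture hardware implements these modes.

```c
/* Clamp addressing for a non-normalized coordinate in a texture of size n:
   values below 0 become 0, values >= n become n - 1. */
int clamp_unnormalized(int x, int n)
{
    if (x < 0) return 0;
    if (x >= n) return n - 1;
    return x;
}

/* Wrap addressing for a normalized coordinate: keep only the fractional
   part, so the result lies in [0.0, 1.0). Valid for coordinates whose
   integer part fits in a long. */
float wrap_normalized(float x)
{
    float frac = x - (float)(long)x;  /* the cast truncates toward zero */
    if (frac < 0.0f)
        frac += 1.0f;
    return frac;
}

/* Normalized-float read of an unsigned 8-bit texel: maps the full
   integer range 0..255 onto [0.0, 1.0]. */
float normalize_u8(unsigned char v)
{
    return (float)v / 255.0f;
}
```

These helpers reproduce the examples in the text: 1.25 wraps to 0.25, -1.25 wraps to 0.75, and the texel value 0xFF reads as 1.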

2.4.4. Device runtime component
The device runtime component can only be used in device functions.
2.4.4.1. Mathematical functions
For some of the functions of Table B-1, a less accurate but faster version exists in the device runtime component; it has the same name prefixed with __ (such as __sin(x)). These intrinsic functions are listed in Table B-2, along with their respective error bounds. The compiler has an option (-use_fast_math) to force every function to compile to its less accurate counterpart if one exists.
2.4.4.2. Synchronization function
void __syncthreads();

synchronizes all the threads in a block. Once all threads have reached this point, execution resumes normally.
__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing the threads between these accesses.
__syncthreads() is allowed in conditional code, but only if the condition evaluates identically across the entire thread block; otherwise the code execution is likely to hang or produce unintended side effects.
2.4.4.3. Type conversion functions

The suffixes in the functions below indicate IEEE-754 rounding modes:

- rn is round-to-nearest-even
- rz is round-towards-zero
- ru is round-up (towards positive infinity)
- rd is round-down (towards negative infinity)

int __float2int_[rn,rz,ru,rd](float);

converts the floating-point argument to an integer, using the specified rounding mode.

unsigned int __float2uint_[rn,rz,ru,rd](float);

converts the floating-point argument to an unsigned integer, using the specified rounding mode.

float __int2float_[rn,rz,ru,rd](int);

converts the integer argument to a floating-point number, using the specified rounding mode.

float __uint2float_[rn,rz,ru,rd](unsigned int);

converts the unsigned integer argument to a floating-point number, using the specified rounding mode.
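The four rounding modes can be illustrated on the host for the float-to-int direction. This is a plain C sketch with hypothetical names that mirrors the meaning of the rn, rz, ru, and rd suffixes; it is not the device implementation, and it assumes the input is finite and within the range of int.

```c
/* Round toward zero: C's float-to-int conversion already truncates. */
int float2int_rz(float x)
{
    return (int)x;
}

/* Round down, toward negative infinity. */
int float2int_rd(float x)
{
    int i = (int)x;
    return ((float)i > x) ? i - 1 : i;
}

/* Round up, toward positive infinity. */
int float2int_ru(float x)
{
    int i = (int)x;
    return ((float)i < x) ? i + 1 : i;
}

/* Round to nearest, with ties going to the even integer. */
int float2int_rn(float x)
{
    int lo = float2int_rd(x);
    float diff = x - (float)lo;   /* in [0, 1) */
    if (diff > 0.5f) return lo + 1;
    if (diff < 0.5f) return lo;
    return (lo % 2 == 0) ? lo : lo + 1;
}
```

The tie-breaking rule is what distinguishes rn from schoolbook rounding: 2.5 rounds to 2 while 3.5 rounds to 4.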


2.4.4.4. Type casting functions
float __int_as_float(int);

performs a floating-point type cast on the integer argument, leaving the bit pattern unchanged. For example, __int_as_float(0xC0000000) is equal to -2.
int __float_as_int(float);

performs an integer type cast on the floating-point argument, leaving the bit pattern unchanged. For example, __float_as_int(1.0f) is equal to 0x3f800000.
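The same reinterpretation of bits can be reproduced on the host. The sketch below is plain C with hypothetical function names, and it assumes that float is a 32-bit IEEE-754 single-precision type, as on common platforms.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets the 32 bits of an integer as a float; no numeric
   conversion takes place. */
float int_as_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reinterprets the 32 bits of a float as an unsigned integer. */
uint32_t float_as_int(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Copying the bytes with memcpy avoids the undefined behavior that pointer casts between unrelated types can cause in C. The two examples from the text hold: the pattern 0xC0000000 is the float -2, and the float 1.0f has the pattern 0x3f800000.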
2.4.4.5. Texture functions

2.4.4.5.1. Texturing from device memory

When texturing from device memory, the texture is accessed with the tex1Dfetch() family of functions; for example:

template<class Type>
Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
float tex1Dfetch(texture<unsigned char, 1, cudaReadModeNormalizedFloat> texRef, int x);
float tex1Dfetch(texture<signed char, 1, cudaReadModeNormalizedFloat> texRef, int x);
float tex1Dfetch(texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef, int x);
float tex1Dfetch(texture<signed short, 1, cudaReadModeNormalizedFloat> texRef, int x);

These functions fetch the region of linear memory bound to the texture reference texRef using texture coordinate x. No texture filtering or addressing modes are supported. For integer types, these functions may optionally promote the integer to 32-bit floating point. Besides the functions shown above, 2- and 4-tuples are supported; for example:
float4 tex1Dfetch(texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef, int x);

fetches the linear memory bound to the texture reference texRef using texture coordinate x.

2.4.4.5.2. Texturing from CUDA arrays

When texturing from CUDA arrays, the texture is accessed with tex1D() or tex2D():

template<class Type, enum cudaTextureReadMode readMode>
Type tex1D(texture<Type, 1, readMode> texRef, float x);
template<class Type, enum cudaTextureReadMode readMode>
Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);

These functions fetch the CUDA array bound to the texture reference texRef using texture coordinates x and y. A combination of the texture reference's immutable (compile-time) and mutable (runtime) attributes determines how the coordinates are interpreted, what processing occurs during the texture fetch, and the return value delivered by the texture fetch.
2.4.4.6. Atomic functions

Atomic functions are only available on devices of compute capability 1.1. They are listed in Appendix C of [99]. An atomic function performs a read-modify-write atomic operation on one 32-bit word residing in global memory. For example, atomicAdd() reads a 32-bit word at some address in global memory, adds an integer to it, and writes the result back to the same address. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads. In other words, no other thread can access this address until the operation is complete. Atomic operations only work with 32-bit signed and unsigned integers.
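A host-side analogue of this read-modify-write behavior is available in standard C11 through <stdatomic.h>. The sketch below is not CUDA code; the wrapper name `atomic_add_sketch` is hypothetical, and the only library calls used are the standard atomic_fetch_add, atomic_init, and atomic_load.

```c
#include <stdatomic.h>

/* Atomically adds val to *address and returns the value that was stored
   there before the addition. The read, the add, and the write happen as
   one indivisible step with respect to other threads. */
int atomic_add_sketch(atomic_int *address, int val)
{
    return atomic_fetch_add(address, val);
}
```

If several host threads call this function on the same counter, no update is lost, although the order in which the additions take effect is not determined, which matches the serialization behavior described for atomic instructions in Section 2.3.2.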

2.5. Performance guidelines


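2.5.1. Instruction performance
To process an instruction for a warp of threads, a multiprocessor must:
- Read the instruction operands for each thread of the warp,
- Execute the instruction,
- Write the result for each thread of the warp.

Therefore, the effective instruction throughput depends on the nominal instruction throughput as well as on the memory latency and bandwidth. It is maximized by:
- Minimizing the use of instructions with low throughput;
- Maximizing the use of the available memory bandwidth for each category of memory;
- Allowing the thread scheduler to overlap memory transactions with mathematical computations as much as possible, which requires that:
+ The program executed by the threads is of high arithmetic intensity, that is, has a high number of arithmetic operations per memory operation;
+ There are many threads that can be run concurrently.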
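2.5.1.1. Instruction throughput

2.5.1.1.1. Arithmetic instructions

To issue one instruction for a warp, a multiprocessor takes:
- 4 clock cycles for floating-point add, floating-point multiply, floating-point multiply-add, integer add, bitwise operations, compare, min, max, and type conversion instructions;
- 16 clock cycles for reciprocal, reciprocal square root, and __log(x) (see Table B-2 of [99]).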

32-bit integer multiplication takes 16 clock cycles, but __mul24 and __umul24 (see Appendix B of [99]) provide signed and unsigned 24-bit integer multiplication in 4 clock cycles. On future architectures, however, __[u]mul24 will be slower than 32-bit integer multiplication, so it is advisable to provide two kernels, one using __[u]mul24 and the other using generic 32-bit integer multiplication, to be called appropriately by the application. Integer division and modulo operations are particularly costly and should be avoided or replaced with bitwise operations whenever possible: if n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler performs these conversions if n is a literal. Other functions take more clock cycles because they are implemented as combinations of several instructions. Floating-point square root is implemented as a reciprocal square root followed by a reciprocal, so it takes 32 clock cycles for a warp. Floating-point division takes 36 clock cycles, but __fdividef(x, y) provides a faster version at 20 clock cycles.
__sin(x), __cos(x), and __exp(x) take 32 clock cycles.
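The shift and mask replacements for division and modulo by a power of 2 can be checked with a short plain C sketch. The helper names are hypothetical, and the equivalence is shown here for unsigned operands only; for negative signed values, C's division and remainder round toward zero and do not match a shift and mask.

```c
#include <assert.h>

/* log2 of a power of two, found by counting shifts. */
unsigned int log2_pow2(unsigned int n)
{
    assert(n != 0 && (n & (n - 1)) == 0);  /* n must be a power of 2 */
    unsigned int k = 0;
    while ((n >> k) != 1u)
        ++k;
    return k;
}

/* i / n for a power-of-two n, computed with a shift. */
unsigned int div_pow2(unsigned int i, unsigned int n)
{
    return i >> log2_pow2(n);
}

/* i % n for a power-of-two n, computed with a mask. */
unsigned int mod_pow2(unsigned int i, unsigned int n)
{
    assert(n != 0 && (n & (n - 1)) == 0);  /* n must be a power of 2 */
    return i & (n - 1);
}
```

In a kernel, log2(n) would normally be a precomputed constant, so the division costs a single shift; computing it in a loop, as here, is only for the sake of a self-contained check.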

Sometimes the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for:
- Functions operating on char or short, whose operands generally need to be converted to int;
- Double-precision floating-point constants used as input to single-precision floating-point computations;
- Single-precision floating-point variables used as input parameters to the double-precision versions of the mathematical functions.

The last two cases can be avoided by using:
- Single-precision floating-point constants, defined with an f suffix, such as 3.141592653589793f, 1.0f, 0.5f;
- The single-precision versions of the mathematical functions, also defined with an f suffix, such as sinf(), logf(), expf().

For single-precision code, it is recommended to use the float type and the single-precision mathematical functions. When compiling for devices without native double-precision support, such as devices of compute capability 1.x, the double type is demoted to float by default and the double-precision mathematical functions are mapped to their single-precision equivalents. On future devices that support double precision, however, these functions will map to double-precision implementations.
2.5.1.1.2. Control flow instructions
Any flow control instruction (if, switch, do, for, while) can significantly affect the effective instruction throughput by causing threads of the same warp to diverge, that is, to follow different execution paths. If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for the warp. When all the different execution paths have completed, the threads converge back to the same execution path.

To obtain the best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is deterministic, as described in Section 2.3.2. A trivial example is when the controlling condition depends only on (threadIdx / WSIZE), where WSIZE is the warp size. In this case, no warp diverges, since the controlling condition is perfectly aligned with the warps.

Sometimes the compiler may unroll loops, or it may optimize out if or switch statements by using branch predication instead, as detailed below. In these cases, no warp can ever diverge. When branch predication is used, none of the instructions whose execution depends on the controlling condition is skipped. Instead, each of them is associated with a per-thread condition code, or predicate, that is set to true or false based on the controlling condition, and although each of these instructions is scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and do not evaluate addresses or read operands either. The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is less than or equal to a certain threshold: if the compiler determines that the condition is likely to produce many divergent warps, this threshold is 7; otherwise it is 4.
2.5.1.1.3. Memory instructions
Memory instructions include any instruction that reads from or writes to shared or global memory. A multiprocessor takes 4 clock cycles to issue one memory instruction for a warp. When accessing global memory, there are, in addition, 400 to 600 clock cycles of memory latency. As an example, the assignment in the following sample code:
__shared__ float shared[32];
__device__ float device[32];
shared[threadIdx.x] = device[threadIdx.x];

Issuing the read from global memory takes 4 clock cycles and issuing the write to shared memory takes 4 clock cycles, but above all it takes 400 to 600 clock cycles to read a float from global memory. Much of this global memory latency can be hidden by the thread scheduler if there are enough independent arithmetic instructions that can be issued while waiting for the memory access to complete.

2.5.1.1.4. Synchronization instruction
__syncthreads takes 4 clock cycles to issue for a warp if no thread has to wait for any other thread.
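The effect of the branch condition on warp divergence (Section 2.5.1.1.2) can be illustrated on the host with a small C++ sketch. It is not CUDA code and does not model the hardware scheduler: it only counts, for one block of threads, how many warps contain threads that evaluate a condition differently. The names countDivergentWarps and kWarpSize are illustrative assumptions; the warp size of 32 is the value given in the text for compute capability 1.x.

```cpp
#include <functional>

// Warp size quoted in the text for devices of compute capability 1.x.
constexpr int kWarpSize = 32;

// Counts the warps of a block in which at least two threads evaluate
// `cond` differently, i.e. the warps that would diverge on that branch.
// Threads are assumed to be assigned to warps in order of thread ID.
int countDivergentWarps(int blockSize, const std::function<bool(int)>& cond) {
    int divergent = 0;
    for (int base = 0; base < blockSize; base += kWarpSize) {
        const bool first = cond(base);
        for (int tid = base + 1; tid < base + kWarpSize && tid < blockSize; ++tid) {
            if (cond(tid) != first) {
                ++divergent;   // this warp follows two execution paths
                break;
            }
        }
    }
    return divergent;
}
```

For a block of 256 threads, a condition such as tid % 2 == 0 splits every one of the 8 warps, whereas (tid / kWarpSize) % 2 == 0 is uniform inside each warp and splits none.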


2.5.1.2. Memory bandwidth

2.5.1.2.1. Global memory

The global memory space is not cached, so it is all the more important to follow the right access pattern to obtain maximum bandwidth, especially given how costly accesses to device memory are. First, the device can read 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction. For example:
__device__ type device[32];
type data = device[tid];

For the code above to compile to a single load instruction, type must be such that sizeof(type) equals 4, 8, or 16, and variables of type type must be aligned to 4, 8, or 16 bytes (that is, the 2, 3, or 4 least significant bits of their address must be zero). This alignment is provided automatically for the built-in types of Section 4.3.1.1, such as float2 or float4. For structures, the size and alignment requirements can be enforced by the compiler with the alignment specifiers __align__(8) or __align__(16), for example:
struct __align__(8) {
    float a;
    float b;
};
or
struct __align__(16) {
    float a;
    float b;
    float c;
    float d;
};

For structures larger than 16 bytes, the compiler generates several load instructions. To keep the number of instructions to a minimum, such structures should be defined with __align__(16), for example:
struct __align__(16) {
    float a;
    float b;
    float c;
    float d;
    float e;
};

This structure is compiled into two 128-bit load instructions instead of five 32-bit load instructions.

Second, the global memory addresses accessed simultaneously by the threads of a half-warp during the execution of a single read or write instruction should be arranged so that the accesses can be coalesced into a single access to one contiguous, aligned region of memory. More precisely, in each half-warp, thread number N within the half-warp should access the address
HalfWarpBaseAddress + N

where HalfWarpBaseAddress is of type type* and type meets the size and alignment requirements discussed above. Moreover, HalfWarpBaseAddress should be aligned to 16*sizeof(type) bytes; in other words, its log2(16*sizeof(type)) least significant bits should be zero. Any address BaseAddress of a variable residing in global memory, or returned by one of the memory allocation routines mentioned in D.3 or E.6, is always aligned to at least 256 bytes, so to satisfy the alignment constraint, HalfWarpBaseAddress - BaseAddress should be a multiple of 16*sizeof(type).

Note that if a half-warp fulfills all the requirements above, the per-thread memory accesses are coalesced even if some threads of the half-warp do not actually access memory.

It is advisable to fulfill the coalescing requirements for the entire warp rather than only for each half-warp separately, because future devices will require it for proper coalescing.

A common global memory access pattern is when each thread with thread ID tid accesses one element of an array located at address BaseAddress of type type*, using the following address:
BaseAddress + tid

To obtain coalesced accesses, type must meet the size and alignment requirements discussed above. In particular, this means that if type is a structure larger than 16 bytes, it should be split into several structures that meet these requirements, and the data should be laid out in memory as a list of several arrays of these structures instead of a single array of type type*.

Another common global memory access pattern is when each thread of index (tx,ty) accesses one element of a 2D array located at address BaseAddress of type type* and of width width, using the following address:
BaseAddress + width * ty + tx

In that case, memory accesses are coalesced for all half-warps of the thread block only if:
+ the width of the thread block is a multiple of half the warp size;
+ width is a multiple of 16.

In particular, this means that an array whose width is not a multiple of 16 is accessed more efficiently if it is actually allocated with a width rounded up to the next multiple of 16 and its rows are padded accordingly. The cuMemAllocPitch() and cudaMallocPitch() functions and the associated memory copy functions described in D.3 and E.6 let developers write hardware-independent code that allocates arrays satisfying these conditions.

2.5.1.2.2. Constant memory
The constant memory space is cached, so a read from constant memory costs one read from device memory only on a cache miss; otherwise it costs just one read from the constant cache. For all threads of a half-warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address. The cost grows roughly in proportion to the number of different addresses read by the threads. It is preferable for all threads of the entire warp to read the same address, rather than only all threads within each half-warp.

2.5.1.2.3. Texture memory
The texture memory space is cached, so a texture fetch costs one read from device memory only on a cache miss; otherwise it costs just one read from the texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together achieve the best performance.
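Returning to the alignment and padding rules for global memory in Section 2.5.1.2.1, the arithmetic involved can be sketched in host-side C++. This is only an illustration of the rules quoted in the text: the function names are hypothetical, and the sketch does not reproduce what cudaMallocPitch() actually returns, since that function chooses the pitch itself.

```cpp
#include <cstddef>
#include <cstdint>

// Rounds a row width (in elements) up to the next multiple of 16,
// which is the padding rule stated in the text for 2D arrays.
constexpr std::size_t paddedWidth(std::size_t width) {
    return (width + 15) / 16 * 16;
}

// Element offset of (tx, ty) in a row-major 2D array whose rows are
// stored with the padded width: paddedWidth * ty + tx.
constexpr std::size_t elementOffset(std::size_t tx, std::size_t ty, std::size_t width) {
    return paddedWidth(width) * ty + tx;
}

// True if a half-warp base address meets the 16*sizeof(type) alignment
// constraint; elemSize plays the role of sizeof(type).
inline bool isHalfWarpAligned(std::uintptr_t halfWarpBaseAddress, std::size_t elemSize) {
    return halfWarpBaseAddress % (16 * elemSize) == 0;
}
```

For example, a row of 100 floats would be stored with a padded width of 112 elements, so that every row starts at a multiple of 16 elements from the base address.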

Reading device memory through texture fetches can be an advantageous alternative to reading it from global or constant memory, as detailed in Section V.4.

2.5.1.2.4. Shared memory
Because it is on-chip, the shared memory space is much faster than the local and global memory spaces. In fact, for all threads of a warp, accessing shared memory is as fast as accessing a register as long as there are no bank conflicts between the threads, as detailed below.

To achieve high memory bandwidth, shared memory is divided into equally sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct banks can therefore be serviced simultaneously, yielding an effective bandwidth that is n times the bandwidth of a single module.

However, if two addresses of a memory request fall in the same bank, there is a bank conflict and the accesses to that bank have to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, which decreases the effective bandwidth by a factor equal to the number of separate requests. If the number of separate requests is n, the initial memory request is said to cause an n-way bank conflict.

To obtain maximum performance, it is therefore important to understand how memory addresses map to banks, so that memory requests can be scheduled to minimize bank conflicts.

In the shared memory space, the banks are organized so that successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per two clock cycles.

For devices of compute capability 1.x, the warp size is 32 and the number of banks is 16 (see Section V.1); a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half. As a consequence, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.

2.5.2. Number of threads per block
Given a total number of threads per grid, the number of threads per block, or equivalently the number of blocks, should be chosen to maximize the utilization of the available computing resources. This means that there should be at least as many blocks as there are multiprocessors in the device.
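As a side illustration of the bank mapping described in Section 2.5.1.2.4, the following host-side C++ sketch estimates the degree of bank conflict for one half-warp. It is a simplified model under stated assumptions: 16 banks and 32-bit words as given in the text, a conflict degree defined as the largest number of distinct words that fall in one bank, and no attempt to model finer hardware behavior. The function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kNumBanks = 16;      // banks on compute capability 1.x (from the text)
constexpr std::size_t kHalfWarpSize = 16;  // half of the 32-thread warp

// Largest number of distinct 32-bit words that map to the same bank.
// Successive words go to successive banks, so word w lives in bank w % 16.
// A result of 1 means the request is conflict-free in this model.
int conflictDegree(const std::vector<std::size_t>& wordIndices) {
    std::array<std::set<std::size_t>, kNumBanks> wordsPerBank;
    for (std::size_t w : wordIndices) {
        wordsPerBank[w % kNumBanks].insert(w);
    }
    std::size_t worst = 1;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}

// Conflict degree when thread t of a half-warp accesses word t * strideWords.
int conflictDegreeForStride(std::size_t strideWords) {
    std::vector<std::size_t> words;
    for (std::size_t t = 0; t < kHalfWarpSize; ++t) {
        words.push_back(t * strideWords);
    }
    return conflictDegree(words);
}
```

In this model a stride of one word, or any odd stride, is conflict-free, a stride of two words gives a 2-way conflict, and a stride of 16 words sends all 16 threads to the same bank.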

Furthermore, running only one block per multiprocessor forces the multiprocessor to idle during thread synchronization and during device memory reads if there are not enough threads per block to cover the load latency. It is therefore usually better to have two or more blocks active on each multiprocessor, so that blocks that are waiting can overlap with blocks that can run. For this to happen, not only should there be at least twice as many blocks as there are multiprocessors in the device, but the amount of shared memory allocated per block should also be at most half the total amount of shared memory available per multiprocessor. Additional thread blocks stream through the device in pipeline fashion and amortize the overhead even further.

With a sufficiently large number of blocks, the number of threads per block should be chosen as a multiple of the warp size, to avoid wasting computing resources on under-populated warps, or better, as a multiple of 64 for the reasons given in Section V.1.2.5. Allocating more threads per block is better for efficient time slicing, but the more threads per block, the fewer registers are available per thread. This can prevent a kernel invocation from succeeding if the kernel compiles to more registers than the execution configuration allows. For devices of compute capability 1.x, the number of registers available per thread is given by:
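$$\frac{R}{B \times \mathrm{ceil}(T, 32)}$$

where:
R is the total number of registers per multiprocessor (Appendix A),
B is the number of active blocks per multiprocessor,
T is the number of threads per block,
ceil(T, 32) is T rounded up to the nearest multiple of 32.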

64 threads per block is the minimum, and makes sense only if there are multiple active blocks per multiprocessor. 192 or 256 threads per block is better and usually leaves enough registers for the kernel to compile. The number of blocks per grid should be at least 100 if the program is to scale to future devices; 1000 blocks will scale across several device generations. The ratio of the number of active warps per multiprocessor to the maximum number of active warps is called the multiprocessor occupancy.

2.5.3. Data transfer between host and device
The bandwidth between the device and device memory is much higher than the bandwidth between device memory and host memory. One should therefore strive to minimize data transfer between the host and the device. For example, intermediate data structures can be created in device memory, operated on by the device, and destroyed without ever being copied to host memory. In addition, because of the overhead associated with each transfer, batching many small transfers into one large transfer performs better than making each transfer separately.

2.5.4. Benefits of texture memory
Reading device memory through texture fetches offers several benefits over reading from global or constant memory:
+ The data is cached, so it can be accessed much faster.
+ Texture fetches are not subject to the constraints on memory access patterns that global or constant memory reads must respect.
+ The latency of addressing calculations is better hidden, which can improve performance for applications that access data randomly.
+ Packed data can be broadcast to separate variables in a single operation.
+ 8-bit and 16-bit integer input data can optionally be converted to 32-bit floating-point values in the range [0, 1].

If the texture is a CUDA array (see Section 2.3.3.4.2), the hardware provides further capabilities that can be useful for different applications, especially image processing:

Feature | Useful for | Caveat
Filtering | Fast, low-precision interpolation between texels | Only valid if the texture reference returns floating-point data
Normalized texture coordinates | Resolution-independent coding |
Addressing modes | Automatic handling of boundary cases | Can only be used with normalized texture coordinates

Chapter 3. APPLYING THE GPU TO THE N-BODY PROBLEM AND EXPERIMENTAL EVALUATION

3.1. The N-body simulation problem
The N-body problem is a representative problem in high-performance computing. It is widely used in simulations in physics, chemistry, and astronomy, and its computational load is very large. An N-body simulation models a very large number of particles under the influence of physical forces, usually gravity. In cosmology, it is used to study processes of nonlinear structure formation, such as the formation of galaxy filaments and galaxy halos from dark matter. Direct N-body simulations are used to study the dynamical evolution of star clusters.

In many cases, the size of an astrophysical N-body simulation is limited by the available computing resources. The simulation of a purely gravitational N-body system is a typical example. Because gravity is a long-range interaction, the cost of computing the interactions between all particles is O(N^2) per time step for the simplest scheme, where N is the number of particles in the system. The complexity can be reduced from O(N^2) to O(N log N) by using approximate algorithms such as the Barnes-Hut tree algorithm [11], but the scaling coefficient is quite large. The calculation of the interactions between particles is therefore usually the most expensive part of the whole computation, and it limits the number of particles that can be handled.

Smoothed Particle Hydrodynamics (SPH) [3][22], in which particles represent elements of a fluid (gas), is another example. In SPH calculations, the hydrodynamic equations are expressed through short-range interactions between particles. The computational cost of SPH is relatively high, because the average number of particles interacting with one particle is fairly large, typically around 50, and the calculation of each pairwise interaction is somewhat more complex than for gravity.

In astrophysical N-body simulations, the most important interaction is gravity. On a computer, the gravitational force on particle i is obtained by summing the contributions of the other particles j.

Astrophysics is not the only application of N-body simulation. Molecular dynamics (MD) and the boundary element method (BEM) are examples of numerical methods in which each element of the system in principle interacts with all the other elements of the system. In both cases, approaches similar to the Barnes-Hut tree algorithm or FMM [21] reduce the computational cost, but the interaction calculation still dominates the total cost.

A typical approach to accelerating N-body simulation is to build a special-purpose computer for the interaction calculation. Two characteristics of the interaction calculation make it well suited to this approach.

First, the calculation of each pairwise interaction is relatively simple. For gravitational interaction, the total number of floating-point operations (counting all operations, including the square root and the division) is only around 20. It is therefore quite feasible to design a fully pipelined, hardwired processor dedicated to the calculation of gravitational interactions. For other applications such as SPH or molecular dynamics the interaction calculation is more complex, but the hardware approach remains feasible.

Second, the interaction in its simplest form is all-to-all. In other words, each particle in the system interacts with all the other particles. There is therefore a great deal of parallelism available. In particular, hardware can be designed to compute the force from one particle on many other particles in parallel, which reduces the required memory bandwidth. Of course, if the interaction is short-range, clever schemes can reduce the computational cost from O(N^2) to O(N), and the reduction in memory bandwidth is then less effective than in the true O(N^2) case.

Figure 19: Image of an N-body simulation [8]

3.2. Implementing the N-body problem on the CPU

N-body simulations are of two kinds: direct simulation and approximate simulation. This section describes direct simulation using the simplest method, the particle-particle (PP) method. The author implemented an N-body simulation running on the CPU with this method and evaluated it experimentally for comparison with the simulation on the GPU. Details are given in the section on experimental results.

The particle-particle method is based on a time integration algorithm and a force calculation algorithm. The Verlet algorithm (1967), described below, is the most popular and the simplest time integration algorithm. Other time integration algorithms, such as leap-frog, Beeman (1976), and Tuckerman et al., are algorithms with multiple time steps.

3.2.1. Verlet time integration algorithm
Let r(t), v(t), and a(t) denote the position, velocity, and acceleration of a particle P at time t. The Verlet algorithm is expressed as:

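$$\mathbf{r}(t + \Delta t) = \mathbf{r}(t) + \Delta t \, \mathbf{v}(t + \Delta t/2) \qquad (1)$$

$$\mathbf{v}(t + \Delta t/2) = \mathbf{v}(t - \Delta t/2) + \Delta t \, \mathbf{a}(t) \qquad (2)$$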
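Expressions (1) and (2) translate almost literally into code. The following C++ sketch advances a single particle in one dimension; the type State1D and the function verletStep are illustrative names, and the acceleration is passed in as a plain number rather than computed from forces.

```cpp
// Position at time t and velocity at time t - dt/2 (the staggered
// velocity that appears on the right-hand side of expression (2)).
struct State1D {
    double r;
    double vHalf;
};

// One step of the scheme: first expression (2), then expression (1).
// `a` is the acceleration a(t) at the current position.
State1D verletStep(State1D s, double a, double dt) {
    const double vNext = s.vHalf + dt * a;   // v(t + dt/2) = v(t - dt/2) + dt * a(t)
    const double rNext = s.r + dt * vNext;   // r(t + dt)   = r(t) + dt * v(t + dt/2)
    return {rNext, vNext};
}
```

Starting from r = 0 and a staggered velocity of 0, with a constant acceleration of 2 and a time step of 0.5, the first step gives a velocity of 1 and a position of 0.5.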

3.2.2. Basic force and potential formulas

To keep the formulas simple, assume N = 2, and let r1, v1, m1 and r2, v2, m2 be the positions, velocities, and masses of the two particles P1 and P2 of a 2-body system. The force F1 exerted on particle P1 by P2 is:

The formula for the force F2 is analogous. Using Newton's second law, F = ma, we obtain the acceleration of particle P1:

where G is the gravitational constant and r̂12 is the unit vector:

The potential at position r1 due to P2 is:

In production simulations, researchers commonly use a softening parameter to avoid the very large forces and accelerations that arise when particles come very close to each other. In an N-body system, the softening parameter is usually defined as:

With the softening parameter, expressions (4) and (5) become:

and:

The kinetic energy Ki of particle Pi, with i = 1, 2, is:

and the potential energy Wi of particle Pi is:

The kinetic energy K and the potential energy W of the 2-body system are:

and the total energy of the system is:
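For orientation, the standard textbook forms of these quantities for a 2-body system can be sketched as follows. This is an assumed formulation (Plummer-type softening with parameter $\varepsilon$, separation vector $\mathbf{r}_{12} = \mathbf{r}_2 - \mathbf{r}_1$, and the usual factor 1/2 in the per-particle potential energy), not necessarily the exact expressions or the numbering used in this thesis. The definition of the softening parameter itself, expression (6), is not covered by this sketch.

```latex
\begin{align*}
\mathbf{r}_{12} &= \mathbf{r}_2 - \mathbf{r}_1, \qquad
r_{12} = |\mathbf{r}_{12}|, \qquad
\hat{\mathbf{r}}_{12} = \frac{\mathbf{r}_{12}}{r_{12}} \\
\mathbf{F}_1 &= \frac{G m_1 m_2}{r_{12}^2}\,\hat{\mathbf{r}}_{12}, \qquad
\mathbf{a}_1 = \frac{\mathbf{F}_1}{m_1} = \frac{G m_2}{r_{12}^2}\,\hat{\mathbf{r}}_{12} \\
\Phi(\mathbf{r}_1) &= -\frac{G m_2}{r_{12}} \\
\mathbf{a}_1 &= \frac{G m_2\,\mathbf{r}_{12}}{\left(r_{12}^2 + \varepsilon^2\right)^{3/2}}, \qquad
\Phi(\mathbf{r}_1) = -\frac{G m_2}{\left(r_{12}^2 + \varepsilon^2\right)^{1/2}}
\quad \text{(with softening)} \\
K_i &= \tfrac{1}{2}\, m_i\, |\mathbf{v}_i|^2, \qquad
W_i = \tfrac{1}{2}\, m_i\, \Phi(\mathbf{r}_i) \\
K &= K_1 + K_2, \qquad W = W_1 + W_2, \qquad E = K + W
\end{align*}
```

With these conventions the force on P1 points towards P2, and the total potential energy W reduces to $-G m_1 m_2 / r_{12}$ when $\varepsilon = 0$.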

3.2.3. N-body simulation algorithm

1. Evaluate the initial acceleration of each particle using expression (7).
2. Evaluate the initial total energy E0 of the system using expressions (2), (8)-(13).
3. Determine the softening parameter using expression (6). Set Δt = […]/2 and t = 0.
4. Determine the number of time steps and the upper limit tu of t.
5. for // loop over the time steps
6.     Compute the velocity of each particle at time t + Δt/2 using expression (2).
7.     Compute the new position of each particle at time t + Δt using expression (1).
8.     Compute the acceleration of each particle at time t + Δt using expression (7).
9.     Compute the velocity of each particle at time t + Δt using expression (2).
10.    (Optional) Compute the total energy Et of the system using expressions (8)-(13).
11.    (Optional) Compute the relative error (Et - E0)/E0 if the total energy was computed in the previous step.
12.    (Optional) Compute 2K/|W| to monitor the equilibrium of the system.
13.    t = t + Δt/2
14.    If t > tu or the number of time steps has reached the preset count, break out of the loop.
15. endfor // end of loop
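The structure of the algorithm above can be sketched in C++ as follows. This is not the author's implementation: it is a minimal direct particle-particle simulation that assumes Plummer-type softening and realizes steps 6 to 9 in the common kick-drift-kick form of leapfrog (two half-step velocity updates around one position update). All type and function names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };
struct Body { Vec3 r; Vec3 v; double m; };  // position, velocity, mass

// Direct particle-particle accelerations: every body feels every other body,
// so the cost is O(N^2). eps is the softening length (assumed Plummer form).
std::vector<Vec3> accelerations(const std::vector<Body>& b, double G, double eps) {
    std::vector<Vec3> a(b.size(), Vec3{0.0, 0.0, 0.0});
    for (std::size_t i = 0; i < b.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            if (i == j) continue;
            const double dx = b[j].r.x - b[i].r.x;
            const double dy = b[j].r.y - b[i].r.y;
            const double dz = b[j].r.z - b[i].r.z;
            const double d2 = dx * dx + dy * dy + dz * dz + eps * eps;
            const double s = G * b[j].m / (d2 * std::sqrt(d2));
            a[i].x += s * dx;
            a[i].y += s * dy;
            a[i].z += s * dz;
        }
    }
    return a;
}

// Total energy E = K + W, with each pair counted once in W.
double totalEnergy(const std::vector<Body>& b, double G, double eps) {
    double e = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        const Vec3& v = b[i].v;
        e += 0.5 * b[i].m * (v.x * v.x + v.y * v.y + v.z * v.z);
        for (std::size_t j = i + 1; j < b.size(); ++j) {
            const double dx = b[j].r.x - b[i].r.x;
            const double dy = b[j].r.y - b[i].r.y;
            const double dz = b[j].r.z - b[i].r.z;
            e -= G * b[i].m * b[j].m / std::sqrt(dx * dx + dy * dy + dz * dz + eps * eps);
        }
    }
    return e;
}

// One time step: half kick, drift, new accelerations, half kick.
// `a` must hold the accelerations for the current positions; it is updated in place.
void step(std::vector<Body>& b, std::vector<Vec3>& a, double G, double eps, double dt) {
    for (std::size_t i = 0; i < b.size(); ++i) {
        b[i].v.x += 0.5 * dt * a[i].x;
        b[i].v.y += 0.5 * dt * a[i].y;
        b[i].v.z += 0.5 * dt * a[i].z;
        b[i].r.x += dt * b[i].v.x;
        b[i].r.y += dt * b[i].v.y;
        b[i].r.z += dt * b[i].v.z;
    }
    a = accelerations(b, G, eps);
    for (std::size_t i = 0; i < b.size(); ++i) {
        b[i].v.x += 0.5 * dt * a[i].x;
        b[i].v.y += 0.5 * dt * a[i].y;
        b[i].v.z += 0.5 * dt * a[i].z;
    }
}
```

A typical driver computes the accelerations and the initial energy once, then calls step in a loop and compares totalEnergy with its initial value, which corresponds to the optional energy checks of steps 10 and 11. With eps set to zero the bodies must not coincide, since the pairwise distance would then be zero.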

3.3. Implementing the N-body problem on the GPU

The N-body test program on the GPU is based on [8]. The author studied the sample source code, adjusted the program parameters, and installed it in the test environment. The steps are:
1. Install the NVIDIA CUDA toolkit, version 1.0 or later. It can be downloaded from http://developer.nvidia.com/cuda
2. Required GPU: NVIDIA 8-Series or newer.
3. Compile the program on Linux.

Makefile used to build the program:


################################################################################
#
# Build script for project
#
################################################################################

# Add source files here
EXECUTABLE := nbody
# Cuda source files (compiled with cudacc)
CUFILES := bodysystemcuda.cu
# C/C++ source files (compiled with gcc / c++)
CCFILES := \
    nbody_gold.cpp bodysystemcpu.cpp bodysystemcuda.cpp nbody.cpp \
    render_particles.cpp \

USEGLLIB := 1
USEPARAMGL := 1
USEGLUT := 1

################################################################################
# Rules and targets

include ../../common/common.mk

In the root directory, type:
make; make dbg=1; make -f Makefile_paramgl; make -f Makefile_paramgl dbg=1

In the directory projects/nbody, type:
make; make dbg=1; make emu=1; make emu=1 dbg=1

The program is run from the directory:
bin/linux/release/nbody

The program has three run modes: interactive (graphical), benchmark, and test.
+ Interactive mode: runs with graphics, letting the user watch the N-body simulation and the moving particles.
+ Test mode: the simulation is run on both the CPU and the GPU. If the GPU result is within the allowed tolerance of the CPU result, "Test PASSED" is displayed; otherwise "Test FAILED" is displayed.
+ Benchmark mode: only the interaction calculations are run, with no 3D rendering and no wait time. The report includes the total time, the average time, the average number of particle interactions per second, and the GFLOP/s rate. The author chose this mode to measure the performance of the GPU against the CPU.

The command is as follows:
nohup echo "8388608 start" >> linh_result.txt && date >> linh_result.txt && nbody -benchmark -n=8388608 >> linh_result.txt && date >> linh_result.txt && echo "8388608 end" >> linh_result.txt &

where:
+ linh_result.txt: the file that holds the program output
+ nohup: runs the simulation on the server without requiring interaction with the client
+ date: records the system time when the program starts and when it ends
+ nbody -benchmark -n=8388608: runs the simulation in benchmark mode with n = 8388608 particles (8192K)

3.4. Experiments
3.4.1. Test environment
The two simulations described in Sections 3.2 and 3.3 were installed and tested on a PC with the following configuration:
CPU: Intel(R) Core(TM)2 Quad CPU @ 2.66GHz
Cache: 4096 KB
RAM: 2 GB
GPU: Nvidia GeForce 8800 GTX
Operating system: Linux

3.4.2. Experimental results
3.4.2.1.1. Comparison of execution time on the GPU and the CPU

Particles (K) | GPU time (s) | CPU time (s)
1 | 0.348 | 0.035
2 | 0.687 | 0.141
4 | 0.7 | 1
8 | 1 | 2
16 | 1 | 10
32 | 1 | 36
64 | 1 | 148
128 | 3 | 608
256 | 12 | 2940
512 | 48 | 10009
1024 | 179 | 38922
2048 | 719 | very large (on the order of several days)
4096 | 2820 | very large (on the order of several days)
8192 | 11348 | very large (on the order of several days)

Table 1: Results of the N-body experiment on the Nvidia GeForce 8800 GTX GPU and the Intel(R) Core(TM)2 Quad 2.66GHz CPU
[Chart: execution time in seconds (logarithmic scale) versus number of particles (K), with one series for the GPU and one for the CPU]

Figure 20: Comparison of execution time between the GPU and the CPU by number of particles in the N-body simulation

Remarks: For small particle counts (below 8K), the computation is very fast on both the CPU and the GPU, taking about one second or less, and at 1K and 2K particles the CPU is faster. As the number of particles keeps doubling, however, the GPU becomes many times faster than the CPU. The timing curves are consistent with an algorithmic complexity of O(N^2), where N is the number of particles: for large N, when the number of particles doubles, the execution time roughly quadruples on both the CPU and the GPU.

3.4.2.1.2. Speedup ratio between the CPU and the GPU

Particles (K) | Speedup ratio
1 | 0.10
2 | 0.21
4 | 1.43
8 | 2.00
16 | 10.00
32 | 36.00
64 | 148.00
128 | 202.67
256 | 245.00
512 | 208.52
1024 | 217.44

Table 2: Speedup ratio between the CPU and the GPU
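The quantities in the remarks above and in Table 2 can be recomputed from the timings of Table 1 with two small helper functions. This is an illustrative C++ sketch; the function names are hypothetical, and the scaling exponent is simply the base-2 logarithm of the ratio between the time at 2N particles and the time at N particles, so a value close to 2 is what an O(N^2) algorithm would produce.

```cpp
#include <cmath>

// Speedup ratio as used in Table 2: CPU time divided by GPU time.
double speedup(double cpuSeconds, double gpuSeconds) {
    return cpuSeconds / gpuSeconds;
}

// Estimated exponent p in t ~ N^p from the timings at N and at 2N particles.
double scalingExponent(double timeAtN, double timeAt2N) {
    return std::log2(timeAt2N / timeAtN);
}
```

For instance, the 128K row of Table 1 gives 608 / 3, which is about 202.67, and the GPU timings at 1024K and 2048K (179 s and 719 s) give an exponent very close to 2.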

[Chart: speedup ratio (times) versus number of particles (K), from 1K to 1024K]

Figure 21: Speedup ratio CPU/GPU as the number of particles in the N-body simulation increases

Remarks: The experiment above examines the speedup ratio between the CPU and the GPU. For simulations with few particles, the CPU and the GPU run at comparable speeds. From 64K particles upward, however, the difference is substantial: 148 times at 64K, and above 200 times from 128K upward. This demonstrates the computing power of the GPU.

3.4.2.1.3. Computing performance on the CPU

Figure 22: CPU load while running the N-body simulation with 256K particles. One CPU core is always at 100%, and the other cores occasionally reach 100% as well.

When the simulation runs on the CPU, the CPU load stays at its maximum of 100% for the entire run, which shows how much of the CPU's resources the computation consumes.

3.4.2.1.4. Computing performance on the GPU

[Chart: GPU performance in GFLOP/s versus number of particles (K)]

Figure 23: Performance of the GeForce 8800 GTX GPU in the N-body simulation as the number of particles increases

Particles (K) | Processing rate (GFLOP/s)
1 | 60.191
2 | 122.183
4 | 245.350
8 | 247.043
16 | 247.676
32 | 247.999
64 | 248.091
128 | 248.119
256 | 248.137
512 | 248.146
1024 | 248.144
2048 | 248.145
4096 | 248.142
8192 | 248.139

Table 3: Processing rate on the GeForce 8800 GTX GPU as the number of particles increases

Remarks: When the N-body simulation runs on the GeForce 8800 GTX, the GPU's resources are used at close to their maximum once the number of particles reaches 4K, and the rate then holds at about 248 GFLOP/s.

3.5. Conclusions from the experiments

The experimental results in this thesis show the superior computing capacity of the GPU compared with the CPU on the parallel N-body simulation problem. The results show that the GeForce 8800 GTX graphics card delivers a little over 200 times the processing capacity of the Intel Core 2 Quad 2.66GHz CPU on this parallel problem. An impressive result!


CONCLUSION
This thesis surveys parallel computing, which is the prerequisite for developing general-purpose GPU applications. The author also studied how the GPU works, its internal architecture, and the GPU programming model. Chapter 2 examines CUDA, currently the most popular GPU programming tool. The author also presents in detail the programming model, the hardware implementation on Nvidia graphics cards, the programming interface, and the performance guidelines for running applications on the graphics card. Building on this understanding, the author carried out experiments measuring the computing capacity of the GPU against the CPU, in order to verify what the theory states. The experimental results are presented in detail in Chapter 3. With these results, the author hopes to carry out further research on improving the performance of the N-body simulation on the GPU by reducing the computational complexity from O(N^2) to O(N log N). It is hoped that future research will achieve this.

REFERENCES

[1] A. E. Lefohn, "A streaming narrow-band algorithm: Interactive computation and visualization of level-set surfaces," Master's thesis, University of Utah, Dec. 2003.
[2] B. Bustos, O. Deussen, S. Hiller, and D. Keim, "A graphics hardware accelerated algorithm for nearest neighbor search," in Proceedings of the 6th International Conference on Computational Science, ser. Lecture Notes in Computer Science. Springer, May 2006, vol. 3994, pp. 196-199.
[3] D. Blythe, "The Direct3D 10 system," ACM Transactions on Graphics, vol. 25, no. 3, pp. 724-734, Aug. 2006.
[4] D. Horn, "Stream reduction operations for GPGPU applications," in GPU Gems 2, M. Pharr, Ed. Addison Wesley, Mar. 2005, ch. 36, pp. 573-589.
[5] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using data-parallelism to program GPUs for general-purpose uses," in Proceedings of the Twelfth International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2006, pp. 325-335.
[6] Eclipse Parallel Tools Platform, http://www.eclipse.org/ptp/
[7] R. A. Gingold and J. J. Monaghan, "Smoothed particle hydrodynamics - theory and application to non-spherical stars," MNRAS, vol. 181, pp. 375-389, 1977.
[8] GPU Gems 3, Chapter 31. Fast N-Body Simulation with CUDA, http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html
[9] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: Stream computing on graphics hardware," ACM Transactions on Graphics, vol. 23, no. 3, pp. 777-786, Aug. 2004.
[10] Introduction to Parallel Computing, http://www.llnl.gov/computing/tutorials/parallel_comp/
[11] J. Barnes and P. Hut, "A Hierarchical O(NlogN) Force-Calculation Algorithm," Nature, vol. 324, pp. 446-449, Dec. 1986.
[12] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the GPU: Conjugate gradients and multigrid," ACM Transactions on Graphics, vol. 22, no. 3, pp. 917-924, Jul. 2003.
[13] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. Purcell, "A survey of general-purpose computation on graphics hardware," Computer Graphics Forum, vol. 26, no. 1, pp. 80-113, 2007.
[14] J. Krüger and R. Westermann, "Linear algebra operators for GPU implementation of numerical algorithms," ACM Transactions on Graphics, vol. 22, no. 3, pp. 908-916, Jul. 2003.
[15] J. Krüger, P. Kipfer, P. Kondratieva, and R. Westermann, "A particle system for interactive visualization of 3D flows," IEEE Transactions on Visualization and Computer Graphics, vol. 11, no. 6, pp. 744-756, Nov./Dec. 2005.
[16] J. Postel and J. Reynolds, "File Transfer Protocol," RFC 959, 1985, http://www.ietf.org/rfc/rfc0959.txt
[17] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU Computing," Proceedings of the IEEE, vol. 96, no. 5, May 2008.
[18] K. E. Batcher, "Sorting networks and their applications," in Proceedings of the AFIPS Spring Joint Computing Conference, vol. 32, Apr. 1968, pp. 307-314.
[19] K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the efficiency of GPU algorithms for matrix-matrix multiplication," in Graphics Hardware 2004, Aug. 2004, pp. 133-138.
[20] L. B. Lucy, "A numerical approach to the testing of the fission hypothesis," Astronomical Journal, vol. 82, pp. 1013-1024, Dec. 1977.
[21] L. Greengard and V. Rokhlin, "A fast algorithm for particle simulations," Journal of Computational Physics, vol. 73, pp. 325-348, Dec. 1987.
[22] M. Harris, "Mapping computational concepts to GPUs," in GPU Gems 2, M. Pharr, Ed. Addison Wesley, Mar. 2005, ch. 31, pp. 493-508.
[23] M. Kass, A. Lefohn, and J. Owens, "Interactive depth of field using simulated diffusion on a GPU," Pixar Animation Studios, Tech. Rep. #06-01, Jan. 2006, http://graphics.pixar.com/DepthOfField/.
[24] M. McCool, "Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform," in GSPx Multicore Applications Conference, Oct./Nov. 2006.
[25] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule, "Shader algebra," ACM Transactions on Graphics, vol. 23, no. 3, pp. 787-795, Aug. 2004.
[26] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, "LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware," in Proceedings of the ACM/IEEE Conference on Supercomputing, Nov. 2005, p. 3.
[27] N. K. Govindaraju and D. Manocha, "Efficient relational database management using graphics processors," in ACM SIGMOD Workshop on Data Management on New Hardware, Jun. 2005, pp. 29-34.
[28] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha, "Fast computation of database operations using graphics processors," in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Jun. 2004, pp. 215-226.
[29] N. K. Govindaraju, M. Henson, M. C. Lin, and D. Manocha, "Interactive visibility ordering of geometric primitives in complex environments," in Proceedings of the 2005 Symposium on Interactive 3D Graphics and Games, Apr. 2005, pp. 49-56.
[30] PADE, http://math.nist.gov/mcsd/savg/pade/
[31] P-GRADE, http://www.lpds.sztaki.hu/pgrade/
[32] Wikipedia, http://en.wikipedia.org/wiki/Graphics_processing_unit
