Attribution Non-Commercial (BY-NC)


(1) In the original tiled matrix multiplication algorithm, one element of the output matrix is processed by one thread. This requires a dot product between one row of Md and one column of Nd, so multiple blocks redundantly load each Md tile. The redundancy can be eliminated by letting each block load a 2*TILE_SIZE region and having each thread calculate 2 elements of the output matrix, which improves performance. This reduces the global memory accesses by 1/4.

(2) __shared__ float list[256]; list[threadIdx % 256] = list[255 - threadIdx % 256];

(3) The compiler flags will be -g -G -O0.

(4) After compiling the code with the above flags added in the makefile (built with make --debug=1), run the executable under cuda-gdb and set breakpoints for easier debugging, then step through the program line by line. When the program reaches the error point, cuda-gdb will show the failing statement and we will be able to pinpoint the error location. It is also useful to inspect the values in different threads to find corrupt values.

(5) __syncthreads() only synchronizes threads within their own block; it does not synchronize data across blocks. It is therefore quite possible that a neighbor of a point in block 1 that lives in block 2 has not yet been updated, which will definitely cause errors in the calculation. Instead, launch the same kernel at every step and call cudaDeviceSynchronize() after it.

(6) Precision is the largest number of digits to which a value can be represented. It is determined by the number of mantissa bits: the more mantissa bits, the higher the precision of the system. Accuracy, on the other hand, is determined by the operations performed on a floating-point number. The accuracy of a floating-point arithmetic operation is measured by the maximal error it introduces; the most common source of error is rounding in the operation.
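The mantissa-bit argument in (6) can be checked directly in C. The helper names below are illustrative only: float keeps 24 significant bits, so above 2^24 not every integer is representable and adding 1.0f can be lost entirely to rounding, while double (53 significant bits) still represents the result exactly.

```c
#include <float.h>

/* Precision is fixed by the mantissa width: float has FLT_MANT_DIG = 24
   significant bits, so at magnitude 2^25 the spacing between adjacent
   floats is 4.0 and adding 1.0f rounds back to the original value. */
int addition_lost_float(float x)   { return x + 1.0f == x; }
int addition_lost_double(double x) { return x + 1.0  == x; }
```

For example, `addition_lost_float(33554432.0f)` (2^25) is true while `addition_lost_double(33554432.0)` is false, showing that the same operation is accurate at double precision.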
(7) Since it is performing 10 iterations, it will have a maximum error of 0.5 ULP.

(8) For the histogram, the student can use shared memory for privatization and perform atomic operations on shared memory to compute the histogram of each block. After the per-block calculation, write it back to global memory using atomic operations.

(9) cudaHostAlloc uses pinned memory, which is why the student is able to get a high transfer rate between host and device: the DMA controller does not need an extra staging copy in this case, hence the better performance. She should also be cautious because her system has only 2 GB of memory; allocating 1 GB of pinned memory will cause system performance to suffer. She must also free the pinned memory before exiting the program.
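The two-phase privatization pattern in (8) can be sketched on the host in plain C. This is a sequential simulation, not the CUDA kernel itself: the per-block private array stands in for shared memory, and the two merge points mark where atomicAdd would be used on the GPU. The function name and block partitioning are illustrative assumptions.

```c
#include <string.h>

#define NUM_BINS 256

/* Each "block" accumulates into its own private histogram (shared memory
   on the GPU, updated with atomicAdd by the block's threads), then adds
   its counts into the global histogram once (atomicAdd on global memory). */
void histogram_privatized(const unsigned char *data, int n,
                          int num_blocks, unsigned int *global_hist) {
    memset(global_hist, 0, NUM_BINS * sizeof(unsigned int));
    int chunk = (n + num_blocks - 1) / num_blocks;  /* elements per block */
    for (int b = 0; b < num_blocks; ++b) {
        unsigned int priv[NUM_BINS] = {0};          /* "shared memory" copy */
        int start = b * chunk;
        int end = start + chunk < n ? start + chunk : n;
        for (int i = start; i < end; ++i)
            priv[data[i]]++;                        /* atomicAdd on shared */
        for (int bin = 0; bin < NUM_BINS; ++bin)
            if (priv[bin])
                global_hist[bin] += priv[bin];      /* atomicAdd on global */
    }
}
```

The benefit on the GPU is that the heavily contended atomics land in fast shared memory, and only NUM_BINS global atomics are issued per block.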

(10) Since all of its accesses are to global memory, it will not be able to achieve that high a performance.

Question (2)
(2a) Four thread blocks will be generated.
(2b) 256/32 = 8 warps per block.
(2c) 1024
(2d) There is control divergence, and it is in line 2. The divergence comes from the ceil function, because it generates more threads than there are elements. So in the last warp only the first 8 threads will do work and the rest will be idle.
(2e) No, there is no control divergence in this case, because the number of threads generated equals the number of elements.
(2f)
Line 4: int padded_size = (int)ceil(n/(float)BLOCK_SIZE) * BLOCK_SIZE;
Line 5: cudaMalloc((void**) &A_d, padded_size*sizeof(float));
Line 6: cudaMalloc((void**) &B_d, padded_size*sizeof(float));
Line 7: cudaMalloc((void**) &C_d, padded_size*sizeof(float));
Line 10: vecAddKernel<<<(int)ceil(n/(float)BLOCK_SIZE), BLOCK_SIZE>>>(A_d, B_d, C_d, padded_size);

(2g) The if statement is not needed, since n inside the kernel will always be greater than i; hence that condition is always satisfied.
(2h) padded_size for 1,000,000 elements will be 1,000,192, so 192 extra elements are processed, but this number is very small compared to 1,000,000 and will not have a significant impact on performance.
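The padding arithmetic in (2f) and (2h) is easy to check on the host. The sketch below assumes BLOCK_SIZE = 256, which is consistent with the 1,000,192 figure, and uses the integer equivalent of the ceil expression on line 4:

```c
#define BLOCK_SIZE 256  /* assumed; consistent with the 1,000,192 figure in (2h) */

/* Integer equivalent of line 4 in (2f):
   round n up to the next multiple of BLOCK_SIZE. */
int padded_size(int n) {
    return (n + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
}
```

For n = 1,000,000 this yields 1,000,192, i.e. 192 padded elements, matching (2h).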

Question (3):
(3a) Since no __syncthreads() is applied, this code will only run correctly within a single warp, that is, a total of 32 threads. Therefore 16 and 32 are the only block sizes for which this code works on the device.
(3b) A __syncthreads() before line 10 will fix the problem.
(3c) The block sizes 128 and 256 will provide coalesced global memory access, because with these sizes the blocks can fully utilize the limit on the total number of threads on an SM.
(3d) For the global memory store there is no problem with memory bursts as long as TILE_SIZE is a multiple of 16. Therefore block sizes with good memory access will be 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228, and 244.
(3e) For the Lecture 10 convolution, the overhead for TILE_SIZE = 16 was 144 extra cells loaded per tile, so the total extra overhead = 144 * 1024/16 = 9216 cells.
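The 144-cell figure in (3e) can be reproduced with a small helper. The sketch assumes a 2D tile with a 2-cell halo on each side, which is consistent with both the loaded-region widths 20, 36, ... in (3d) and the 144 * 1024/16 arithmetic in (3e); the function name is illustrative.

```c
/* Extra cells loaded per 2D convolution tile when a halo of `halo`
   cells is loaded on each side of a tile x tile output region. */
int halo_overhead(int tile, int halo) {
    int loaded = tile + 2 * halo;       /* e.g. 16 + 2*2 = 20 per side */
    return loaded * loaded - tile * tile;
}
```

With tile = 16 and halo = 2 this gives 20*20 - 16*16 = 144 extra cells per tile, and 144 * (1024/16) = 9216 in total.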

Question (4) I am assuming that the total volume is 4096*1024*1024 cells.
(4a) A few things must be considered in order to decompose the domain. First, memory coalescing: since memory is arranged first along x, then y, then z, it is not advisable to decompose in the x-direction, and a fair bit of coalescing is lost if we decompose along the y-direction. That leaves the z-direction. Second, the number of boundary points: decomposing along the z-direction gives 1024*1024 boundary cells, while decomposing along the x- or y-direction gives 2048*1024 cells. Once again, z-decomposition wins.
(4b) Time taken per step = total floating-point operations / (GFLOPS * 10^9) = (4096*1024*1024*12)/(480*10^9) seconds = 107.37 msec.
(4c) Data to be exchanged = 2 * (boundary cells) * sizeof(float) = 2*1024*1024*4 = 8388608 bytes = 8.389 MB.
(4d) Time needed to exchange the data = data / link speed = 8388608/(6*10^9) seconds = 1.39 msec.
(4e) The calculation time on two GPUs is 107.37/2 = 53.68 msec and the transfer time is 1.39 msec, so transfer time / calculation time = 1.39/53.68 = 0.026 = 2.6%. Since the calculation time is almost 40 times the transfer time, this is definitely a performance improvement for this case.
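The arithmetic in (4b) through (4e) can be captured in two small helpers; the function names and the millisecond units are illustrative choices, and the inputs are the figures assumed above (4096*1024*1024 cells, 12 FLOP per cell, 480 GFLOPS, a 1024*1024-cell boundary, and a 6 GB/s link).

```c
/* Per-step compute time (ms) for `cells` grid cells at `flops_per_cell`
   floating-point operations each, on a device sustaining `flops_per_sec`. */
double step_ms(double cells, double flops_per_cell, double flops_per_sec) {
    return cells * flops_per_cell / flops_per_sec * 1e3;
}

/* Boundary-exchange time (ms): `boundary_cells` 4-byte floats must be
   sent in both directions over a link of `bytes_per_sec`. */
double exchange_ms(double boundary_cells, double bytes_per_sec) {
    return 2.0 * boundary_cells * 4.0 / bytes_per_sec * 1e3;
}
```

With the numbers above, step_ms(4096.0*1024*1024, 12, 480e9) is about 107.37 and exchange_ms(1024.0*1024, 6e9) is about 1.40, so the transfer-to-compute ratio on two GPUs comes out near 2.6%, matching (4e).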
