
Question 1) (1) When redundant work is done across threads, it is useful to let each thread do more work.

For example, in the original tiled matrix multiplication algorithm, one element of the output matrix is computed by one thread. This requires a dot product between one row of Md and one column of Nd. In this case multiple blocks redundantly load the same Md tile. The redundancy can be eliminated by letting each block load 2*TILE_SIZE elements and having each thread calculate 2 elements of the output matrix, hence improving performance. This reduces the global memory accesses by 1/4.

(2) __shared__ float list[256]; list[threadIdx.x % 256] = list[255 - threadIdx.x % 256];

(3) The compiler flags will be -g -G -O0.

(4) After compiling the code with the above flags added in the makefile, we can run the executable with cuda-gdb and set some breakpoints for better debugging, then step through the program line by line. When the program reaches the error point, cuda-gdb will show the failing statement and we will be able to pinpoint the error location. It is also useful to inspect the values in different threads to find corrupt values.

(5) I would say that since __syncthreads() only synchronizes the threads within their own block, it does not guarantee that data in other blocks has been updated. Therefore it is very well possible that a neighbor of a point in block 1 which lies in block 2 has not been updated; this will definitely cause errors in the calculations. Therefore launch the same kernel at every step and use cudaDeviceSynchronize() after it, so each kernel launch acts as the global synchronization point.

(6) Precision is defined by how many significant digits a number can be represented with. It is determined by the number of mantissa bits: the larger the number of mantissa bits, the larger the precision of the system. Accuracy, on the other hand, is determined by the operations performed on a floating-point number. The accuracy of a floating-point arithmetic operation is measured by the maximal error it introduces; the most common source of error is the rounding in the operation.
(7) Since it performs 10 iterations, it will have a maximum error of 0.5 ULP.

(8) For the histogram, the student can use shared memory for privatization and do atomic operations on the shared memory to calculate the histogram of each block. After each block's calculation, write it back to global memory using atomic operations.

(9) cudaHostAlloc uses pinned memory, and that is why the student is able to get a high transfer rate between pinned memory and the device: the DMA controller does not need an extra copy in this case, hence the better performance. She should also be cautious, because her system has only 2 GB of memory; allocating 1 GB of pinned memory will cause system performance to suffer. Also she must free the pinned memory (with cudaFreeHost) before exiting the program.

(10) Since all of its memory accesses go to global memory, it will not be able to achieve that high a performance.

Question (2) (2a) Four thread blocks will be generated. (2b) 256/32 = 8 warps. (2c) 1024 threads. (2d) There is control divergence, and it is caused by line 2. The divergence comes from the ceil function, because it generates a larger number of threads than the number of elements; so in the last warp only the first 8 threads will do the work and the rest will be idle. (2e) No, there is no control divergence in this case, because the number of threads generated equals the number of elements. (2f) The modified lines are:

Line 4:  int padded_size = ceil(n/(float)BLOCK_SIZE) * BLOCK_SIZE;
Line 5:  cudaMalloc((void**) &A_d, padded_size*sizeof(float));
Line 6:  cudaMalloc((void**) &B_d, padded_size*sizeof(float));
Line 7:  cudaMalloc((void**) &C_d, padded_size*sizeof(float));
Line 10: vecAddKernel<<<ceil(n/(float)BLOCK_SIZE), BLOCK_SIZE>>>(A_d, B_d, C_d, padded_size);

(2g) The if statement is not needed, since n inside the kernel will always be greater than i; hence that condition is always satisfied. (2h) padded_size for 1,000,000 elements will be 1,000,192. Therefore 192 extra elements will be processed, but this number is very small compared to 1,000,000, so it will not have a significant impact on performance.

Question (3): (3a) Since no __syncthreads() is applied, this code will only work within a single warp, that is, a total of 32 threads. Therefore 16 and 32 are the block sizes for which this code will work on a device. (3b) A __syncthreads() before line 10 will fix the problem. (3c) The block sizes 128 and 256 will provide coalesced global memory access, because with these sizes the blocks are multiples of the warp size and all the threads can fully utilize the limit on the total number of threads on an SM. (3d) For the global memory store there is no memory-burst problem as long as TILE_SIZE is a multiple of 16. Therefore the block sizes with good memory access will be 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228 and 244. (3e) For the Lecture 10 convolution, the overhead for TILE_SIZE 16 was 144 extra cells loaded per tile; therefore total extra overhead = 144 * 1024/16 = 9216 cells.

Question (4) I am assuming that the total volume is 4096*1024*1024 cells. (4a) A few things must be considered in order to decompose the domain. First, memory coalescing: since memory is arranged first along x, then y and then z, it is not advisable to decompose in the x-direction, and a fair bit of coalescing is preserved only if it is not decomposed along the y-direction either. Therefore the z-direction will be the choice. Second, the number of boundary points: if the domain is decomposed along the z-direction it will have 1024*1024 boundary cells, but if it is decomposed along the x- or y-direction it will have 2048*1024 cells. Once again this favors z-decomposition. (4b) Time taken in each step = total floating-point operations / (GFLOPS * 10^9) = (4096*1024*1024*12)/(480*10^9) seconds = 107.37 msec. (4c) Data that needs to be exchanged = 2 * (boundary cells) * sizeof(float) = 2*1024*1024*4 = 8388608 bytes = 8.389 MB. (4d) Time needed to exchange the data = data/link speed = 8388608/(6*10^9) seconds = 1.39 msec. (4e) The calculation time on two GPUs is 107.37/2 = 53.68 msec and the time taken to transfer the data is 1.39 msec, so transfer time / calculation time = 1.39/53.68 = 0.0259 = 2.59%. Since the calculation time is nearly 40 times higher than the transfer time, this is definitely a performance improvement in this case.