
# Homework 1

CDA 5155: Spring 2012

Due Date: 02/08/2012 11:55 PM (EDGE Students: 02/11/2012 11:55 PM)

Total: 20 points (5% of overall score)

You are not allowed to take or give help in completing this assignment. Submit the PDF version of your submission on the e-Learning (Sakai) website before the deadline. Please include the following sentence in bold at the top of your submission (PDF): **“I have neither given nor received any unauthorized aid on this assignment”**.

1. Assume that you are the product manager for the XXX processor. The chip has an area of 263 mm², with a defect rate of 0.025 defects per cm² and N = 11.5. The die of each chip is occupied by four identical cores (70% of the total area) and a shared L3 cache (30% of the total area). For simplicity, assume that each chip has only four cores and an L3 cache (no other components).

a. [1 Point] What is the yield of the die?

b. [1 Point] Some researchers have proposed that the number of defects in a die can be modeled by a geometric distribution. Suppose we can use the yield as the probability that there is no defect on a die. What is the value of the parameter p in the geometric distribution here? Note: a geometric distribution means that the probability that there are exactly k defects (k being a non-negative integer, k = 0, 1, 2, ...) is equal to

$$P(k) = (1 - p)^{k} \, p$$

where k is the number of occurrences of defects and p is a positive real number.

c. [3 Points] In a defective chip, assume that defects are independent and uniformly distributed within the die area. What is the probability that all defects in a DEFECTIVE die occur in the same core? (In other words, there is no defect in any of the other three cores or in the L3 cache.) Please note that there can be more than one defect on a die.

d. [1 Point] If there is only one defective core in a chip with a defect-free L3, we can still sell it by shutting down the defective core. Suppose you can sell a perfectly working (defect-free) chip for $259.99. Also assume that it costs $179 to manufacture and test each chip. What is the minimum sale price for your chips with 3 working cores (the defective core is shut down) to break even (no profit, no loss)?
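The quantities in Problem 1 can be explored numerically. The sketch below is illustrative only: it assumes the die-yield model commonly given in computer-architecture textbooks, yield = wafer yield / (1 + defects per unit area × die area)^N, with a wafer yield of 100%. The assignment does not state which model to use, so treat that choice, and the helper names `die_yield` and `geometric_pmf`, as assumptions.

```python
def die_yield(die_area_mm2, defects_per_cm2, n, wafer_yield=1.0):
    """Die yield under the ASSUMED model wafer_yield / (1 + D * A) ** n."""
    die_area_cm2 = die_area_mm2 / 100.0  # 1 cm^2 = 100 mm^2
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n


def geometric_pmf(k, p):
    """P(exactly k defects) for a geometric distribution on k = 0, 1, 2, ..."""
    if k < 0:
        raise ValueError("k must be a non-negative integer")
    return (1.0 - p) ** k * p


if __name__ == "__main__":
    y = die_yield(die_area_mm2=263, defects_per_cm2=0.025, n=11.5)
    print(f"die yield under the assumed model: {y:.4f}")
    # geometric_pmf(0, p) equals p, so treating the yield as P(0 defects)
    # fixes the parameter p directly.
    print(f"P(0 defects) with p = yield: {geometric_pmf(0, y):.4f}")
```

Because the probability of zero defects is simply p in this parameterization, the yield and the parameter are tied together without any further fitting.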

2. One day you got tired of your processor company, so you accepted an offer from a software division in a GPU company. Your job is to develop a numerical simulation program. The software will be used on a workstation with a single CPU core and 512 GPU cores. The CPU can achieve 1 GFlops, while each GPU core can deliver 3.9 GFlops (peak). The CPU and all the GPU cores can perform calculations simultaneously, unless there are any specific restrictions. For simplicity, we only consider floating-point operations in this problem.

[Figure: system diagram with components labeled GPU Core 1 ... GPU Core 512, GPU Cache, CPU Core, CPU Cache, and Memory.]

Suppose all input data is stored in memory prior to execution. Assume that the memory has infinite capacity. The CPU and GPU caches are 5 MB and 1 MB, respectively. The bandwidth between the CPU cache and memory is 6 GB/s, while the bandwidth between the GPU cache and memory is 36 GB/s. Before each round of computation, you have to load the required input data into the cache within the CPU or GPU cores. In other words, at most 5 MB/1 MB of data can be moved to the CPU/GPU cache before any computation. Similarly, the computation results, i.e., output data, if any, will be stored in the corresponding cache immediately after the computation. The output data must be written back to memory before the next round of computation. No data can be transferred during the computation: there is no overlap between data transfer and computation, and the results must be transferred back to memory before the next round of computation. Each byte of input data requires 2 Flops to produce 0.4 bytes of output data on average. There is no dependency among different parts of the data.

a. [2 Points] If all of the dynamic instructions in your main application are parallelizable, what is the maximum performance (Flops) you can get from your hardware in the optimal situation? What is the speedup compared with CPU-only execution? State your assumptions, if any.

b. [3 Points] In reality, the computation cannot be performed without data. What is the maximum performance (Flops) you can get from your hardware in the optimal situation if we take the data transfer time into consideration? What is the speedup in this case compared to CPU-only execution? What happens to Flops and speedup if the GPU cache is 20 MB?
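One way to reason about Problem 2 is to model a single round as load, compute, store, with no overlap between the phases. The sketch below is a hedged illustration, not a reference solution. It assumes 1 GB = 1e9 bytes and 1 MB = 1e6 bytes, that each round fills the cache with input data, that the output's own cache footprint can be ignored, and that all 512 GPU cores share the one GPU cache. The function name `effective_flops` is hypothetical.

```python
GB = 1e9  # ASSUMPTION: decimal gigabytes
MB = 1e6  # ASSUMPTION: decimal megabytes


def effective_flops(peak_flops, bandwidth_bytes_per_s, cache_bytes,
                    flops_per_input_byte=2.0, output_bytes_per_input_byte=0.4):
    """Sustained Flops for one device when transfer and compute never overlap.

    ASSUMPTION: every round loads exactly cache_bytes of input, computes on it,
    then writes all of the produced output back to memory.
    """
    input_bytes = cache_bytes
    output_bytes = output_bytes_per_input_byte * input_bytes
    flops = flops_per_input_byte * input_bytes

    t_load = input_bytes / bandwidth_bytes_per_s
    t_compute = flops / peak_flops
    t_store = output_bytes / bandwidth_bytes_per_s
    return flops / (t_load + t_compute + t_store)


if __name__ == "__main__":
    cpu = effective_flops(1e9, 6 * GB, 5 * MB)
    gpu = effective_flops(512 * 3.9e9, 36 * GB, 1 * MB)
    print(f"CPU sustained: {cpu / 1e9:.3f} GFlops")
    print(f"GPU sustained: {gpu / 1e9:.3f} GFlops")
    print(f"CPU + GPU speedup over CPU only: {(cpu + gpu) / cpu:.1f}x")
```

Under these particular assumptions the round size cancels out of the ratio, so the sustained rate depends on bandwidth and peak throughput rather than on cache capacity. A model with per-transfer latency, or one in which the output must share cache space with the input, could behave differently.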

3. The ARM instruction set offers an instruction to load multiple registers, LDMIA. In other words, we can replace

    LOAD  R2, 0(R1)
    LOAD  R7, 4(R1)
    ADD   R5, R7, R2
    STORE R5, 4(R6)

by

    LDMIA R1, {R2, R7}
    ADD   R5, R7, R2
    STORE R5, 4(R6)

There are two different formats of LDMIA available: LDMIA R1, {R2, R7} and LDMIA R1, {R2-R7}. The first format can load 2 registers. The second format can load all registers in the range. LDMIA R1, {R2, R7} will perform the following two operations, i.e., R2 = memory[R1] and R7 = memory[R1+4]. LDMIA R1, {R2-R7} will perform the following operations, i.e., R2 = memory[R1], R3 = memory[R1+4], R4 = memory[R1+8], R5 = memory[R1+12], R6 = memory[R1+16], and R7 = memory[R1+20].

a. [3 Points] LDMIA is not currently supported by the MIPS instruction set. Can you design the binary format (encoding) for this LDMIA instruction if we want to add it to MIPS? Please note that your encoding should support instructions like LDMIA R1, {R2, R7} and LDMIA R1, {R2-R7}. Suppose all instructions are still 32 bits. Since this is an R-Type instruction, there are 6 bits reserved for the opcode.

b. [2 Points] Assume that the new instruction will cause the clock cycle to increase by 2.5%. The new instruction affects only the clock speed and not the CPI. Assume that 26% of dynamic instructions are loads. If only 20% of load instructions can be eliminated by the new instruction, will the overall performance change? Indicate the change (e.g., if improved, by how much, etc.). Please do not perform any scheduling or other optimizations of the above code sequence!

4. [4 Points] Assume that the values of variables A, B, C and D reside in memory. Write the code sequence for D = B*(A+B-C) + A*D for four instruction-set architectures: i) Stack, ii) Accumulator, iii) Register-memory and iv) Register-register (Load-Store). (These four architectures are shown in Figure A.1 on page A-4 of Appendix A.)
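The LDMIA semantics described in Problem 3 can be mimicked with a tiny register/memory model, and the trade-off in part b follows from CPU time = instruction count × CPI × clock cycle time. The sketch below is illustrative only. The helper names `ldmia`, `register_range`, and `relative_cpu_time` are hypothetical, memory is modeled as a dictionary from byte address to word value, and "eliminated" loads are assumed to simply disappear from the dynamic instruction count.

```python
def ldmia(memory, regs, base_reg, targets, word_bytes=4):
    """Load consecutive words starting at regs[base_reg] into each target register."""
    base_addr = regs[base_reg]  # read the base once, before any register is overwritten
    for i, reg in enumerate(targets):
        regs[reg] = memory[base_addr + i * word_bytes]
    return regs


def register_range(first, last):
    """Expand the {Rfirst-Rlast} form into an explicit list of register numbers."""
    return list(range(first, last + 1))


def relative_cpu_time(load_fraction, loads_eliminated, cycle_increase):
    """New CPU time divided by old CPU time, with CPI assumed unchanged."""
    instruction_count = 1.0 - load_fraction * loads_eliminated
    cycle_time = 1.0 + cycle_increase
    return instruction_count * cycle_time


if __name__ == "__main__":
    memory = {100 + 4 * i: 10 * (i + 1) for i in range(6)}
    regs = {r: 0 for r in range(8)}
    regs[1] = 100

    ldmia(memory, regs, 1, [2, 7])                 # LDMIA R1, {R2, R7}
    print(regs[2], regs[7])
    ldmia(memory, regs, 1, register_range(2, 7))   # LDMIA R1, {R2-R7}
    print([regs[r] for r in range(2, 8)])

    ratio = relative_cpu_time(load_fraction=0.26, loads_eliminated=0.20,
                              cycle_increase=0.025)
    print(f"new time / old time = {ratio:.4f}")
```

A ratio below 1 means the instruction-count reduction outweighs the slower clock under these assumptions; a ratio above 1 means it does not.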