
DHANALAKSHMI SRINIVASAN ENGINEERING COLLEGE

PERAMBALUR 621 212


CYCLE TEST I NOVEMBER 2014
Part A

(10 X 2 = 20)

1. Define Amdahl's Law.

Amdahl's Law states that the performance improvement to be gained from using some faster
mode of execution is limited by the fraction of the time the faster mode can be used. It
defines the speedup that can be gained by using a particular feature. Suppose that
we can make an enhancement to a computer that will improve performance when it is used. Then

$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$$

2. What is Instruction Level Parallelism?


Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer
program can be performed simultaneously. The potential overlap among instructions is called
instruction level parallelism.
3. Find the number of dies per 300mm (30cm) wafer for a die that is 1.5cm on a side and for
a die that is 1.0cm on a side.
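
For Question 3, a worked sketch, assuming the commonly used dies-per-wafer approximation for square dies. The formula is an assumption here, since it is not stated in this answer key; its second term estimates the partial dies lost along the wafer edge.

```latex
\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}}
                      - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}

% Die 1.5 cm on a side: die area = 2.25 cm^2
\frac{\pi \times 15^2}{2.25} - \frac{\pi \times 30}{\sqrt{4.5}} \approx 314.2 - 44.4 \approx 270

% Die 1.0 cm on a side: die area = 1.00 cm^2
\frac{\pi \times 15^2}{1.00} - \frac{\pi \times 30}{\sqrt{2}} \approx 706.9 - 66.6 \approx 640
```

Under this approximation, the wafer holds about 270 of the larger dies and about 640 of the smaller dies.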

4. Define response time and throughput.

Response time
Also called execution time. The total time required for the computer to complete a task,
including disk accesses, memory accesses, I/O activities, operating system overhead, CPU
execution time, and so on.
Throughput
Also called bandwidth. Another measure of performance, it is the number of tasks completed
per unit time.

5. List the various classes of computers.

o Supercomputer
o Mainframe computer
o Minicomputer
o Microcomputer

6. What are the various types of dependencies?


There are 5 types of data dependencies. They are as follows:
(1) Flow dependence
(2) Anti-dependence
(3) Output dependence
(4) I/O dependence
(5) Unknown dependence
7. What are the primary components of vector architecture?
Vector register
Vector functional units
Vector load/store unit
Set of scalar registers
8. Define strip mining with example.
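When a loop is shorter than the maximum vector length, vector architectures use a vector-length
register to reduce the length of vector operations. When a loop is longer, bookkeeping code is added
to iterate over full-length vector operations and to handle the leftovers. This latter process is
called strip mining. For example, with 64-element vector registers, a 200-element loop is processed
as one strip of 8 elements followed by three strips of 64 elements.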
9. Define gather-scatter.
Gather and scatter are vector load and store operations that find their addresses in another vector
register; think of them as register indirect addressing for vector computers. A gather loads elements
from scattered memory locations into a dense vector register, and a scatter stores the elements of a
vector register back to scattered locations. In contrast, short-vector SIMD extensions typically
support only unit-stride accesses: memory accesses load or store all elements at once from a single
wide memory location. Since the data for multimedia applications are often streams that start and
end in memory, strided and gather/scatter addressing modes are essential to successful vectorization.
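
A minimal sketch of what gather and scatter compute, written as scalar C loops. The function names `gather` and `scatter` and the index array `idx` are hypothetical names chosen for illustration; a vector machine would perform each loop as a single indexed vector load or store.

```c
#include <stddef.h>

/* Gather: dst[i] = src[idx[i]], the effect of an indexed vector load.
   idx plays the role of the vector register that holds the addresses. */
void gather(double *dst, const double *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Scatter: dst[idx[i]] = src[i], the effect of an indexed vector store. */
void scatter(double *dst, const double *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}
```

A sparse computation would typically gather the needed elements into a dense vector, operate on it with ordinary vector instructions, and scatter the results back.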

10. Consider the loop :

for(i=0;i<100;i=i+1)
{
A[i+1]=A[i]+C[i]; /*S1*/
B[i+1]=B[i]+A[i+1]; /*S2*/
}
What are the dependences between S1 and S2 in the loop?
Answer
There are two different dependences:
1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1],
which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
2. S2 uses the value, A[i+1], computed by S1 in the same iteration.
These two dependences are different and have different effects. To see how they differ, let's assume
that only one of these dependences exists at a time. Because the dependence of statement S1 is on an
earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations
of this loop to execute in series.
The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if
this were the only dependence, multiple iterations of the loop could execute in parallel, as long as
each pair of statements in an iteration were kept in order. Loop unrolling can expose the
parallelism in this type of dependence.
It is also possible to have a loop-carried dependence that does not prevent parallelism.

Part B
Answer All the Questions.

(5X16=80)

11. a. Explain the concepts and challenges of Instruction Level Parallelism (ILP).
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer
program can be performed simultaneously. The potential overlap among instructions is called
instruction level parallelism.
There are two largely separable approaches to exploiting ILP: an approach that relies on
hardware to help discover and exploit the parallelism dynamically, and an approach that relies on
software technology to find parallelism, statically at compile time. Processors using the dynamic,
hardware-based approach, including the Intel Pentium series, dominate in the market; those using the
static approach, including the Intel Itanium, have more limited uses in scientific or
application-specific environments.
The simplest and most common way to increase the ILP is to exploit parallelism among
iterations of a loop. This type of parallelism is often called loop-level parallelism. Here is a simple
example of a loop, which adds two 1000-element arrays, that is completely parallel:
for (i=1; i<=1000; i=i+1)
    x[i] = x[i] + y[i];
Every iteration of the loop can overlap with any other iteration, although within each loop iteration
there is little or no opportunity for overlap.
Data Dependences and Hazards
Determining how one instruction depends on another is critical to determining how much
parallelism exists in a program and how that parallelism can be exploited. In particular, to exploit
instruction-level parallelism we must determine which instructions can be executed in parallel. If two
instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without
causing any stalls, assuming the pipeline has sufficient resources (and hence no structural hazards
exist). If two instructions are dependent, they are not parallel and must be executed in order, although
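they may often be partially overlapped. The key in both cases is to determine whether an instruction
is dependent on another instruction.
There are three different types of dependences: data dependences (also called true data dependences),
name dependences, and control dependences. An instruction j is data dependent on instruction i if
either of the following holds:
o Instruction i produces a result that may be used by instruction j.
o Instruction j is data dependent on instruction k, and instruction k is data dependent on
instruction i.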
Data Hazards
A hazard is created whenever there is dependence between instructions, and they are close
enough that the overlap during execution would change the order of access to the operand involved
in the dependence. Because of the dependence, we must preserve what is called program order, that
is, the order that the instructions would execute in if executed sequentially one at a time as
determined by the original source program. The goal of both our software and hardware techniques is
to exploit parallelism by preserving program order only where it affects the outcome of the program.
Detecting and avoiding hazards ensures that necessary program order is preserved.
Consider two instructions i and j, with i preceding j in program order.
RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old
value. This hazard is the most common type and corresponds to a true data dependence. Program order
must be preserved to ensure that j receives the value from i.

WAW (write after write): j tries to write an operand before it is written by i. The writes end up
being performed in the wrong order, leaving the value written by i rather than the value written by j
in the destination. This hazard corresponds to an output dependence. WAW hazards are present only in
pipelines that write in more than one pipe stage or allow an instruction to proceed even
when a previous instruction is stalled.
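
The hazard definitions can be sketched as simple register checks. This is an illustrative sketch, not a model of any real pipeline: the `Instr` record and the function names are hypothetical, and the WAR (write after read) check is included as the usual third case, which corresponds to an antidependence, although it is not listed in the answer above.

```c
#include <stdbool.h>

/* Hypothetical instruction record: one destination and up to two source
   registers. A register number of -1 means "unused". */
typedef struct {
    int dest;
    int src1;
    int src2;
} Instr;

/* True if instruction x reads register reg. */
static bool reads(const Instr *x, int reg)
{
    return reg >= 0 && (x->src1 == reg || x->src2 == reg);
}

/* RAW: the later instruction j reads a register that i writes. */
bool has_raw(const Instr *i, const Instr *j)
{
    return reads(j, i->dest);
}

/* WAR: the later instruction j writes a register that i reads. */
bool has_war(const Instr *i, const Instr *j)
{
    return reads(i, j->dest);
}

/* WAW: both instructions write the same register. */
bool has_waw(const Instr *i, const Instr *j)
{
    return i->dest >= 0 && i->dest == j->dest;
}
```

In each function, `i` is assumed to come before `j` in program order; a check that returns true marks a pair whose relative order must be preserved.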
11. b. What is multithreading? Discuss different types of multithreading in detail.

Multithreading is similar to multiprocessing.

A multiprocessing operating system can run several processes at the same time:
o Each process has its own address (memory) space.
o The OS's scheduler decides when each process is executed.
o On a single processor, only one process is actually executing at any given time. However, the
system appears to be running several programs simultaneously.
o Separate processes do not have access to each other's memory space.
o Many OSes have a shared memory system so that processes can share memory space.

In a multithreaded application, there are several points of execution within the same memory
space:
o Each point of execution is called a thread.
o Threads share access to memory.

Hardware multithreading
Increasing utilization of a processor by switching to another thread when one thread is stalled.
Thread
A thread includes the program counter, the register state, and the stack. It is a lightweight process;
whereas threads commonly share a single address space, processes don't.
Process
A process includes one or more threads, the address space, and the operating system state. Hence, a
process switch usually invokes the operating system, but a thread switch does not.
Fine-grained multithreading
A version of hardware multithreading that implies switching between threads after every instruction.
Coarse-grained multithreading
A version of hardware multithreading that implies switching between threads only after significant
events, such as a last-level cache miss.
Simultaneous multithreading (SMT)
A version of multithreading that lowers the cost of multithreading by utilizing the resources needed
for a multiple-issue, dynamically scheduled microarchitecture.

12. a. What is multicore processor? Explain how a multicore processor works.

A multi-core processor is a processing system composed of two or more independent cores
(or CPUs). The cores are typically integrated onto a single integrated circuit die (known as a chip
multiprocessor or CMP), or they may be integrated onto multiple dies in a single chip package.
A multi-core processor implements multiprocessing in a single physical package. Cores in a
multi-core device may be coupled together tightly or loosely. For example, cores may or may not
share caches, and they may implement message passing or shared memory inter-core communication
methods. Common network topologies to interconnect cores include bus, ring, 2-dimensional mesh,
and crossbar.
All cores are identical in symmetric multi-core systems and they are not identical in
asymmetric multi-core systems. Just as with single-processor systems, cores in multi-core systems
may implement architectures such as superscalar, vector processing, or multithreading.

In a multi-core design, each core has its own execution pipeline, and each core has the resources
required to run without blocking resources needed by the other software threads.
Although a two-core design is the simplest case, there is no inherent limitation in the number
of cores that can be placed on a single chip. Intel committed to shipping dual-core processors in
2005 and to adding more cores later. Mainframe processors already used more than two
cores, so there was precedent for this kind of development.
The multi-core design enables two or more cores to run at somewhat slower speeds and at much
lower temperatures. The combined throughput of these cores delivers processing power greater than
the maximum available on single-core processors and at a much lower level of power
consumption. In this way, Intel increases the capabilities of server platforms as predicted by Moore's
Law while the technology no longer pushes the outer limits of physical constraints.

12. b. Discuss Amdahl's Law and how processor speedup is calculated. Explain with an
example.
Amdahl's Law states that the performance improvement to be gained from using some faster
mode of execution is limited by the fraction of the time the faster mode can be used. It
defines the speedup that can be gained by using a particular feature. Suppose that
we can make an enhancement to a computer that will improve performance when it is used.
Speedup is the ratio

$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$$

Alternatively,

$$\text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$$

Speedup tells us how much faster a task will run using the computer with the enhancement as
opposed to the original computer.
Amdahl's Law gives us a quick way to find the speedup from some enhancement, which depends on
two factors:
1. The fraction of the computation time in the original computer that can be converted to take
advantage of the enhancement. For example, if 20 seconds of the execution time of a
program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This
value, which we will call Fraction enhanced, is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode; that is, how much faster the task would
run if the enhanced mode were used for the entire program. This value is the time of the original
mode over the time of the enhanced mode. If the enhanced mode takes, say, 2 seconds for a portion
of the program, while it is 5 seconds in the original mode, the improvement is 5/2. We will call this
value, which is always greater than 1, Speedup enhanced.
The execution time using the original computer with the enhanced mode will be the time spent using
the unenhanced portion of the computer plus the time spent using the enhancement:

$$\text{Execution time}_{new} = \text{Execution time}_{old} \times \left( (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right)$$

The overall speedup is the ratio of the execution times:

$$\text{Speedup}_{overall} = \frac{\text{Execution time}_{old}}{\text{Execution time}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$
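
As a small sketch of the calculation, the overall speedup is 1 / ((1 - Fraction enhanced) + Fraction enhanced / Speedup enhanced). The function name `amdahl_speedup` and its parameter names are hypothetical, chosen for illustration.

```c
/* Overall speedup according to Amdahl's Law.
   fraction_enhanced: share of the original execution time that can use
                      the enhancement (between 0 and 1).
   speedup_enhanced:  how much faster that share runs in enhanced mode (> 0). */
double amdahl_speedup(double fraction_enhanced, double speedup_enhanced)
{
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced);
}
```

Combining the two illustrative values from the text purely as an example, `amdahl_speedup(20.0 / 60.0, 5.0 / 2.0)` gives 1 / (2/3 + 2/15) = 1.25: even though the enhanced portion runs 2.5 times faster, the whole task is only 1.25 times faster because two thirds of the time is unaffected.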

13. a. Explain trends in power, energy, cost and technology in integrated circuits with example.
Energy and Power within a Microprocessor
For CMOS chips, the traditional primary energy consumption has been in switching transistors, also
called dynamic energy. The energy required per transistor is proportional to the product of the
capacitive load driven by the transistor and the square of the voltage:

$$\text{Energy}_{dynamic} \propto \text{Capacitive load} \times \text{Voltage}^2$$

This equation is the energy of a pulse of the logic transition 0→1→0 or 1→0→1. The energy of a
single transition (0→1 or 1→0) is then:

$$\text{Energy}_{dynamic} \propto \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$$

The power required per transistor is just the product of the energy of a transition multiplied by the
frequency of transitions:

$$\text{Power}_{dynamic} \propto \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$

For a fixed task, slowing the clock rate reduces power, but not energy. Clearly, dynamic power and
energy are greatly reduced by lowering the voltage, so voltages have dropped from 5V to just under
1V in 20 years. The capacitive load is a function of the number of transistors connected to an output
and the technology, which determines the capacitance of the wires and the transistors.

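Example: Some microprocessors today are designed to have adjustable voltage, so a 15%
reduction in voltage may result in a 15% reduction in frequency. What would be the impact on
dynamic energy and on dynamic power?
Answer: Since the capacitance is unchanged, the answer for energy is the ratio of the squares of
the voltages:

$$\frac{\text{Energy}_{new}}{\text{Energy}_{old}} = \frac{(\text{Voltage} \times 0.85)^2}{\text{Voltage}^2} = 0.85^2 \approx 0.72$$

so dynamic energy falls to about 72% of the original. For power, we also multiply by the ratio of the
frequencies:

$$\frac{\text{Power}_{new}}{\text{Power}_{old}} = 0.72 \times \frac{\text{Frequency switched} \times 0.85}{\text{Frequency switched}} \approx 0.61$$

so dynamic power falls to about 61% of the original.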
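
The voltage and frequency scaling in this example can be sketched in code, assuming the capacitive load is unchanged so that dynamic energy scales with the square of the voltage and dynamic power scales with the square of the voltage times the switching frequency. The function names are hypothetical.

```c
/* Ratio of new to old dynamic energy when the voltage is multiplied by
   v_scale and the capacitive load is unchanged (energy ~ V^2). */
double dynamic_energy_ratio(double v_scale)
{
    return v_scale * v_scale;
}

/* Ratio of new to old dynamic power when the voltage is multiplied by
   v_scale and the switching frequency by f_scale (power ~ V^2 * f). */
double dynamic_power_ratio(double v_scale, double f_scale)
{
    return v_scale * v_scale * f_scale;
}
```

For a 15% reduction in both voltage and frequency, `dynamic_energy_ratio(0.85)` is about 0.72 and `dynamic_power_ratio(0.85, 0.85)` is about 0.61.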

Integrated circuit costs are becoming a greater portion of the cost that varies between computers,
especially in the high-volume, cost-sensitive portion of the market. Indeed, with the increasing
reliance of personal mobile devices (PMDs) on whole systems on a chip (SOC), the cost of the integrated
circuits is much of the cost of the PMD. Thus, computer designers must understand the costs of chips
to understand the costs of current computers. Although the costs of integrated circuits have dropped
exponentially, the basic process of silicon manufacture is unchanged: a wafer is still tested and
chopped into dies that are packaged. Thus, the cost of a packaged integrated circuit is

$$\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$

In this section, we focus on the cost of dies, summarizing the key issues in testing and packaging at
the end. Learning how to predict the number of good chips per wafer requires first learning how
many dies fit on a wafer and then learning how to predict the percentage of those that will work.
From there it is simple to predict cost:

$$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$

13. b.i). Explain in detail the various types of dependencies with suitable example.
There are 5 types of data dependencies. They are as follows:
Flow dependence:
A statement S2 is flow-dependent on the statement S1 if an execution
path exists from S1 to S2 and if at least one output of S1 feeds in as input to
S2.
Ex:
S1: load R1, A
S2: add R2, R1

Anti-dependence:
Statement S2 is anti-dependent on statement S1 if S2 follows S1 in
program order and if the output of S2 overlaps the input to S1.
Ex:
S1: add R2, R1
S2: move R1, R3

Output dependence:
Two statements are output dependent if they produce the same output variable.
Ex:
S1: load R1, A
S2: move R1, R3

I/O dependence:
Read and write are I/O statements. I/O dependence occurs not because the same
variable is involved but because the same file is referenced by both I/O statements.

Unknown dependence:
The dependence relation between two statements cannot be determined in the
following situations:
o The subscript of a variable is itself subscripted.
o The subscript does not contain the loop index variable.
o A variable appears more than once with subscripts having different
coefficients of the loop variable.
o The subscript is nonlinear in the loop index variable.
When one or more of these conditions exists, a conservative assumption is to claim
unknown dependence among the statements involved.
ii). Find all the true dependences, output dependences and antidependences and eliminate the
output dependences and antidependences by renaming.
for(i=0;i<100;i=i+1){
Y[i]=X[i] / C; /*S1*/
X[i]=X[i] + C; /*S2*/
Z[i]=Y[i] + C; /*S3*/
Y[i]= C - Y[i]; /*S4*/
}

Answer
The following dependences exist among the four statements:
1. There are true dependences from S1 to S3 and from S1 to S4 because of Y[i]. These are not
loop carried, so they do not prevent the loop from being considered parallel. These
dependences will force S3 and S4 to wait for S1 to complete.
2. There is an antidependence from S1 to S2, based on X[i].
3. There is an antidependence from S3 to S4 for Y[i].
4. There is an output dependence from S1 to S4, based on Y[i].
The following version of the loop eliminates these false (or pseudo) dependences:
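
One way to write the renamed loop is sketched below, next to the original for comparison. The array names T and X1 are the renamed versions of Y and X; the function wrappers and the macro N are assumptions added so that the two versions can be compared.

```c
#define N 100

/* Original loop from the question (S1 to S4). */
void original_loop(double *X, double *Y, double *Z, double C)
{
    for (int i = 0; i < N; i = i + 1) {
        Y[i] = X[i] / C;   /* S1 */
        X[i] = X[i] + C;   /* S2 */
        Z[i] = Y[i] + C;   /* S3 */
        Y[i] = C - Y[i];   /* S4 */
    }
}

/* Renamed loop: only the true dependences on T[i] remain. */
void renamed_loop(const double *X, double *X1, double *Y, double *Z,
                  double *T, double C)
{
    for (int i = 0; i < N; i = i + 1) {
        T[i]  = X[i] / C;  /* Y renamed to T: removes the output dependence */
        X1[i] = X[i] + C;  /* X renamed to X1: removes the antidependence */
        Z[i]  = T[i] + C;  /* reads T, so S4 may overwrite Y freely */
        Y[i]  = C - T[i];
    }
}
```

Both versions leave the same values in Y and Z, and X1 in the renamed version holds what X holds after the original loop.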

After the loop, the variable X has been renamed X1. In code that follows the loop, the compiler can
simply replace the name X by X1. In this case, renaming does not require an actual copy operation
but can be done by substituting names or by register allocation. In other cases, however, renaming
will require copying.

14. a. Explain vector architecture with neat diagram and give the suitable example
We begin with a vector processor consisting of the primary components listed below. This
processor, which is loosely based on the Cray-1, is the foundation for discussion throughout this
section. We will call this instruction set architecture VMIPS; its scalar portion is MIPS, and its vector
portion is the logical vector extension of MIPS. The rest of this subsection examines how the basic
architecture of VMIPS relates to other processors.

The primary components of the instruction set architecture of VMIPS are the following:

o Vector registers: Each vector register is a fixed-length bank holding a single vector. VMIPS
has eight vector registers, and each vector register holds 64 elements, each 64 bits wide. The
vector register file needs to provide enough ports to feed all the vector functional units. These
ports will allow a high degree of overlap among vector operations to different vector
registers. The read and write ports, which total at least 16 read ports and 8 write ports, are
connected to the functional unit inputs or outputs by a pair of crossbar switches.
o Vector functional units: Each unit is fully pipelined, and it can start a new operation on
every clock cycle. A control unit is needed to detect hazards, both structural hazards for
functional units and data hazards on register accesses. VMIPS has five
functional units. For simplicity, we focus exclusively on the floating-point functional units.
o Vector load/store unit: The vector memory unit loads or stores a vector to or from memory.
The VMIPS vector loads and stores are fully pipelined, so that words can be moved between
the vector registers and memory with a bandwidth of one word per clock cycle, after an initial
latency. This unit would also normally handle scalar loads and stores.
o A set of scalar registers: Scalar registers can also provide data as input to the vector
functional units, as well as compute addresses to pass to the vector load/store unit. These are
the normal 32 general-purpose registers and 32 floating-point registers of MIPS. One input of
the vector functional units latches scalar values as they are read out of the scalar register file.

14. b. i) Explain the innovations of Fermi architecture in detail.


The Fermi architecture is the most significant leap forward in GPU architecture since the
original G80. G80 was our initial vision of what a unified graphics and computing parallel processor
should look like. GT200 extended the performance and functionality of G80. With Fermi, we have
taken all we have learned from the two prior processors and all the applications that were written for
them, and employed a completely new approach to design to create the world's first computational
GPU. When we started laying the groundwork for Fermi, we gathered extensive user feedback on
GPU computing since the introduction of G80 and GT200, and focused on the following key areas
for improvement:
o Improve Double Precision Performance: while single precision floating point performance was
on the order of ten times the performance of desktop CPUs, some GPU computing applications
desired more double precision performance as well.
o ECC support: ECC allows GPU computing users to safely deploy large numbers of GPUs in
datacenter installations, and also ensures data-sensitive applications like medical imaging and
financial options pricing are protected from memory errors.
o True Cache Hierarchy: some parallel algorithms were unable to use the GPU's shared memory,
and users requested a true cache architecture to aid them.
o More Shared Memory: many CUDA programmers requested more than 16 KB of SM shared
memory to speed up their applications.
o Faster Context Switching: users requested faster context switches between application
programs and faster graphics and compute interoperation.
o Faster Atomic Operations: users requested faster read-modify-write atomic operations for
their parallel algorithms.

With these requests in mind, the Fermi team designed a processor that greatly increases raw
compute horsepower, and through architectural innovations, also offers dramatically increased
programmability and compute efficiency. The key architectural highlights of Fermi are:
Third Generation Streaming Multiprocessor (SM)
o 32 CUDA cores per SM, 4x over GT200
o 8x the peak double precision floating point performance over GT200
o Dual Warp Scheduler simultaneously schedules and dispatches instructions
from two independent warps
o 64 KB of RAM with a configurable partitioning of shared memory and L1 cache
Second Generation Parallel Thread Execution ISA
o Unified Address Space with Full C++ Support
o Optimized for OpenCL and DirectCompute
o Full IEEE 754-2008 32-bit and 64-bit precision
o Full 32-bit integer path with 64-bit extensions
o Memory access instructions to support transition to 64-bit addressing
o Improved Performance through Predication
Improved Memory Subsystem
o NVIDIA Parallel DataCache(TM) hierarchy with Configurable L1 and Unified L2 Caches
o First GPU with ECC memory support
o Greatly improved atomic memory operation performance
NVIDIA GigaThread(TM) Engine
o 10x faster application context switching
o Concurrent kernel execution
o Out of Order thread block execution
o Dual overlapped memory transfer engines

ii). Explain how multiple lanes are used to go beyond one element per clock cycle, and how
loops whose length is not equal to 64 are handled.
Beyond One Element per Clock Cycle
A critical advantage of a vector instruction set is that it allows software to pass a large amount of
parallel work to hardware using only a single short instruction. A single vector instruction can
include scores of independent operations yet be encoded in the same number of bits as a
conventional scalar instruction. The parallel semantics of a vector instruction allow an
implementation to execute these elemental operations using a deeply pipelined functional unit, as in
the VMIPS implementation we've studied so far; an array of parallel functional units; or a
combination of parallel and pipelined functional units. Figure 4.4 illustrates how to improve vector
performance by using parallel pipelines to execute a vector add instruction.

Handling Loops Not Equal to 64


A vector register processor has a natural vector length determined by the number of elements
in each vector register. This length, which is 64 for VMIPS, is unlikely to match the real vector
length in a program. Moreover, in a real program the length of a particular vector operation is often
unknown at compile time. In fact, a single piece of code may require different vector lengths. For
example, consider this code:
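
The code itself is not preserved here. A loop consistent with the surrounding discussion (the DAXPY operation, with a length n known only at run time) might look like the following sketch; the function wrapper and parameter order are assumptions.

```c
/* DAXPY: Y = a * X + Y over n elements. The trip count n is a run-time
   parameter, so the vector length cannot be fixed at compile time. */
void daxpy(int n, double a, const double *X, double *Y)
{
    for (int i = 0; i < n; i = i + 1)
        Y[i] = a * X[i] + Y[i];
}
```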

The size of all the vector operations depends on n, which may not even be known until run
time! The value of n might also be a parameter to a procedure containing the above loop and
therefore subject to change during execution.
The solution to these problems is to create a vector-length register (VLR). The VLR controls
the length of any vector operation, including a vector load or store. The value in the VLR, however,
cannot be greater than the length of the vector registers. This solves our problem as long as the real
length is less than or equal to the maximum vector length (MVL). The MVL determines the number
of data elements in a vector of an architecture. This parameter means the length of vector registers
can grow in later computer generations without changing the instruction set; as we shall see in the
next section, multimedia SIMD extensions have no equivalent of MVL, so they change the
instruction set every time they increase their vector length.
What if the value of n is not known at compile time and thus may be greater than the MVL?
To tackle the second problem where the vector is longer than the maximum length, a technique called
strip mining is used. Strip mining is the generation of code such that each vector operation is done
for a size less than or equal to the MVL. We create one loop that handles any number of iterations
that is a multiple of the MVL and another loop that handles any remaining iterations and must be less
than the MVL. In practice, compilers usually create a single strip-mined loop that is parameterized to
handle both portions by changing the length. We show the strip-mined version of the DAXPY loop in
C:
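
A sketch of the strip-mined DAXPY loop, following the description above: the first strip handles the odd-size piece of n mod MVL elements, and every later strip is a full MVL elements. The function wrapper is an assumption, and the inner loop stands in for a single vector operation of length VL.

```c
#define MVL 64  /* maximum vector length; 64 matches VMIPS */

void daxpy_strip_mined(int n, double a, const double *X, double *Y)
{
    int low = 0;
    int VL = n % MVL;                               /* odd-size piece first */
    for (int j = 0; j <= n / MVL; j = j + 1) {      /* outer loop over strips */
        for (int i = low; i < low + VL; i = i + 1)  /* runs for length VL */
            Y[i] = a * X[i] + Y[i];                 /* main operation */
        low = low + VL;                             /* start of next strip */
        VL = MVL;                                   /* later strips are full length */
    }
}
```

For n = 200, the strips have lengths 8, 64, 64, and 64. On a vector machine, VL would be written to the vector-length register before each strip is executed.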

15.a.i). Explain in detail about Graphics Processing unit.


A GPU offers hundreds of parallel floating-point units, which makes high-performance
computing more accessible. The interest in GPU computing blossomed when this potential was
combined with a programming language that made GPUs easier to program. Hence, many
programmers of scientific and multimedia applications today are pondering whether to use GPUs or
CPUs.
Programming the GPU
CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all
algorithms, but enough to matter. It seems to resonate in some way with the way we think and code,
allowing an easier, more natural expression of parallelism beyond the task level.
The GPU hardware handles parallel execution and thread management; it is not done by
applications or by the operating system. To simplify scheduling by the hardware, CUDA requires that
thread blocks be able to execute independently and in any order. Different thread blocks cannot
communicate directly, although they can coordinate using atomic memory operations in Global
Memory.
As we shall soon see, many GPU hardware concepts are not obvious in CUDA. That is a
good thing from a programmer productivity perspective, but most programmers are using GPUs
instead of CPUs to get performance. Performance programmers must keep the GPU hardware in
mind when writing in CUDA. For reasons explained shortly, they know that they need to keep
groups of 32 threads together in control flow to get the best performance from multithreaded SIMD
Processors, and create many more threads per multithreaded SIMD Processor to hide latency to
DRAM. They also need to keep the data addresses localized in one or a few blocks of memory to get
the expected memory performance.

15.b. i).How will you detect and enhance loop level parallelism?
Loop-level parallelism is normally analyzed at the source level or close to it, while most
analysis of ILP is done once instructions have been generated by the compiler. Loop-level analysis
involves determining what dependences exist among the operands in a loop across the iterations of
that loop.
The analysis of loop-level parallelism focuses on determining whether data accesses in later
iterations are dependent on data values produced in earlier iterations; such dependence is called a
loop-carried dependence. Most of the examples we considered in Section 3.2 have no loop-carried
dependences and, thus, are loop-level parallel. To see that a loop is parallel, let us first look at the
source representation:
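
The source loop is not preserved here. A loop consistent with the discussion that follows (two uses of x[i] within one iteration, and an induction variable i) might look like this sketch; the function wrapper and the scalar name s are assumptions.

```c
/* Adds the scalar s to each of the 1000 elements of x. Each iteration
   reads and writes only its own x[i], so iterations are independent. */
void add_scalar(double *x, double s)
{
    for (int i = 999; i >= 0; i = i - 1)
        x[i] = x[i] + s;
}
```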

In this loop, there is a dependence between the two uses of x[i], but this dependence is within a
single iteration and is not loop carried. There is a dependence between successive uses of i in
different iterations, which is loop carried, but this dependence involves an induction variable and can
be easily recognized and eliminated.
Because finding loop-level parallelism involves recognizing structures such as loops, array
references, and induction variable computations, the compiler can do this analysis more easily at or
near the source level, as opposed to the machine-code level.
Consider the loop :
for(i=0;i<100;i=i+1)
{
A[i+1]=A[i]+C[i]; /*S1*/
B[i+1]=B[i]+A[i+1]; /*S2*/
}
What are the dependences between S1 and S2 in the loop?
Answer
There are two different dependences:
1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1],
which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
2. S2 uses the value, A[i+1], computed by S1 in the same iteration.

These two dependences are different and have different effects. To see how they differ, let's assume
that only one of these dependences exists at a time. Because the dependence of statement S1 is on an
earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations
of this loop to execute in series.
The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if
this were the only dependence, multiple iterations of the loop could execute in parallel, as long as
each pair of statements in an iteration were kept in order. We saw this type of dependence in an example
in Section 3.2, where unrolling was able to expose the parallelism.
It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next
example shows.
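
A sketch of such a loop follows. The specific loop is an assumption chosen for illustration, since the example itself is not preserved here. In the first function, S1 reads B[i], which S2 wrote in the previous iteration, so the dependence is loop carried; but it is not circular, because neither statement depends on itself and S2 does not depend on S1. The second function reorders the statements so that the dependence falls within one iteration.

```c
/* Loop with a loop-carried dependence from S2 (iteration i) to S1
   (iteration i+1) through B. A, C, D have 100 elements; B has 101. */
void loop_carried(double *A, double *B, const double *C, const double *D)
{
    for (int i = 0; i < 100; i = i + 1) {
        A[i] = A[i] + B[i];     /* S1 */
        B[i + 1] = C[i] + D[i]; /* S2 */
    }
}

/* Equivalent loop with no loop-carried dependence: the first S1 and the
   last S2 are peeled off, and each iteration pairs S2 with the S1 that
   uses its result. */
void loop_transformed(double *A, double *B, const double *C, const double *D)
{
    A[0] = A[0] + B[0];
    for (int i = 0; i < 99; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[100] = C[99] + D[99];
}
```

In the transformed version the iterations can be overlapped, provided the two statements inside each iteration are kept in order.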