

Course Code : MCSE-011
Course Title : Parallel Computing
Assignment Number : MCA(5)/E011/Assign/2021-2022
Maximum Marks : 100
Weightage : 25%
Last Dates for Submission : 31st October, 2021 (For July, 2021 Session)
: 15th April, 2022 (For January, 2022 Session)

QI: Explain how hypernets integrate positive features of hypercube and tree based topologies into one combined
architecture.
Ans. Hypernet is building an application programming interface (API) that takes your code and runs it across a cluster of
internet-connected devices, anything from servers to refrigerators (the "industrial internet"), saving you time and money.
Many of you have been asking how we can make this claim, and what makes Hypernet a new and useful tool. By the end of
this post, you will have a technical understanding of how Hypernet is changing the computing landscape, how you can get
involved, and how you can benefit.

Hypernet aims to solve the problems posed by servers and the cloud. As devices join the network, you don't have to pay for
the cost of those devices. Furthermore, the combined power of those devices can be greater than that of a server system.
Also, Hypernet distributes the computational workload across a cluster of internet-connected devices, starting with the ones
nearest to you (subject to your constraints for processor speed, RAM, etc.). This means you don't pay a hardware premium
and you don't have to worry about having too few computers or too much latency.

If you're excited about the implications of Hypernet technology and are wondering how it works along with more technical
details, stay tuned for our next post. We'll reveal exactly how our algorithms run and how they are different from anything
else available.

A new class of modular networks is proposed for hierarchically constructing massively parallel computer systems for
distributed supercomputing and AI applications. These networks are called hypernets. They are constructed incrementally
with identical cubelets, treelets, or buslets that are well suited for VLSI implementation. Hypernets integrate positive
features of both hypercubes and tree-based topologies, and maintain a constant node degree when the network size
increases. This paper presents the principles of constructing hypernets and analyzes their architectural potentials in terms of
message routing complexity, cost-effective support for global as well as localized communication, I/O capabilities, and fault
tolerance. Several algorithms are mapped onto hypernets to illustrate their ability to support parallel processing in a
hierarchically structured or data-dependent environment. The emulation of hypercube connections using less hardware is
shown. The potential of hypernets for efficient support of connectionist models of computation is also explored.

Hypercube
In computer networking, hypercube networks are a type of network topology used to connect
multiple processors with memory modules and accurately route data. Hypercube networks consist of 2^m nodes, which form
the vertices of squares to create an internetwork connection. A hypercube is basically a multidimensional mesh
network with two nodes in each dimension. Due to similarity, such topologies are usually grouped into a k-ary d-dimensional
mesh topology family, where d represents the number of dimensions and k represents the number of nodes in each
dimension.


[Figure: model of the space-invariant free-space optical five-cube interconnection network architecture. The connections of
two nodes, one from each plane, are shown as an example; the shift rule defines the amount of row-wise and column-wise
shifts to be performed by the RSSM.]

A binary n-cube or hypercube network is a network with 2^n nodes arranged as the vertices of an n-dimensional cube. A
hypercube is simply a generalization of an ordinary cube, the three-dimensional shape which you know. Although you
probably think of a cube as a rectangular prism whose edges are all equal length, that is not the only way to think about it.
To start, a single point can be thought of as a 0-cube. Suppose its label is 0. Now suppose that we replicate this 0-cube,
putting the copy at a distance of one unit away, and connecting the original and the copy by a line segment of length 1. We
will give the duplicate node the label, 1. We extend this idea one step further. We will replicate the 1-cube, putting the copy
parallel to the original at a distance of one unit away in an orthogonal direction, and connect corresponding nodes in the
copy to those in the original. We will use binary numbers to label the nodes, instead of decimal. The nodes in the copy will
be labeled with the same labels as those of the original except for one change: the most significant bit in the original will be
changed from 0 to 1 in the copy. Now we repeat this procedure to create a 3-cube: we replicate the 2-cube, putting the copy
parallel to the original at a distance of 1 unit away in the orthogonal direction, connect nodes in the copy to the
corresponding nodes in the original, and relabel all nodes by adding another significant bit, 0 in the original and 1 in the
copy. It is now not hard to see how we can create hypercubes of arbitrary dimension, though drawing them becomes a bit
cumbersome. The node labels will play an important role in our understanding of the hypercube. Observe that

• The labels of two nodes differ by exactly one bit change if and only if they are connected by an edge.

• In an n-dimensional hypercube, each node label is represented by n bits. Each of these bits can be inverted (0→1 or 1→0),
implying that each node has exactly n incident edges. In the 4D hypercube, for example, each node has 4 neighbors. Thus
the degree of an n-cube is n.


• The diameter of an n-dimensional hypercube is n. To see this, observe that a given integer represented
with n bits can be transformed to any other n-bit integer by changing at most n bits, one bit at a time. This
corresponds to a walk across n edges in a hypercube from the first to the second label.
• The bisection width of an n-dimensional hypercube is 2^(n-1). One way to see this is to realize that all nodes can be
thought of as lying in one of two planes: pick any bit position and call it b. The nodes whose b-bit = 0 are in one
plane, and those whose b-bit = 1 are in the other. To split the network into two sets of nodes, one in each plane, one
has to delete the edges connecting the two planes. Every node in the 0-plane is attached to exactly one node in the
1-plane by one edge. There are 2^(n-1) such pairs of nodes, and hence 2^(n-1) edges. No smaller set of edges can be
cut to split the node set.
• The number of edges in an n-dimensional hypercube is n·2^(n-1). To see this, note that it is true when n = 0, as there
are 0 edges in the 0-cube. Assume it is true for all k < n. A hypercube of dimension n consists of two hypercubes of
dimension n-1 with one edge between each pair of corresponding nodes in the two smaller hypercubes. There are
2^(n-1) such edges. Thus, using the inductive hypothesis, the hypercube of dimension n has
2·(n-1)·2^(n-2) + 2^(n-1) = (n-1)·2^(n-1) + 2^(n-1) = (n-1+1)·2^(n-1) = n·2^(n-1) edges. By the axiom of induction, it is proved.

The bisection width is very high (one half the number of nodes), and the diameter is low. This makes the hypercube an
attractive organization. Its primary drawbacks are that (1) the number of edges per node is a (logarithmic) function of
network size, making it difficult to scale up, and (2) the maximum edge length increases as network size increases.
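
To make these properties concrete, here is a minimal Python sketch (the function names are illustrative, not from any
library) that builds an n-cube using the bit-flip rule above and checks the degree, edge-count, and bisection-width claims:

def hypercube_neighbors(label: int, n: int) -> list[int]:
    # Neighbors of a node differ in exactly one of the n label bits.
    return [label ^ (1 << b) for b in range(n)]

def hypercube_edges(n: int) -> set[frozenset[int]]:
    # All edges of the n-dimensional hypercube on nodes 0 .. 2^n - 1.
    edges = set()
    for node in range(2 ** n):
        for nb in hypercube_neighbors(node, n):
            edges.add(frozenset((node, nb)))
    return edges

n = 4
edges = hypercube_edges(n)
assert len(edges) == n * 2 ** (n - 1)            # edge count is n * 2^(n-1)
assert all(len(hypercube_neighbors(v, n)) == n   # degree of every node is n
           for v in range(2 ** n))
# Bisection: the edges crossing any one bit position b number exactly 2^(n-1).
b = 0
crossing = [e for e in edges if len({(v >> b) & 1 for v in e}) == 2]
assert len(crossing) == 2 ** (n - 1)
print(len(edges), "edges,", len(crossing), "edges across the bisection")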

Tree-based topology


A hypercube interconnection network is formed by connecting N nodes, where N can be expressed as a power of 2. This
means if the network has N nodes, it can be expressed as:
N = 2^m

where m is the number of bits that are required to label the nodes in the network. So, if there are 4 nodes in the network, 2
bits are needed to represent all the nodes in the network. The network is constructed by connecting the nodes that just
differ by one bit in their binary representation. This is commonly referred to as Binary labelling. A 3D hypercube
internetwork would be a cube with 8 nodes and 12 edges. A 4D hypercube network can be created by duplicating
two 3D networks, and adding a most significant bit. The new added bit should be '0' for one 3D hypercube and '1' for the
other 3D hypercube. Corresponding nodes, whose labels differ only in the new most significant bit, are connected to create
the higher hypercube network. This method can be used to construct any m-bit represented hypercube from an
(m-1)-bit represented hypercube.
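
As a sketch of this duplicate-and-prefix construction (the code below is illustrative, not from the source), the edges of an
m-cube can be generated recursively from two copies of the (m-1)-cube:

def build_hypercube(m: int) -> set[tuple[str, str]]:
    # Edges of the m-cube, with nodes named by m-bit binary strings.
    if m == 0:
        return set()                      # a 0-cube is a single point, no edges
    smaller = build_hypercube(m - 1)
    edges = set()
    for u, v in smaller:                  # replicate the (m-1)-cube twice,
        edges.add(("0" + u, "0" + v))     # prefixing '0' in one copy
        edges.add(("1" + u, "1" + v))     # and '1' in the other
    for i in range(2 ** (m - 1)):         # join each pair of corresponding nodes
        label = format(i, f"0{m - 1}b") if m > 1 else ""
        edges.add(("0" + label, "1" + label))
    return edges

print(len(build_hypercube(3)))            # 12 edges, matching the 3D cube above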

Q2: Explain how instruction set, compiler technology, CPU implementation and control, and cache and memory hierarchy
affect the CPU performance and justify the effects in terms of program length, clock rate, and effective CPI.
Ans. There are many factors that affect processor performance. Understanding some of these factors will help you make the
proper choices when designing your homebuilt computer.

The most important factors affecting processor performance are:

Instruction Set:
This is the processor's built-in code that tells it how to execute its duties. It's something that's coded into the chip when it's
manufactured and that you can't change. But together with processor architecture, it does affect performance across a
given line of CPU's. The processor's architecture and instruction set determine how many cycles, or ticks, are needed to
execute a given instruction.

In other words, some instruction sets are more efficient than others, enabling the processor to do more useful work at a
given speed. This is why just looking at the numbers doesn't always tell the whole story of how well a processor will function
in the real world. It's also why when choosing a processor, benchmark tests that measure the chip's abilities to do real-world
work can be very useful.

Compiler technology:
Compiler optimization technology can speed up execution of a program on the same hardware by rewriting the program to
be more efficient. Various compiler analysis and transformations remove unnecessary computations, use more efficient
instructions to perform a computation, replace algorithms with ones more suited to the hardware, etc. Some compilers also
automatically parallelize your program, getting the work done faster by performing multiple parts of the computation
simultaneously. Designing the hardware architecture and compiler technology together can dramatically improve
performance of the system.
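
As a hand-worked illustration of one such transformation (loop-invariant code motion, shown here in Python for
readability; a real compiler applies this automatically to machine-level code):

def scale_before(values, a, b):
    out = []
    for v in values:
        out.append(v * (a * b))   # a * b is recomputed on every iteration
    return out

def scale_after(values, a, b):
    factor = a * b                # hoisted out of the loop: computed once
    out = []
    for v in values:
        out.append(v * factor)    # cheaper loop body, identical result
    return out

assert scale_before([1, 2, 3], 4, 5) == scale_after([1, 2, 3], 4, 5)
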
Like most aspects of computer technology, the performance gained by compiler technology is exponentially increasing over
time, but the exponent isn't very big. For example, performance of supercomputers has been doubling every year, and
Moore's law says the number of transistors on a chip can double every 18-24 months, but doubling performance by compiler
technology alone takes more like 18 years (a number that's poorly documented, but often informally quoted). On the other
hand, unlike hardware, once a compiler technology has been developed, there is no per-machine cost to obtain the better
performance it offers; this makes developing compiler technology a very good long-term investment.

CPU implementation and control:


Hardwired into a CPU's circuitry is a set of basic operations it can perform, called an instruction set. Such operations may
involve, for example, adding or subtracting two numbers, comparing two numbers, or jumping to a different part of a
program. Each instruction is represented by a unique combination of bits, known as the machine language opcode. While
processing an instruction, the CPU decodes the opcode (via a binary decoder) into control signals, which orchestrate the
behavior of the CPU. A complete machine language instruction consists of an opcode and, in many cases, additional bits that
specify arguments for the operation (for example, the numbers to be summed in the case of an addition operation). Going
up the complexity scale, a machine language program is a collection of machine language instructions that the CPU
executes.

The actual mathematical operation for each instruction is performed by a combinational logic circuit within the CPU
known as the arithmetic logic unit or ALU. In general, a CPU executes an instruction by fetching it from memory,
using its ALU to perform an operation, and then storing the result to memory. Besides the instructions for integer
mathematics and logic operations, various other machine instructions exist, such as those for loading data from memory and
storing it back, branching operations, and mathematical operations on floating-point numbers performed by the CPU's
floating-point unit (FPU).

The control unit (CU) is a component of the CPU that directs the operation of the processor. It tells the computer's memory,
arithmetic and logic unit and input and output devices how to respond to the instructions that have been sent to the
processor.

It directs the operation of the other units by providing timing and control signals. Most computer resources are managed by
the CU. It directs the flow of data between the CPU and the other devices. John von Neumann included the control unit as
part of the von Neumann architecture. In modern computer designs, the control unit is typically an internal part of the CPU
with its overall role and operation unchanged since its introduction.
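
The fetch-decode-execute cycle described above can be sketched with a toy, entirely hypothetical instruction set (the
opcodes and the accumulator machine below are invented for illustration, not taken from any real CPU):

# Made-up 2-bit opcodes for a toy accumulator machine.
ADD, SUB, JMP, HALT = 0b00, 0b01, 0b10, 0b11

def run(program):
    acc, pc = 0, 0
    while pc < len(program):
        opcode, arg1, arg2 = program[pc]   # fetch the instruction
        if opcode == ADD:                  # decode the opcode, then execute
            acc = arg1 + arg2
        elif opcode == SUB:
            acc = arg1 - arg2
        elif opcode == JMP:
            pc = arg1                      # transfer control elsewhere
            continue
        elif opcode == HALT:
            break
        pc += 1
    return acc

print(run([(ADD, 2, 3), (HALT, 0, 0)]))    # prints 5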

Cache and memory hierarchy:


Cache memory in computer systems is used to improve system performance. Cache memory operates in the same way as
RAM in that it is volatile. When the system is shut down, the contents of cache memory are cleared. Cache memory allows for
faster access to data for two reasons:
• cache uses Static RAM whereas Main Memory (RAM) uses dynamic RAM
• cache memory stores instructions the processor may require next, which can then be retrieved faster than if they
were held in RAM

The use of static RAM means that the access time is faster when retrieving data from Cache over RAM. Static RAM does not
need to be refreshed in the same way that dynamic RAM does. The process of refreshing RAM means that it takes longer to
retrieve data from main memory.

Cache memory will copy the contents of some data held in RAM. To simplify the process, it works on the understanding that
most programs store data in sequence. If the processor is currently processing data from locations 0-32, cache memory will
copy the contents of locations 33-64 in anticipation that they will be needed next.

When the processor initiates a memory read, it will check cache memory first. When checking it will either encounter a
cache hit or a cache miss. In the case of this example, if the processor was hoping to receive the contents of the data that
had been in RAM location 37 then it would find those contents in cache memory. This is a cache hit.

If it were trying to find the contents of location 112 it would encounter a cache miss, meaning it would attempt to read the
data from RAM after it had unsuccessfully tried cache memory. Cache memory will also store frequently used instructions
that can be accessed faster than they could be if held in main memory (RAM).
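
A minimal sketch of this hit/miss behaviour (the class below is illustrative, not how real cache hardware is organized),
mirroring the locations 33-64 example above:

class SimpleCache:
    def __init__(self, ram):
        self.ram = ram
        self.cached = {}                     # address -> copied value

    def prefetch(self, start, end):
        # Copy RAM locations start..end into the cache in anticipation.
        for addr in range(start, end + 1):
            self.cached[addr] = self.ram[addr]

    def read(self, addr):
        if addr in self.cached:
            return self.cached[addr], "cache hit"
        return self.ram[addr], "cache miss"  # slower path: main memory

ram = list(range(200))                       # toy main memory
cache = SimpleCache(ram)
cache.prefetch(33, 64)                       # anticipate the next accesses
print(cache.read(37))                        # (37, 'cache hit')
print(cache.read(112))                       # (112, 'cache miss')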

Caching Data
While the above example focuses on cache memory, the term 'caching' refers to the use of software to identify commonly
used features that can be allocated space in RAM or cache memory for efficient retrieval. It is common for web servers and
browser software to cache pre-compiled webpages or scripts. If these scripts are identified by software and copied to cache
memory or RAM, they can be retrieved faster than if they had to be loaded into memory again every time they are needed.
CSS files are a good example, where an external style sheet can be cached by browser software so that any subsequent
pages making use of the style sheet can retrieve it from RAM quickly without having to wait for the file to download again.
Caching data in this way leads to better response times.

Similarly, the operating system will create what is known as a disk cache, which can be used to improve access speeds to
previously accessed files. For example, as long as there is enough available space in RAM, a video file that has been opened
and then closed may still be held in the page cache in case you access it again before shutting the computer down. The
second time you access the file it is retrieved from the page cache in RAM rather than from slower backing storage. Caching
data in these ways leads to better response times.

Clock rate:


Most CPUs are synchronous circuits, which means they employ a clock signal to pace their sequential operations. The clock
signal is produced by an external oscillator circuit that generates a consistent number of pulses each second in the form of a
periodic square wave. The frequency of the clock pulses determines the rate at which a CPU executes instructions and,
consequently, the faster the clock, the more instructions the CPU will execute each second.
To ensure proper operation of the CPU, the clock period is longer than the maximum time needed for all signals to
propagate (move) through the CPU. In setting the clock period to a value well above the worst-case propagation delay, it is
possible to design the entire CPU and the way it moves data around the "edges" of the rising and falling clock signal. This has
the advantage of simplifying the CPU significantly, both from a design perspective and a component-count perspective.
However, it also carries the disadvantage that the entire CPU must wait on its slowest elements, even though some portions
of it are much faster. This limitation has largely been compensated for by various methods of increasing CPU parallelism (see
below).
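
As a back-of-the-envelope illustration of this constraint (the delay figure below is assumed purely for illustration): the
clock period must exceed the worst-case propagation delay, so the slowest path through the CPU bounds the clock rate.

worst_case_delay_s = 0.4e-9          # assumed slowest signal path: 0.4 ns
max_clock_hz = 1 / worst_case_delay_s
print(f"max clock rate ~ {max_clock_hz / 1e9:.1f} GHz")   # ~2.5 GHz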

However, architectural improvements alone do not solve all of the drawbacks of globally synchronous CPUs. For example, a
clock signal is subject to the delays of any other electrical signal. Higher clock rates in increasingly complex CPUs make it
more difficult to keep the clock signal in phase (synchronized) throughout the entire unit. This has led many modern CPUs to
require multiple identical clock signals to be provided to avoid delaying a single signal significantly enough to cause the CPU
to malfunction. Another major issue, as clock rates increase dramatically, is the amount of heat that is dissipated by the CPU.
The constantly changing clock causes many components to switch regardless of whether they are being used at that time. In
general, a component that is switching uses more energy than an element in a static state. Therefore, as clock rate
increases, so does energy consumption, causing the CPU to require more heat dissipation in the form of CPU cooling
solutions.

One method of dealing with the switching of unneeded components is called clock gating, which involves turning off the
clock signal to unneeded components (effectively disabling them). However, this is often regarded as difficult to implement
and therefore does not see common usage outside of very low-power designs. One notable recent CPU design that uses
extensive clock gating is the IBM PowerPC-based Xenon used in the Xbox 360; that way, power requirements of the Xbox
360 are greatly reduced. Another method of addressing some of the problems with a global clock signal is the removal of the
clock signal altogether. While removing the global clock signal makes the design process considerably more complex in many
ways, asynchronous (or clockless) designs carry marked advantages in power consumption and heat dissipation in
comparison with similar synchronous designs. While somewhat uncommon, entire asynchronous CPUs have been built
without using a global clock signal. Two notable examples of this are the ARM-compliant AMULET and the
MIPS R3000-compatible MiniMIPS.

Rather than totally removing the clock signal, some CPU designs allow certain portions of the device to be asynchronous,
such as using asynchronous ALUs in conjunction with superscalar pipelining to achieve some arithmetic performance gains.
While it is not altogether clear whether totally asynchronous designs can perform at a comparable or better level than their
synchronous counterparts, it is evident that they do at least excel in simpler math operations. This, combined with their
excellent power consumption and heat dissipation properties, makes them very suitable for embedded computers.

Clocks Per Instruction:
Clocks per instruction (CPI) is an effective average. It is averaged over all of the instruction executions in a program.

CPI is affected by instruction-level parallelism and by instruction complexity. Without instruction-level parallelism, simple
instructions usually take 4 or more cycles to execute. Instructions that execute loops take at least one clock per loop
iteration. Pipelining (overlapping execution of instructions) can bring the average for simple instructions down to near 1
clock per instruction. Superscalar pipelining (issuing multiple instructions per cycle) can bring the average down to a fraction
of a clock per instruction.
For computing clocks per instruction as an effective average, the cases are categories of instructions, such as branches,
loads, and stores. Frequencies for the categories can be extracted from execution traces. Knowledge of how the architecture
handles each category yields the clocks per instruction for that category.
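
The following worked sketch computes an effective CPI as a weighted average over instruction categories, and then the
resulting CPU time from program length and clock rate; the frequencies and per-category cycle counts are illustrative
values, as if extracted from an execution trace:

# Fraction of executed instructions and cycles per instruction, per category.
categories = {
    "alu":    (0.50, 1),
    "load":   (0.20, 2),
    "store":  (0.10, 2),
    "branch": (0.20, 3),
}

effective_cpi = sum(freq * cycles for freq, cycles in categories.values())
print(f"effective CPI = {effective_cpi:.2f}")      # 1.70 for these weights

# CPU time = instruction count x effective CPI x clock period.
instructions = 2_000_000
clock_hz = 2e9                                     # 2 GHz clock, for example
seconds = instructions * effective_cpi / clock_hz
print(f"CPU time = {seconds * 1e3:.2f} ms")        # 1.70 ms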
