
313 Computer Design

Answers to tutorial exercises on slide set #1: introduction

Q1. (Revision) Given the following truth table, draw the corresponding state diagram
and give the logic equations for the corresponding circuit. (The logic equations are
textual descriptions of circuits, and are of the form "signal1 = signal2 & !signal3". "!"
stands for NOT, "&" stands for AND, and "|" stands for OR.)

cur_state   inputs  |  outputs   next_state
   00         0     |    01          01
   00         1     |    00          01
   01         x     |    1x          10
   10         0     |    1x          10
   10         1     |    00          00

Call the signals cs1, cs2, in1, out1, out2, ns1 and ns2. ‘x’ stands for "don't care".

A1. The diagram has three states, labelled 00, 01 and 10. The transitions are read off
directly from the truth table. The method for deriving the logic equations is simple. First
invent one symbol for each row, and derive an equation to make each symbol true only for
the input pattern on the corresponding row:

row1 = !cs1 & !cs2 & !in1

row2 = !cs1 & !cs2 & in1

row3 = !cs1 & cs2

row4 = cs1 & !cs2 & !in1

row5 = cs1 & !cs2 & in1

Then write down the equation for each output by ORing together the symbols for the lines
in which the output is set to true:

out1 = row3 | row4

out2 = row1

ns1 = row3 | row4

ns2 = row1 | row2


You can choose to make the equation for out2 be either "out2 = row1 | row3", or "out2 =
row1 | row4" or "out2 = row1 | row3 | row4" or the equation above, since we don't care
about the value of this bit in rows 3 and 4. The equation above is the most efficient, since it
does not need any additional gates.
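
As a sanity check on the equations above, the following Python sketch evaluates them for every row of the truth table and compares the results, skipping the don't-care entries. The names TRUTH_TABLE, circuit and matches_truth_table are illustrative and not part of the original exercise, and representing "don't care" as None is an assumption made for this sketch.

```python
# Truth table from the question: (cs1, cs2, in1) -> (out1, out2, ns1, ns2).
# None marks a "don't care" entry (an encoding assumed for this sketch).
# The row with in1 = x is expanded into both input values.
TRUTH_TABLE = {
    (0, 0, 0): (0, 1, 0, 1),
    (0, 0, 1): (0, 0, 0, 1),
    (0, 1, 0): (1, None, 1, 0),
    (0, 1, 1): (1, None, 1, 0),
    (1, 0, 0): (1, None, 1, 0),
    (1, 0, 1): (0, 0, 0, 0),
}


def circuit(cs1, cs2, in1):
    """Evaluate the logic equations from the answer for one input pattern."""
    row1 = (not cs1) and (not cs2) and (not in1)
    row2 = (not cs1) and (not cs2) and in1
    row3 = (not cs1) and cs2
    row4 = cs1 and (not cs2) and (not in1)
    out1 = row3 or row4
    out2 = row1
    ns1 = row3 or row4
    ns2 = row1 or row2
    return tuple(int(bool(v)) for v in (out1, out2, ns1, ns2))


def matches_truth_table():
    """Return True if the equations agree with every specified table entry."""
    for inputs, expected in TRUTH_TABLE.items():
        actual = circuit(*inputs)
        for want, got in zip(expected, actual):
            if want is not None and want != got:
                return False
    return True


print(matches_truth_table())
```

The unused state 11 is left out of the table, since the question does not specify it.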

This circuit would in real life be optimised further. You can delete the circuit computing
row5, since row5 is never used, and you could use the same circuit to compute out1 and
ns1, since they always have the same value. In addition, one can share gates between the
circuits computing row1 and row4 (!cs2 & !in1), those computing row2 and row5 (!cs2 &
in1) etc.

There are tools that take the truth table form and produce a set of logic equations
optimised for a given technology (they would make all the optimisations suggested above).
There are also tools that convert the logic equations into a description of where the
transistors and wires should go on the part of the chip that implements the desired circuit.

Q2. Here are the execution times in seconds for the Linpack benchmarks and 10,000
iterations of the Dhrystone benchmark on two VAX models:

model        year   Linpack time   Dhrystone time
VAX-11/780   1978   4.90           5.69
VAX 8550     1987   0.69           0.96

(a) How much faster is the 8550 compared to the 780 using Linpack? How about us-
ing Dhrystone?

(b) What is the performance growth per year between the 780 and the 8550 using
Linpack? How about using Dhrystone?

A2.
(a)

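To answer this question you must make use of the equation

$$\frac{\text{time on Y}}{\text{time on X}} = 1 + \frac{n}{100}$$

where machine X is n% faster than machine Y.

Linpack: $\frac{4.90}{0.69} = 7.10$, hence $(7.10 - 1) \times 100 = 610\%$ faster.

Dhrystone: $\frac{5.69}{0.96} = 5.93$, hence $(5.93 - 1) \times 100 = 493\%$ faster.
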
(b)

Above, we have calculated what the performance increase was over the nine years. The
question here is what it was each year in order to achieve the overall gain. The answer to
the question is basic maths, rather than computer design related theory. Taking the ninth
root (or, equivalently, using logarithms) you arrive at the following:

Linpack: $7.10 = x^{(1987 - 1978)}$, therefore $x = 7.10^{1/9} = 1.24$, that is about 24% per year.

Dhrystone: $5.93 = x^{(1987 - 1978)}$, therefore $x = 5.93^{1/9} = 1.22$, that is about 22% per year.

Linpack is floating-point intensive while Dhrystone is integer intensive.


Since about 1985, improvements have been more rapid, about 60% per year, which leads
to performance doubling about every 18 months.
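
The arithmetic in this answer can be reproduced with a few lines of Python. This is only a sketch: the function names are chosen here for illustration, and the times are the ones quoted in the question.

```python
import math


def speedup(old_time, new_time):
    """Ratio of execution times: how many times faster the new machine is."""
    return old_time / new_time


def percent_faster(ratio):
    """Convert a speed ratio into the 'n% faster' form, n = (ratio - 1) * 100."""
    return (ratio - 1) * 100


def annual_growth(ratio, years):
    """Constant yearly factor x such that x ** years == ratio."""
    return ratio ** (1 / years)


def doubling_time_months(rate_per_year):
    """Months needed to double at a given yearly growth rate (0.60 = 60%)."""
    return 12 * math.log(2) / math.log(1 + rate_per_year)


years = 1987 - 1978
for name, t780, t8550 in [("Linpack", 4.90, 0.69), ("Dhrystone", 5.69, 0.96)]:
    ratio = speedup(t780, t8550)
    print(f"{name}: ratio {ratio:.2f}, {percent_faster(ratio):.0f}% faster, "
          f"growth factor {annual_growth(ratio, years):.2f} per year")

print(f"Doubling time at 60% per year: {doubling_time_months(0.60):.1f} months")
```

The last line checks the closing remark: 60% growth per year corresponds to a doubling time of a little under 18 months.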

Q3. We are considering enhancing a machine by adding a vector mode to it. When a
computation is run in vector mode it is 20 times as fast as the normal mode of exe-
cution. We call the percentage of time currently spent in operations that could be
done in vector mode in the new design the percentage of vectorization.

Draw a graph that plots the speedup against percentage of vectorization. What per-
centage of vectorization is needed to achieve a speedup of 2? What percentage of
vectorization is needed to achieve one half of the maximum speedup attainable from
using vector mode?

A3. To answer this question one does not need to know exactly how vector machines
work. All you need to know is that under the right circumstances they can complete some
tasks faster than their scalar counterparts. How much faster? The speedup is predicted by
Amdahl's law. If x is the fraction of vectorization, and y is the speedup, then

$$y = \frac{1}{1 - x + \frac{x}{20}}$$

and thus

$$y = \frac{20}{20 - 19x}$$

Some points on the graph:

x = 0.1, y = 1.10
x = 0.2, y = 1.23
x = 0.3, y = 1.40
x = 0.4, y = 1.61
x = 0.5, y = 1.90
x = 0.6, y = 2.33
x = 0.7, y = 2.99
x = 0.8, y = 4.17
x = 0.9, y = 6.90
x = 1.0, y = 20.00
You need slightly more than 50% vectorization for a speedup of 2 ($x = 10/19$, about 52.6%).
The maximum speedup is 20, so half of it is 10, for which you need slightly less
than 95% vectorization ($x = 18/19$, about 94.7%). As you can see from the graph, to achieve
considerable speedup in the execution of a program, you must make sure that as much as
possible of the program can be executed using the vector mode. Otherwise the whole
thing is not worth the trouble. Vector machines are not cost effective unless the percentage
of vectorization is high. The problem is that the achievement of this goal usually requires
significant rewriting of the program.
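
The points on the graph and the two thresholds can be reproduced with a short Python sketch of Amdahl's law. The function names and the parameter k (the vector-mode speed factor, 20 in this question) are introduced here for illustration only.

```python
def amdahl_speedup(x, k=20):
    """Overall speedup when a fraction x of the time runs k times faster."""
    return 1 / ((1 - x) + x / k)


def required_fraction(target, k=20):
    """Fraction x needed for a target speedup: solve target = 1/(1 - x + x/k)."""
    return (1 - 1 / target) / (1 - 1 / k)


# Tabulate the curve in steps of 10% vectorization.
for tenths in range(1, 11):
    x = tenths / 10
    print(f"x = {x:.1f}, y = {amdahl_speedup(x):.2f}")

print(f"speedup of 2 needs x = {required_fraction(2):.3f}")
print(f"speedup of 10 needs x = {required_fraction(10):.3f}")
```

Changing k shows how little the required fraction depends on the raw speed of the vector mode.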

Q4. Consider a vector machine as in question 3, but assume that the vector mode is
100 times as fast as the normal (scalar) mode. What percentage of vectorization is
needed to achieve a speedup of 2?

A4. At 50% vectorization, the speedup on this machine is 1.98, so the answer is still
"slightly more than 50%", although "slightly more" now covers fewer percentage points
(about 50.5% is needed, against about 52.6% in question 3).

The implication is that the maximum speedup you can get from vectorization, or any other
technique that speeds up only some parts of programs, is not as important as the fraction
of the program to which the optimization is applicable.
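
The threshold can also be worked out directly from Amdahl's law. As a sketch, write k for the vector-mode speed factor (a symbol introduced here for illustration) and x for the fraction of vectorization, as in question 3. A target speedup of 2 then requires:

$$\frac{1}{1 - x + \frac{x}{k}} = 2 \;\Rightarrow\; x\left(1 - \frac{1}{k}\right) = \frac{1}{2} \;\Rightarrow\; x = \frac{k}{2(k - 1)}$$

For k = 20 this gives x = 20/38, about 0.526, and for k = 100 it gives x = 100/198, about 0.505. As k grows without bound, x approaches 0.5, so even an infinitely fast vector mode needs half of the program to be vectorized.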

Q5. The Sun 4/110 is about five times as fast on most things as the Sun 3/50.
However, the 3/50 has multiply hardware while the 4/110 does not, so multiplication
is four times as fast on the 3/50 as on the 4/110.

Assume that the relative performance of the two machines when the application
program spends none of its time on multiplication is 5 to 1. What is their relative
performance when the application spends 5% of its time on multiplication on the 3/
50? How about 10%, 20%, 50%?

A5. The required formula is a variant of the one associated with Amdahl's law; you can
derive it quite easily by writing down the time required for each part of the program.

If the 3/50 spends a fraction f of its time doing things other than multiplication, then the
time taken by the 4/110 to do the same will be $f/5$. Given that we are measuring time as a
fraction of the total time taken for the application to complete, the remainder of the time,
that is the time doing multiplication, is $1 - f$. During that time the 3/50 is actually four times
faster, hence the time for the 4/110 to do the same is $(1 - f) \times 4$.

The overall time for the 4/110 to complete the same application as the 3/50 is
$(1 - f) \times 4 + \frac{f}{5}$, hence

$$\text{speedup} = \frac{\text{old time}}{\text{new time}} = \frac{1}{(1 - f) \times 4 + \frac{f}{5}}$$

When $f = 1$, the speedup is of course 5 (400% improvement). On the other hand, if $f = 0$,
that is, only multiplication is done, we get speedup = 0.25, or in other words it takes 4 times
longer to do things. This is because, as the question stated, the 4/110 is four times slower
doing multiplication than the 3/50.

To answer the rest of the question we need to remember that f is the fraction of time not
spent on multiplication, whereas the question is posed in terms of the time spent on
multiplication. Hence, if the application spends 5% of its time doing multiplication, then
$f = 0.95$. The rest is simply a matter of substituting into the above equation and
calculating the answers.

When $f = 0.95$, the speedup is

$$\frac{1}{0.05 \times 4 + \frac{0.95}{5}} = \frac{1}{0.2 + 0.19} = \frac{1}{0.39} = 2.56$$

(156% improvement).

When $f = 0.9$, the speedup is

$$\frac{1}{0.1 \times 4 + \frac{0.9}{5}} = \frac{1}{0.4 + 0.18} = \frac{1}{0.58} = 1.72$$

(72% improvement).

When $f = 0.8$, the speedup is

$$\frac{1}{0.2 \times 4 + \frac{0.8}{5}} = \frac{1}{0.8 + 0.16} = \frac{1}{0.96} = 1.04$$

(4% improvement).

When $f = 0.5$, the speedup is

$$\frac{1}{0.5 \times 4 + \frac{0.5}{5}} = \frac{1}{2.0 + 0.10} = \frac{1}{2.10} = 0.48$$

(52% degradation).

You see from the case of f = 0.8 (20% multiplication) that speeding up 80+% of the pro-
gram by a certain factor at the expense of slowing down the remaining 20-% by a similar
factor is not worth it.
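
The four cases can be reproduced with a short Python sketch of the formula derived in this answer. The function name and the parameter names base_ratio and mult_penalty are illustrative, not part of the original exercise.

```python
def relative_speed(mult_fraction, base_ratio=5.0, mult_penalty=4.0):
    """Speed of the 4/110 relative to the 3/50.

    mult_fraction is the fraction of time the 3/50 spends multiplying,
    so f = 1 - mult_fraction is the fraction it spends on everything else.
    """
    f = 1 - mult_fraction
    # Time on the 4/110, with the 3/50's total time normalised to 1.
    new_time = (1 - f) * mult_penalty + f / base_ratio
    return 1 / new_time


for percent in (0, 5, 10, 20, 50, 100):
    ratio = relative_speed(percent / 100)
    print(f"{percent:3d}% multiplication: speedup {ratio:.2f}")
```

The 0% and 100% cases give the two extremes discussed in the answer, 5 and 0.25.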

The lesson from questions 4 and 5 is that the slowest component will usually determine
the overall speed of a system.

We can also calculate the relative performance of the 3/50 in terms of the 4/110. In this
case f is the fraction of its time that the 4/110 spends doing things other than
multiplication. This time the equation is:

$$\text{speedup} = \frac{1}{\frac{1 - f}{4} + 5f}$$

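This gives us a curve which slopes down, and is the inverse of the previously discussed
case, as expected. On the graph below, the processors perform identically where a curve
crosses speedup = 1: at $f = 15/19 \approx 0.79$ for the earlier equation and at
$f = 3/19 \approx 0.16$ for this one. Both values describe the same workload, with f measured
on a different machine in each case.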
[Graph: the two speedup curves, $\text{speedup} = 1/(\frac{1 - f}{4} + 5f)$ and $\text{speedup} = \text{old time}/\text{new time} = 1/((1 - f) \times 4 + \frac{f}{5})$.]
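
One way to locate the break-even workload numerically is to find the value of f at which each speedup equals 1. The Python sketch below does this by bisection. The function names are illustrative, and bisection is simply a convenient choice here, since both equations can also be solved by hand.

```python
def speedup_4_110_vs_3_50(f):
    """Speed of the 4/110 relative to the 3/50; f is measured on the 3/50."""
    return 1 / ((1 - f) * 4 + f / 5)


def speedup_3_50_vs_4_110(f):
    """Speed of the 3/50 relative to the 4/110; f is measured on the 4/110."""
    return 1 / ((1 - f) / 4 + 5 * f)


def break_even(speedup, lo=0.0, hi=1.0, tol=1e-9):
    """Bisect for the f in [lo, hi] at which speedup(f) equals 1."""
    def g(f):
        return speedup(f) - 1

    if g(lo) * g(hi) > 0:
        raise ValueError("no sign change in the interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Keep the half of the interval that still contains the sign change.
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


print(f"f on the 3/50:  {break_even(speedup_4_110_vs_3_50):.4f}")
print(f"f on the 4/110: {break_even(speedup_3_50_vs_4_110):.4f}")
```

The two results differ because f is measured on a different machine in each equation, but they describe the same workload: about 21% of the 3/50's time spent on multiplication.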