
Computer Architecture Unit 9

Activity 1:
Find out more about a recent vector thread processor which comes in
two parts: the control processor, known as Rocket, and the vector unit,
known as Hwacha.

9.4 Vector Length and Stride Issues


This section discusses two issues that arise in real programs. The first is
what happens when the vector length in a program is not exactly 64. The
second is how non-adjacent elements of vectors that reside in memory are
handled. We begin with vector length.
9.4.1 Vector length
So far, we have said nothing about the actual vector size. We simply
assumed that the size of the vector register equals the size of the vector it
holds. This is not always true. In particular, there are two cases to consider:
 One in which the vector size is less than the vector register size, and
 The second in which the vector size is larger than the vector register
size.
To be concrete, we assume 64-element vector registers, as offered by the
Cray systems. Let us look at the easier of these two problems first.
Handling smaller vectors: If the vector size is less than 64, we must tell
the system not to operate on all 64 elements of the vector registers. This is
done with the vector length register. The Vector Length (VL) register holds
the appropriate vector length, and all vector operations are performed on the
first VL elements (that is, elements 0 to VL - 1). The following two
instructions load values into the VL register:
VL 1 (VL = 1)
VL Ak (VL = Ak where k ≠ 0)
For instance, if the vector length is 40, the code below adds the two vectors
held in registers V3 and V4:
A1 40 (A1 = 40)
VL A1 (VL = 40)
V2 V3+FV4 (V2 = V3 + V4)

Manipal University of Jaipur B1648 Page No. 206



Since we cannot write
VL 40,
we must use the two-instruction sequence to load 40 into the VL register.
The last instruction specifies floating-point addition of vectors V3 and V4.
Because VL is 40, only the first 40 elements are added. Table 9.1 below shows
a sample of Cray X-MP instructions.
Table 9.1: Sample Cray X-MP Instructions
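
The effect of the VL register described above can be modelled in a few lines of Python. This is only an illustrative sketch, not Cray code: the function name vector_add and the constant VECTOR_REGISTER_SIZE are hypothetical, and leaving the untouched destination elements at 0.0 is a modelling simplification.

```python
VECTOR_REGISTER_SIZE = 64  # elements per vector register, as assumed in the text


def vector_add(v3, v4, vl):
    """Model of V2 = V3 + V4 that operates only on elements 0 .. vl - 1."""
    if not 0 <= vl <= VECTOR_REGISTER_SIZE:
        raise ValueError("VL must lie between 0 and the register size")
    # Modelling simplification: destination elements beyond VL stay at 0.0.
    v2 = [0.0] * VECTOR_REGISTER_SIZE
    for i in range(vl):
        v2[i] = v3[i] + v4[i]
    return v2


v3 = [float(i) for i in range(VECTOR_REGISTER_SIZE)]
v4 = [1.0] * VECTOR_REGISTER_SIZE
v2 = vector_add(v3, v4, vl=40)  # corresponds to A1 = 40, VL = A1, V2 = V3 + V4
```

With vl=40, only elements 0 to 39 of v2 receive a sum; elements 40 to 63 are not written by the addition.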

Handling larger vectors: The VL register handles vectors smaller than the
vector register, but it does not help with larger ones. For instance, if we
have 200-element vectors (i.e., N = 200), how can the vector instructions be
used to add two such vectors? Larger vectors are handled by a method called
strip mining.
In strip mining, the vector is divided into strips of 64 elements. This
leaves at most one odd-size piece, which contains fewer than 64 elements.
The size of this piece is N mod 64. Each strip is loaded into a vector
register and the vector addition instruction is then executed on it. When N
is not a multiple of 64, the number of strips is ⌊N/64⌋ + 1; in general it is
⌈N/64⌉. In our example, the 200 elements are divided into four pieces:
 Three pieces contain 64 elements each.
 One odd piece contains 8 elements.
A loop that iterates four times is then used: VL is set to 8 in one of the
iterations, and to 64 in the remaining three.
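
The strip-mining loop can be sketched as follows. This is a minimal Python model under stated assumptions: the names strip_mined_add and MVL are hypothetical, and handling the odd-size piece in the first iteration is one common choice; the text only requires that one iteration use the shorter length.

```python
MVL = 64  # maximum vector length, i.e. the register size assumed in the text


def strip_mined_add(a, b, mvl=MVL):
    """Add two equal-length vectors in strips of at most mvl elements.

    Returns the result together with the VL value used in each iteration.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("vectors must have the same length")
    result = [0.0] * n
    strip_lengths = []
    start = 0
    # Assumption: the odd-size piece (N mod mvl) is processed first.
    # If N is an exact multiple of mvl, every strip is a full one.
    vl = n % mvl or min(n, mvl)
    while start < n:
        for i in range(start, start + vl):  # one vector operation of length vl
            result[i] = a[i] + b[i]
        strip_lengths.append(vl)
        start += vl
        vl = mvl  # all remaining strips are full-length
    return result, strip_lengths


a = [float(i) for i in range(200)]
b = [1.0] * 200
total, strips = strip_mined_add(a, b)
```

For N = 200 the loop runs four times, once with a strip of 8 elements and three times with strips of 64 elements, matching the example in the text.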
9.4.2 Vector stride
To understand vector stride, we have to know how elements are stored in
memory. Let us look at vectors first. Because vectors are one-dimensional
arrays, storing a vector in memory is straightforward: vector elements are
stored as sequential words in memory. If we wish to fetch 40 elements, we
read 40 contiguous words from memory. Such elements are said to have a
stride of 1, i.e., to reach the next element, we add 1 to the index of the
current element. Note that the distance between consecutive elements is
measured in number of elements and not in bytes.
Non-unit vector strides are required for multidimensional arrays. To see
why, consider two-dimensional matrices. To store a two-dimensional matrix in
memory, we must linearise it. This can be done in one of two ways: row-major
or column-major order. Most languages, FORTRAN being a notable exception,
use row-major order. In this ordering, elements are stored row by row: row 0,
row 1, row 2, and so on. In column-major order, which FORTRAN uses, elements
are stored column by column: column 0, column 1, and so on. For instance,
consider the 4 x 4 matrix below:


Figure 9.3: Memory Layout of Vector A.

Such a matrix is stored in memory as depicted in Figure 9.3. Assuming
row-major order, suppose we want to access all elements of column 0. Clearly,
these elements are not stored next to one another: we have to access
elements 0, 4, 8, and 12 of the memory array.
Since successive elements are separated by 4 elements, we say that the
stride is 4. Vector machines provide load and store instructions that take
the stride into account. Table 9.1 shows that the Cray X-MP supports both
unit and non-unit stride access. For instance, the instruction
Vi, A0, Ak
loads vector register Vi with stride Ak. As unit stride is very common, a
special instruction
Vi, A0,1
is provided. Similar instructions exist for storing vectors in memory.
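
The strided access described above can be illustrated with a small Python model. It is a sketch only: load_strided is a hypothetical helper, not a Cray instruction, and the element labels a00 to a33 are placeholders chosen here for a 4 x 4 matrix stored in row-major order.

```python
ROWS, COLS = 4, 4

# Row-major linearisation: element (r, c) is stored at offset r * COLS + c.
memory = [f"a{r}{c}" for r in range(ROWS) for c in range(COLS)]


def load_strided(mem, base, stride, vl):
    """Model of a strided vector load: vl elements, starting at base,
    with consecutive elements stride positions apart (counted in elements)."""
    return [mem[base + i * stride] for i in range(vl)]


# Column 0 occupies memory elements 0, 4, 8 and 12, so the stride is 4.
column0 = load_strided(memory, base=0, stride=COLS, vl=ROWS)

# A row is contiguous in row-major order, so a unit stride is enough.
row0 = load_strided(memory, base=0, stride=1, vl=COLS)
```

Under column-major storage the roles would be reversed: columns would be read with unit stride and rows with a stride equal to the number of rows.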
Self Assessment Questions
8. Vectors larger than the vector register are handled by a method called
_______________.
9. Vector elements are saved in the form of ______________ in memory.


9.5 Compiler Effectiveness in Vector Processors


Two factors determine whether a program can run successfully in vector
mode. The first is the structure of the program itself: do the loops contain
true data dependences, or can they be restructured so that they have no such
dependences? This factor is influenced by the algorithms chosen and, to some
degree, by how they are coded. The second factor is the capability of the
compiler. Although no compiler can vectorise a loop that has no parallelism
among its iterations, compilers vary greatly in their ability to determine
whether a loop can be vectorised.
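
The difference between a loop with independent iterations and one with a true data dependence can be seen in a small Python sketch. The function names are hypothetical, and "vector style" here simply means that all operands are read before any result is written, which is an approximation of how a vector instruction treats its inputs.

```python
def scalar_independent(a, b):
    """c[i] = a[i] + b[i]: no iteration uses a value produced by another."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def vector_style_independent(a, b):
    """Same computation, with all elements processed as one operation."""
    return [x + y for x, y in zip(a, b)]


def scalar_dependent(a, b):
    """a[i] = a[i-1] + b[i]: each iteration needs the previous result."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]
    return a


def vector_style_dependent(a, b):
    """Naive vectorisation: reads the OLD a[i-1] values, so it is wrong."""
    old = list(a)  # all operands are read before any result is written
    return [old[0]] + [old[i - 1] + b[i] for i in range(1, len(old))]


a = [1, 2, 3, 4]
b = [10, 10, 10, 10]
independent_ok = scalar_independent(a, b) == vector_style_independent(a, b)
dependent_ok = scalar_dependent(a, b) == vector_style_dependent(a, b)
```

The first loop gives the same answer either way, so it can be vectorised; the second does not, because each iteration depends on the value written by the one before it.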
The techniques used for vectorising programs are similar to those used for
uncovering ILP; here we simply review how well these techniques perform. As
an indication of the level of vectorisation that can be achieved in
scientific programs, let us look at the vectorisation levels observed for the
Perfect Club benchmarks. These benchmarks are large, real scientific
applications. Figure 9.4 shows the percentage of operations executed in
vector mode for two versions of the code running on the Cray Y-MP.

Figure 9.4: Level of Vectorisation among the Perfect Club Benchmarks when executed on the Cray Y-MP


The first version is obtained with compiler optimisation alone on the
original code, whereas the second version has been extensively
hand-optimised by a team of Cray Research programmers. The extensive
