
Multiple Output Complex Instruction Matching and

Instruction Selection for VLIW Architectures

A Thesis
submitted in partial fulfillment
of the requirements for the degrees
of
Bachelor of Technology
and
Master of Technology
by
Puneet Goyal

Under the guidance of
Prof. Anshul Kumar
Department of Computer Science and Engineering
Indian Institute of Technology Delhi

Department of Computer Science and Engineering


Indian Institute of Technology Delhi
June 2006
Certificate

This is to certify that the thesis titled “Multiple-Output Complex Instruction Matching and In-
struction Selection for VLIW Architectures” submitted by Puneet Goyal for the award of Bachelor
of Technology and Master of Technology is a record of bona fide work carried out by him under my
guidance and supervision at the Department of Computer Science & Engineering, Indian Institute
of Technology Delhi. The work presented in this thesis has not been submitted elsewhere, either
in part or full, for the award of any other degree or diploma.

Prof. Anshul Kumar


Department of Computer Science and Engineering
Indian Institute of Technology Delhi
Acknowledgements

I express my deepest gratitude to all the people who have supported me and encouraged me
during the course of this project. I am thankful to Prof. Anshul Kumar for extending his guidance,
support and invaluable time without which this project could not have been accomplished. He
encouraged me to think in a scientific and methodical manner.
I am thankful to Prof. M. Balakrishnan, Dr. Preeti Ranjan Panda and Dr. Kolin Paul for their
vision and encouragement. As a member of Embedded Systems Group I have learned a lot from
the guidance of our teachers and fellow students. I would also like to thank Nagaraju Potheneni
and Neeraj Goel for providing invaluable insights into the subject wherever possible.

Puneet Goyal
Abstract

Application-specific extensions to the computational capabilities of a processor provide an efficient


mechanism to meet the growing performance and power demands of embedded applications. Hard-
ware, in the form of new Functional Units (or Coprocessors), and the corresponding instructions
are added to a baseline processor to meet the critical computational demands of a target appli-
cation. In both generating the complex instructions (or ISEs) and during code generation on this
processor with an extended Instruction Set Architecture (ISA), one of the challenging tasks for the
compiler is to perform fast and efficient instruction matching and selection. In this project, we
developed a novel and efficient algorithm for matching the multiple-output complex Functional
Units (FUs) and also implemented an architecture-driven algorithm for instruction-set selection,
which can effectively utilize the ILP processor and complex FUs for minimizing the execution time
of the given application. Using the Trimaran Compiler Infrastructure, we also experimentally prove
that instruction-set customization is an effective way to improve the performance of VLIW archi-
tectures. A set of instructions is also generalized to enable the application-specific hardware
to be used more effectively across a domain.
Contents

List of Figures ii

List of Tables iii

List of Algorithms iv

1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Report Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Literature Review and Motivation 5


2.1 Instruction Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Algorithm for Instruction Matching and Selection 13


3.1 Instruction Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 Two algorithms for partial match identification . . . . . . . . . . . . . . . . 16
3.2 Instruction Selection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Evaluating the matches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Sorting the graph outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.3 Choosing a cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Evaluation and Validation Framework 23


4.1 MachSuif . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Trimaran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Framework used for validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5 Analysis and Results 27


5.1 Case Study: adpcmdecode bench . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6 Conclusion and Future Work 35


6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Bibliography 37

List of Figures

1.1 Gsub Subject Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


1.2 Set of Complex Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 Automatic instruction mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 7


2.2 Cover selected when purpose is to minimize number of patterns used to cover . . . 8
2.3 Cover selected when purpose is to minimize the data ready time . . . . . . . . . . 9
2.4 Cover selected when instruction level parallelism information is not taken into con-
sideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Cover selected when instruction level parallelism information is taken into consider-
ation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1 Phases of the matching process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


3.2 An example Gsub (a) and Gpat (b) . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 An example illustrating the Covering Algorithm [4] . . . . . . . . . . . . . . . . . . 20
3.4 DFG of a CFU which matches with a subgraph in adpcmdecode . . . . . . . . . . 22

4.1 The Trimaran Compiler Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . 24


4.2 Framework used for validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.1 An example pattern library CFU-lib . . . . . . . . . . . . . . . . . . . . . . . . . 28


5.2 DFG of basic block 3 in bubblesort . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3 DFG of basic block 5 in histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.4 DFG of basic block 25 in newlife . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 CFU1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

List of Tables

5.1 Matches found . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


5.2 Results: Effects of Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3 Ratio of Experimental to Theoretical number of comparisons in Eligibility check . 29
5.4 Algorithm 1 (3.4) vs. Algorithm 2 (3.5) . . . . . . . . . . . . . . . . . . . . . . . 30
5.5 Predicated AdpcmDecode Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6 Predicated AdpcmDecode Results with Varying Number of Integer ALUs . . . . . 34

List of Algorithms

3.1 Reachability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 find_match(Gsub, Gpat) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Instruction Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Partial Matching Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Partial matching algorithm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 Choosing a cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.7 CoverOutputs Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Chapter 1

Introduction

In recent years, the markets for cellular phones, digital cameras, network routers, and other high
performance but special purpose devices have grown explosively. In these systems, application-
specific hardware design is used to meet the challenging cost, performance, and power demands. The
most popular strategy is to build a system consisting of a number of highly specialized application
specific integrated circuits (ASICs) coupled with a low cost core processor, such as an ARM [16].
The ASICs are specially designed hardware accelerators to execute the computationally demanding
portions of the application that would run too slowly if implemented on the core processor. While
this approach is effective, ASICs are costly to design and offer only a hardwired solution that
permits almost no postprogrammability.
An alternative design strategy is to augment the core processor with special-purpose hardware
to increase its computational capabilities in a cost-effective manner. The instruction set of the
core processor is extended to feature an additional set of operations. Hardware support is added to
execute these operations in the form of new function units or coprocessor subsystems. The Tensilica
Xtensa is an example commercial effort in this area [7].
In order to deliver high performance, an ASIP must exploit the instruction level parallelism
(ILP) available in the given application. This brings up superscalar and VLIW as the two archi-
tectural choices. Though for general purpose computing superscalar is the architecture of choice
due to code compatibility issues, for the embedded systems and ASIP domain, VLIW architecture
appears more attractive as it offers a better possibility of customization [17].
We consider a VLIW architecture which consists of a core set of FUs augmented with application-specific
coarse-grain FUs. Specializing or customizing FUs for operations (or groups of operations)
occurring in a given application can potentially lead to high performance gains because of the
following considerations [8, 11]:
• If the operands of an operation have a limited resolution (bit width), FU hardware can be
simplified and made faster.
• By chaining a sequence of operations in an FU, the computation time can be reduced.
• Concurrent operations within a group can be parallelized more easily than operations spread
across separate FUs.
• By mapping a group of operations to an FU, access to the register file for the intermediate results
is avoided. This reduction in register pressure has a beneficial effect on performance.


• Also, the computationally intensive portions of applications from the same domain (e.g.,
encryption) are often similar in structure. Thus, the customized instructions can often be
generalized in small ways to make them applicable across a set of applications.
Hence, considerable effort has been put into this area. However, matching multiple-output complex
instructions, and selecting among them for better performance on VLIW architectures while
respecting architectural constraints (such as limited issue width and a limited number of register
ports), is still largely unexplored. The present work focuses on this problem.
This project is part of the automation framework being developed in the Embedded Systems Group
at IIT Delhi. For a given application, once the custom instructions (or complex functional units)
are identified, we want:
• to develop techniques for effectively matching these complex functional units in the data-flow
segments of the control-data flow graph (CDFG), and

• to construct data-paths that best exploit the given library (which contains not only basic
elements such as adders and multipliers but also complex functional units), taking into
consideration the architectural constraints of VLIW processors.

1.1 Problem Definition


Let Gsub be a data flow graph (DFG) within the control flow graph of a C application, CFU-lib
be a library of Complex Functional Units (possibly multiple-output), and Gpat_i be the data flow
graph corresponding to the i-th customized instruction (or CFU) in CFU-lib.
Instruction Matching: Given Gsub and CFU-lib, find, for each Gpat_i in CFU-lib, all data flow
segments of Gsub that match Gpat_i.

Instruction Selection: Given the subject graph Gsub with full matches found for all CFUs, find
the subset of full matches that, when implemented, minimizes the data-ready time of the slowest
subject graph output on the VLIW processor.
Fig 1.1 shows an example Gsub and Fig 1.2 a set of complex instructions. In Fig 1.2, the number
of cycles corresponds to the latency of the CFU.
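For concreteness, the following is a minimal sketch of one possible in-memory representation of these graphs; all type and field names are illustrative assumptions of this report and are not taken from MachSuif or Trimaran.

    #include <string>
    #include <vector>

    // One operation node of a data flow graph (DFG): an opcode plus data-flow edges.
    struct DFGNode {
        int id;                    // node IDs follow a topological order (i < j whenever i feeds j)
        std::string opcode;        // e.g. "ADD", "AND", "SHL"
        std::vector<int> preds;    // data-flow predecessors (operand producers)
        std::vector<int> succs;    // data-flow successors (operand consumers)
    };

    // A subject graph Gsub (one basic block) or a pattern graph Gpat_i (one CFU).
    struct DFG {
        std::vector<DFGNode> nodes;
        std::vector<int> outputs;  // nodes whose results are used outside the graph
    };

    // The pattern library CFU-lib: one pattern DFG per customized instruction,
    // together with its latency in cycles (cf. Fig 1.2).
    struct CFULib {
        std::vector<DFG> patterns;
        std::vector<int> latency_cycles;
    };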

1.2 Report Organization


In this report we present a novel instruction matching algorithm and implement an existing
instruction selection algorithm (with some modifications) in Trimaran, the compiler infrastructure
for VLIW processors. Chapter 2 discusses existing instruction matching and selection algorithms
and their limitations; the motivating examples are also presented in this chapter. Chapter 3
explains the proposed instruction matching algorithm, and also discusses the instruction selection
algorithm and the modifications introduced in it. Chapter 4 presents the implementation issues
and gives an overview of the methodology adopted. Chapter 5 analyzes the efficiency of the
instruction matching algorithm and evaluates the performance gained by extending the compiler
infrastructure with customized instructions. Chapter 6 summarizes the work and discusses possible
future directions.


Figure 1.1: Gsub Subject Graph


Figure 1.2: Set of Complex Instructions

Chapter 2

Literature Review and Motivation

Matching and covering algorithms are well-known in the fields of code generation and logic syn-
thesis. Keutzer [6] was the first to recognize the similarity between the software compiler’s task of
generating code and the technology mapping problem in automated VLSI design. Both problems
can be handled with a matching algorithm, to find all possible instantiations of patterns (instruc-
tions or standard cells), followed by a covering algorithm to make a selection of matches that
optimizes some criterion (execution time, code size, VLSI area or latency, etc.).
For code generation, we need to make the selection from the given set of instructions (including
the complex ones) in a way that effectively utilizes the VLIW processor and the Application
Specific Functional Units (AFUs). The goal is to minimize the schedule length. The complexity
arises from the fact that we want the covering to be architecture driven, and that the I/O time-shape
of an AFU can be distributed, which allows the resources (functional units and registers) to be
used efficiently and the execution time to be reduced. Current methods generally aim to find,
within each basic block, the optimal cover that minimizes the number of selected matches. Fewer
matches translate to fewer operations in the schedule, and it is expected that the increased scheduling
freedom leads to a better (shorter) schedule. We will illustrate that this assumption, which
underlies most current methods, does not always hold.

2.1 Instruction Matching


As described in [10], there are two main approaches to handling the matching problem when
performing technology mapping: the Boolean and the structural approach.
The Boolean approach can only be applied to networks of Boolean functions, since it operates
by checking the equivalence of functional representations of the patterns and functions representing
portions of the network. This equivalence is detected using (Ordered) Binary Decision Diagrams,
which makes this technique unsuitable for networks of non-Boolean functions.
Structural matching will work on networks containing nodes of any type of function. Since this
approach focuses on the identification of common patterns, the subject and pattern graphs have
to be written in terms of the same types of function nodes. Most approaches [6, 14] require all
graphs to be trees (single-output, acyclic, non-reconvergent graphs). It is therefore necessary to
decompose the subject graph into a set of disjoint trees. For solving this decomposition problem,
only heuristic methods exist.


In a more recent work by Kukimoto et al. [9], a structural matching method was introduced
that can handle DAG-shaped subject graphs, allowing reconvergence within the graph. However,
patterns are still restricted to single output, tree-shaped graphs. When this method is used for
technology mapping, rather than tree-mapping, faster solutions are found (up to 67% faster), at
the cost of a significant increase of area (up to 126% larger). This area increase is due to the fact
that the DAG mapping approach freely duplicates subject nodes whenever a pattern match covers
a multiple-fanout point. This may not be a problem in VLSI synthesis, where extra cells are cheap
in terms of chip area (although it may be desirable to have some control over the node duplication
aggressiveness). In code generation, however, node duplication leads to extra instructions. On a
machine with limited execution resources this may cause the schedule length to increase and hurt
the execution speed of the code rather than helping it.
None of the previously described matching algorithms allow pattern graphs to have more than
one output. Peymandoust et al. [12] employ symbolic algebra using commercial symbolic computer
algebra systems like Maple and Mathematica, which can perform matching even for multiple-output
pattern graphs. However, this works only for arithmetic data flow segments, making the approach
infeasible for most embedded applications. The matching algorithm proposed by Corporaal et
al. [5] does not have this restriction, making it possible to exploit a larger family of pattern graphs.
They perform a detailed search space exploration. We will show experimentally that our instruction
matching algorithm is far more efficient in comparison.

Instruction Selection

Instruction selection can be formulated as directed acyclic graph (DAG) covering. Conventional
methods [3] for instruction selection use heuristics that break up the DAG into forest of trees,
which are then covered optimally but independently. Independent covering of the trees may result
in a suboptimal solution for the original DAG. Moreover, trees, as a heuristic formulation, inherently
preclude the use of complex instructions in cases where internal nodes are shared.
Paolo Ienne et al. [8] make use of symbolic algebra tools for instruction selection. Polynomial
representations of the basic blocks of the software application and of the new instruction set of
the ASIP are available to their symbolic mapping algorithm. As opposed to tree-covering based
algorithms, in their approach mapping is performed simultaneously with algebraic manipulations.
But, as mentioned earlier, this is restricted to arithmetic data flow segments only.
Alternatively, the DAG covering problem can be formulated as a binate covering problem and
solved exactly or heuristically using branch-and-bound methods (Liao et al. [15]). Kastner et al.
[13] proposed a greedy approach for solving this problem, but they aim to find the optimal cover
within each basic block that minimizes the number of selected matches. Arnold et al. [5] propose
a covering algorithm that directly aims at reducing the data-ready time of the slowest subject
graph output, but it does not handle the overlapping cases efficiently. In this project, this
algorithm is improved to handle the overlapping cases and is successfully implemented in the
Trimaran framework.


Figure 2.1: Automatic instruction mapping

2.2 Motivating Examples


Example 2.1. This example illustrates that the common assumption (a heuristic used by many
current methods) that using fewer patterns leads to a shorter schedule may not always hold true.

In Fig 2.2, the number of FUs used is 4, but the schedule takes 9 cycles.

In Fig 2.3, the number of FUs used is 5, but it takes only 7 cycles.

Example 2.2. This example illustrates that the covering can be improved if the available instruction
level parallelism is known and this information is used while covering.

In Fig 2.4, it takes 11 cycles to complete the given task.


So, if we know that the instruction level parallelism is 2 and we make use of this information
while covering, then we can complete the given task within 8 cycles, which otherwise takes 11
cycles.
The above two examples motivate us to come up with a covering scheme which aims at minimizing
the execution time (rather than minimizing the number of FUs used to cover) and which
can take architectural constraints into consideration.


Figure 2.2: Cover selected when purpose is to minimize number of patterns used to cover


Figure 2.3: Cover selected when purpose is to minimize the data ready time


Figure 2.4: Cover selected when instruction level parallelism information is not taken into
consideration


Figure 2.5: Cover selected when instruction level parallelism information is taken into con-
sideration

Chapter 3

Algorithm for Instruction Matching


and Selection

Let Gsub be a data flow graph (DFG) within the control flow graph of a C application, CFU-lib
be a library of Complex Functional Units (possibly multiple-output), and Gpat_i be the data flow
graph corresponding to the i-th customized instruction (or CFU) in CFU-lib.
Given Gsub and CFU-lib, we try to find all matches between the pattern graphs Gpat (from the
pattern library CFU-lib) and subgraphs of Gsub; the strategy for this is described in Section 3.1.
After all matches are found, we try to find the best cover, or selection of matches, that is, the
set of matches that, when implemented, minimizes the data-ready time of the longest path through
the subject graph. The covering approach is described in Section 3.2.
Note: In Fig 3.2, which shows an example Gsub (a) and Gpat (b), each node corresponds to an
operation and each edge to a data-flow edge.

3.1 Instruction Matching


First we define a few terms.
Primary nodes. These are the nodes of Gpat all of whose input operands belong to the set of
source operands of the customized instruction corresponding to that Gpat. For example, in
Fig 3.2(b), p1 and p2 are primary nodes. Let Gpat have p primary nodes; for the Gpat in
Fig 3.2(b), p = 2.
Reachability matrix. From the adjacency matrix of a directed graph, we compute the Reachability
matrix, in which an element R[i][j] is 1 if there exists a path from node i to node j;
Reachability[i][i] is taken to be 1. In our case the graph is a DFG of a C application, with
node IDs assigned such that if an element A[i][j] of the adjacency matrix is 1, then i < j
(a topological order). This property helps us compute the Reachability matrix efficiently,
although the worst-case complexity is still O(n^3): the Floyd-Warshall algorithm performs
n^3 computations, whereas our algorithm performs about n^3/6. See Alg 3.1.
CommonSink matrix. CommonSink[i][j] = 1 if there exists a k such that Reachability[i][k] = 1
and Reachability[j][k] = 1. We attach a degree attribute (> 0) to this CommonSink matrix.

compute_reachability() {
    // Node IDs follow a topological order: Adj_matrix[i][j] == 1 implies i < j.
    // A nonzero Reachability entry means "reachable".
    for (i = 0; i < num_nodes; i++)
        Reachability[i][i] = 1
    for (j = 1; j < num_nodes; j++)
        for (i = j - 1; i >= 0; i--) {
            Reachability[i][j] = Adj_matrix[i][j]
            if (Reachability[i][j] == 0)
                // i reaches j iff i has an edge to some k (i < k <= j) that reaches j;
                // Reachability[k][j] has already been computed for every k > i.
                for (k = j; k > i; k--)
                    Reachability[i][j] += (Adj_matrix[i][k] * Reachability[k][j])
        }
}

Algorithm 3.1: Reachability

The degree-d CommonSink matrix is defined as:
CommonSink_degree_1[ ][ ] = CommonSink[ ][ ],
CommonSink_degree_d[ ][ ] = (CommonSink[ ][ ])^d.
For example, in Fig 5.5, CommonSink_degree_2[p1][p3] = 1.
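As an illustration, a minimal sketch of how the CommonSink matrix and its degree-d variant can be computed from the Reachability matrix; the function names and the use of plain integer matrices are assumptions of this sketch, not of the actual implementation.

    #include <vector>
    using Matrix = std::vector<std::vector<int>>;

    // CommonSink[i][j] = 1 iff some node k is reachable from both i and j.
    Matrix common_sink(const Matrix& R) {
        int n = (int)R.size();
        Matrix CS(n, std::vector<int>(n, 0));
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    if (R[i][k] && R[j][k]) { CS[i][j] = 1; break; }
        return CS;
    }

    // Boolean matrix product, used to raise CommonSink to a given degree.
    Matrix bool_mult(const Matrix& A, const Matrix& B) {
        int n = (int)A.size();
        Matrix C(n, std::vector<int>(n, 0));
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                if (A[i][k])
                    for (int j = 0; j < n; ++j)
                        if (B[k][j]) C[i][j] = 1;
        return C;
    }

    // CommonSink_degree_d = (CommonSink)^d, interpreted as a boolean matrix power.
    Matrix common_sink_degree(const Matrix& CS, int d) {
        Matrix P = CS;
        for (int deg = 2; deg <= d; ++deg) P = bool_mult(P, CS);
        return P;
    }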

Below is an informal description of the instruction matching algorithm find_match(Gsub, Gpat).
We begin by identifying the matches for primary nodes. If {y1, y2, y3, ..., yp} is a set of p nodes
(operations) of Gsub that matches the p primary nodes {x1, x2, x3, ..., xp} of Gpat, then we call
this set {yi} a partial match. We first identify all the partial matches for the given Gpat and
Gsub. Each partial match then needs to be expanded along the DFG edges to validate it. But
before doing this traversal along the DFG edges, we first apply some heuristics/constraints that
must be satisfied for the partial match to be an eligible candidate for validation. We check that
no two nodes within the partial match are identical or have a dataflow edge between them. We
also check the outdegree constraint for each node in the partial match. We can further employ
certain heuristics such as CommonSink (i.e., every pair of nodes in a partial match must reach
some common node, the common sink), depending on the Gpat graph. For example, in the Gpat
shown in Fig 3.2(b), the nodes p1 and p2 have a common sink. The checks for the outdegree
constraint and the CommonSink heuristic require us to compute the adjacency matrix, the
Reachability matrix and the CommonSink matrix. As this is to be done only once for each basic
block, we can afford to compute these matrices; they cut off the traversal along the DFG edges
in many cases, effectively pruning the search space and reducing the complexity. The degree
associated with CommonSink is a heuristic parameter and is selected depending upon the size
and topology of Gpat. Generally, even with considerably large patterns, we can effectively prune
the search space using degree 2 in the CommonSink matrix computation.

Heuristics/Constraints
1. No data dependency among the nodes (y1 , y2 , . . . , yp ) matching with primary input nodes
(x1 , x2 , . . . , xp ).


find_match(Gsub, Gpat) {
    for i = 1 to p
        find the nodes in Gsub matching with primary input node x_i
        // if primary node x_i matches with k_i nodes in Gsub,
        // then num_partial_matches = k_1 * k_2 * ... * k_p  (partial matches)
    for i = 1 to num_partial_matches {
        // prune the search space by applying the heuristics:
        // check if partial_match_i is an eligible candidate
        if (partial_match.is_eligible) num_eligible_candidates++
    }
    for each eligible candidate {
        // validate this eligible candidate for a full match by traversing
        // along the data-flow edges
        if (eligible_candidate_match.is_valid) {
            n_valid_matches++
            // update the match vector list
        }
    }
}

Algorithm 3.2: find_match(Gsub, Gpat)

Figure 3.1: Phases of the matching process

2. yi = yj only if i = j.

3. yi and yj must have some common descendant. This is checked for every pair (i, j) of
primary-node indices using the CommonSink matrix. This is a heuristic.

4. If xi is an output node, then outdegree(yi) ≥ outdegree(xi); otherwise outdegree(yi) and
outdegree(xi) must be equal.
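As an illustration, a minimal sketch of the eligibility check implied by constraints 1–4 above; the names (is_eligible, CSdeg, the outdegree arrays) and the exact form of the relaxed outdegree test for output nodes are assumptions of this sketch.

    #include <vector>
    using Matrix = std::vector<std::vector<int>>;

    // partial[i] is the Gsub node y_i matched to the i-th primary pattern node x_i.
    // Adj and CSdeg are the adjacency and CommonSink^degree matrices of Gsub.
    bool is_eligible(const std::vector<int>& partial,
                     const Matrix& Adj, const Matrix& CSdeg,
                     const std::vector<int>& outdeg_sub,
                     const std::vector<int>& outdeg_pat,
                     const std::vector<bool>& is_output_pat) {
        int p = (int)partial.size();
        for (int i = 0; i < p; ++i) {
            int yi = partial[i];
            // Constraint 4: outdegree check, relaxed for pattern output nodes.
            if (is_output_pat[i] ? outdeg_sub[yi] < outdeg_pat[i]
                                 : outdeg_sub[yi] != outdeg_pat[i])
                return false;
            for (int j = i + 1; j < p; ++j) {
                int yj = partial[j];
                if (yi == yj) return false;                    // constraint 2
                if (Adj[yi][yj] || Adj[yj][yi]) return false;  // constraint 1
                if (!CSdeg[yi][yj]) return false;              // constraint 3 (heuristic)
            }
        }
        return true;
    }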

match_validate(Gpat, Gsub, partial_match)

This function is called only for the eligible candidates amongst the partial matches identified. Here
we visit each non-primary node j of Gpat; the four possible cases are:


1. Node j has two operands, both internal. For example, node p4 in Fig 3.2(b).
2. Node j has two operands, one internal and one constant.
3. Node j has two operands, one external and one internal. For this case, we also need to check
the convexity constraint. For example, node p3 in Fig 3.2(b).
4. Node j has one operand, which is necessarily internal (otherwise j would be a primary node).

(a) (b)

Figure 3.2: An example Gsub (a) and Gpat (b)

In each case, we check whether the Gsub nodes matched with the predecessors of node j have a
common successor that can be matched with node j (we also check the outdegree and convexity
constraints here). If so, we visit the next non-primary node, until all the nodes in Gpat are
successfully matched or the eligible partial match gets invalidated.
Alg 3.3 identifies all the matches and completes our instruction matching algorithm.
For example, in Fig 3.2, Gpat is a subgraph of Gsub. Matched nodes are marked with the same
colors. p1 and p2 are the primary input nodes of Gpat. One of the partial matches for them is
{n1, n2}, and this partial match passes the eligibility criteria. After traversing Gpat, we find that
p3 and p4 match with n4 and n5, respectively. Note that if p3 were not an output node, the match
would have failed, because the outdegree constraint at node n4 would not be satisfied.
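To make the traversal concrete, a minimal sketch of the validation step, assuming pattern nodes are visited in topological order; the outdegree and convexity checks are omitted here, and all parameter names are illustrative.

    #include <algorithm>
    #include <map>
    #include <vector>

    // succ_sub[n], opcode_sub[n]: successor list and opcode of Gsub node n.
    // topo_pat: pattern nodes in topological order; primary_pat[j]: is j primary;
    // int_preds_pat[j]: internal predecessors of pattern node j (non-empty for
    // non-primary nodes); opcode_pat[j]: its opcode.
    bool match_validate(const std::vector<std::vector<int>>& succ_sub,
                        const std::vector<int>& opcode_sub,
                        const std::vector<int>& topo_pat,
                        const std::vector<bool>& primary_pat,
                        const std::vector<std::vector<int>>& int_preds_pat,
                        const std::vector<int>& opcode_pat,
                        std::map<int, int>& mapping /* pattern node -> Gsub node */) {
        for (int j : topo_pat) {
            if (primary_pat[j]) continue;
            // Candidates for j: successors (with matching opcode) of the Gsub node
            // mapped to j's first internal predecessor ...
            const auto& preds = int_preds_pat[j];
            std::vector<int> cands;
            for (int s : succ_sub[mapping.at(preds.front())])
                if (opcode_sub[s] == opcode_pat[j]) cands.push_back(s);
            // ... intersected with the successors of every other mapped predecessor.
            for (std::size_t k = 1; k < preds.size(); ++k) {
                const auto& succ = succ_sub[mapping.at(preds[k])];
                cands.erase(std::remove_if(cands.begin(), cands.end(), [&](int s) {
                                return std::find(succ.begin(), succ.end(), s) == succ.end();
                            }),
                            cands.end());
            }
            if (cands.empty()) return false;  // the eligible partial match is invalidated
            mapping[j] = cands.front();       // outdegree/convexity checks would go here
        }
        return true;                           // every pattern node matched: full match
    }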

3.1.1 Two algorithms for partial match identification


For analysis, let Gsub and Gpat have n and m nodes in total, respectively, and let the number of
primary nodes in each Gpat be p. We now compare our algorithm for identifying


insn_matching() {
    for each basic block Gsub {
        compute the Adjacency, Reachability and CommonSink matrices for Gsub
        for each CFU graph Gpat
            find_match(Gsub, Gpat)
    }
}

Algorithm 3.3: Instruction Matching

partial matches (Algorithm 3.4) with the partial match identification algorithm (Algorithm 3.5)
proposed by Arnold [4].

FindPartialMatches_algorithm1() {
    for each pattern Gpat in the pattern library
        for each primary pattern node Nprime_pat of Gpat {
            for each node Nsub of Gsub
                // check the opcode and outdegree constraints
                if NodeMatch(Nsub, Nprime_pat)
                    // create a new match
                    new_Match(Nsub, Nprime_pat)
            // say the i-th primary node matched with k_i nodes in Gsub
        }
    Merge_matches()
    // this identifies all partial matches:
    // num_partial_matches = k_1 * k_2 * ... * k_p, i.e. O(n^p)
}

Algorithm 3.4: Partial Matching Algorithm 1

We observed that even for the large customized instructions that could be identified in mediabench
and other applications, the value of p is generally at most 4 (for example, in Fig 5.5, p = 3 and
m = 13). Also, the number of external operands of a customized instruction is at least 2p, so a
higher value of p means more (external) source operands for the customized instruction, which
makes it harder to satisfy the register read/write port constraints. Hence p is fairly small, and our
algorithm (Algorithm 3.4) for finding partial matches is therefore much more efficient than
Algorithm 3.5, where the number of partial matches is O(n^m).
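For a rough sense of the gap, consider a hypothetical basic block of n = 50 nodes (the size is chosen only for illustration) and the CFU of Fig 5.5 (p = 3, m = 13):

    n^p = 50^3 = 1.25 × 10^5    versus    n^m = 50^13 ≈ 1.2 × 10^22.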


FindPartialMatches_algorithm2() {
    for each pattern Gpat in the pattern library
        for each node Nsub of Gsub
            for each node Npat of Gpat
                if NodeMatch(Nsub, Npat)
                    // create a new match
                    new_Match(Nsub, Npat)
    // say the i-th pattern node Npat_i matched with k_i nodes in Gsub
    Merge_matches()
    // this identifies all partial matches:
    // num_partial_matches = k_1 * k_2 * ... * k_m, i.e. O(n^m)
}

Algorithm 3.5: Partial matching algorithm 2

3.2 Instruction Selection Algorithm


Let Gsub be a data flow graph (DFG) within the control flow graph of a C application, CFU-lib
be a library of Complex Functional Units (possibly multiple-output), and Gpat_i be the data flow
graph corresponding to the i-th customized instruction (or CFU) in CFU-lib.
Given the subject graph Gsub with full matches found for all CFUs, we are required to find the
subset of full matches that, when implemented, minimizes the data-ready time of the slowest
subject graph output on the VLIW processor.
There are several possible covers; we are interested in the ones that minimize the latency of the
longest path in the subject graph. The covering algorithm given by Arnold [4] can be used for this
purpose, but it is restricted to selecting only one match when matches overlap. We have modified
it to handle the overlapping cases efficiently by replicating some of the operations that lie in the
overlapping part of two matches; a parameterized metric is used to make the replication decision.
Following is the algorithm proposed by Arnold [4]:

Cover ( ) {
EvaluateMatches()
SortGraphOutputs()
CoverOutputs()
}

Algorithm 3.6: Choosing a cover

3.2.1 Evaluating the matches


For each operand node in the subject graph, determine the data-ready time for all output matches
(A match is an output match for an operand node if that operand corresponds to an output of the
pattern that is matched) that apply to it. Sort those matches by data-ready time, ascending.

Determining output matches


The first thing to do is determine which matches are so-called output matches for which operand
nodes. A match is an output match for an operand node if that operand corresponds to an output
of the pattern that is matched. In our example, m4 is an output match of nodes e and f , but not
of c and d. In Fig 3.3, pointers are shown from each operand node to all its output matches.

Determining data ready times


The next step is to determine the ready time of each operand for each of its output matches. For
now, we assume a ready time of 0 for each of the subject graph inputs a, b and d. The ready time
of the other operand nodes is determined by the latency through the match under consideration
and the best (lowest) possible ready time of its inputs, which is in turn determined by the best
applicable match there. In our example, the ready time on node e is 2 for m2 (the maximum of the
latency from any input of pattern Gpat4 to output o5 , added to the ready time of that input, in
this case 0). The ready time for m3 is the latency through pattern Gpat2 (from o1 to o3 ) added to
the ready time of its input o1 (corresponding to node c). The data-ready time of c in turn is found
by taking the lowest possible value for all the matches found there (in this case there is only one
match, m1 ) which yields a ready time of 2. This results in a ready time of 3 on e for m3 . In Fig 3.3,
the latencies for all output matches on all operand nodes are annotated on the dashed edges.
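The computation just described can be summarized as a recurrence, where drt denotes the data-ready time of a subject operand node and lat(m, i → o) the latency from pattern input i to output o of the pattern behind match m:

    drt(o) = min over output matches m of o [ max over inputs i of m ( drt(i) + lat(m, i → o) ) ],
    with drt = 0 for the subject graph inputs.

For node e in Fig 3.3 this yields min(2, 3) = 2, as computed above.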

3.2.2 Sorting the graph outputs


Sort all subject graph output operands (operands that have no uses in the subject graph) by the
data-ready time of their best match, in descending order. The higher the data-ready time, the
higher the precedence for covering.

3.2.3 Choosing a cover


After all output matches on all subject graph operand nodes have been evaluated and sorted in
terms of best attainable data-ready time, the time has now come to make a selection from those
output matches: the cover. This is done by calling the top-level function CoverOutputs.

CoverOutputs
This function iterates over the subject graph output operands, which have been sorted by projected
data-ready time, descending, in the previous phase. It calls the recursive function ImplementBest-
Match for each of them. This function implements the match with the lowest possible data-ready
time. Starting at the subject graph output operand with the highest data-ready time for its best
match, it implements the match.

ImplementBestMatch
First we check if the current node has already been covered as part of an earlier covering iteration.
If this is the case, we return the attained data-ready time from that cover.


Figure 3.3: An example illustrating the Covering Algorithm [4]

The best available matches on the subject sub-graph rooted in node Osub are implemented,
recursively. The best available match is the first one on Osub that would not introduce any depen-
dency cycles into the subject graph.
To guarantee that the critical path to the current subject operand node is covered first, we have
to determine the precedence with which the subject operand nodes corresponding to the match
inputs are covered. This is done by calculating the slack for each one of these nodes. The slack
is defined as the difference between the time a match input should be ready (the current match
output’s data-ready time minus the pattern latency) and the data-ready time of the input’s best
match. The match inputs are sorted by slack, ascending, before covering them in that order.
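Restated as a formula, for a match m with output operand o and an input operand i (drt_best(i) being the best data-ready time found for i during match evaluation):

    slack(i) = ( drt(o) − lat(m, i → o) ) − drt_best(i).

The inputs are then covered in ascending order of slack, so the most critical input is handled first.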


CoverOutputs ( ) {
foreach subject graph output operand Osub (sorted)
ImplementBestMatch(Osub );
}

Algorithm 3.7: CoverOutputs Algorithm

The attained data-ready time is calculated and marked on the subject operand nodes. All nodes
covered by the selected matches are marked as having been covered by the function SelectMatch.
For example, in Fig 3.3, starting at e, we choose the match that causes the lowest latency, in
this case m2 (latency 2). All matches that overlap m2 have to be invalidated (m1 , m3 , m4 ) since
they can no longer be implemented (not that we would want to in this case). Now we proceed to
cover all operands that are inputs to match m2 , but these are subject graph inputs a, b and d so
we have finished covering along this path. Now we start covering on the last subject graph output:
f . The best available match is m5 , which is still available, so we can implement it. We have now
finished constructing a cover for the subject graph, consisting of matches m2 and m5 .

SelectMatch
The match selected for implementation probably overlaps several other matches. These can never
be implemented anymore, so they must be deleted. This function removes all other matches that
are referenced on the operand nodes in the selected match.
Note that any of the removed matches may have been the best match for another operand node.
This means that the best implementation, the one that match evaluations for other operands are
based on, is no longer available there. The impact of removing any match may resonate throughout
the subject graph. Keeping track of these effects is not trivial and would require a lot of extra
computational effort (each time a match is implemented, we have to re-evaluate all matches on all
nodes). We therefore settle for staying with our old match evaluation results and implement the
next best available match, as determined previously, everywhere.
This is where we modify their algorithm, in order to handle the overlapping cases differently and
more efficiently. In the modified algorithm, not all of the overlapping matches are deleted.
Consider the following case. Matches A and B overlap (Fig 3.4). Let a_i and b_i be the internal
data edges of patterns A and B, respectively, and let n1, n2 and n3 be the number of operations
in match A, match B and the overlapping area, respectively. Let sp1, sp2 and sp3 be the
corresponding gain values. We choose to replicate some nodes only if the following criteria are
met (see the sketch below the list):

• n3 should not be more than 2,

• (n2 − n3 ) should be at least 2 and also

• The gain (sp2 − sp3 ) should be more than some threshold value.

We analyze the performance under different replication criteria, to study how the replication
criteria influence the performance.
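As an illustration, a minimal sketch of the replication decision based on the criteria above; the threshold value is a tunable parameter that the description above leaves open, and the function name is our own.

    // n2, n3: number of operations in match B and in the overlapping area.
    // sp2, sp3: gain values of match B and of the overlapping area.
    bool should_replicate_overlap(int n2, int n3, double sp2, double sp3,
                                  double threshold) {
        if (n3 > 2) return false;          // the overlap must be small (at most 2 operations)
        if ((n2 - n3) < 2) return false;   // B must contribute at least 2 non-overlapping operations
        return (sp2 - sp3) > threshold;    // the remaining gain must exceed the threshold
    }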


Figure 3.4: DFG of a CFU which matches with a subgraph in adpcmdecode

There are some more conditions to be taken into consideration while selecting a match and making
the replication decision. Without loss of generality, we can assume that Pattern A is definitely
selected, and the decision of choosing Pattern B is taken depending upon the selection criteria and
the following conditions:

- if (α = aj ) then we cannot select Pattern B

- if (α = oj ) then we can select Pattern B but it should be scheduled after A and selection
criteria should be satisfied.

- if (β = ij ) this case is not possible.

- if ((α = ij ) ∨ (β = oj ) ∨ (β = aj )), we can select Pattern B and it can be scheduled in parallel


with Pattern A.

Chapter 4

Evaluation and Validation Framework

For evaluating the efficiency of the instruction matching algorithm, we need to compile the given
application code into an intermediate representation in which the C code is reduced to a DFG/CFG
representation closer to assembler, although still largely architecture independent. The dataflow
nodes should resemble generic assembler operations. This eases the instruction matching, since the
algorithm depends entirely on the topological characteristics of the constructed DAG. Either of
the two compiler infrastructures, MachSuif [1] and Trimaran [2], could be selected for this purpose.
Because of the inherent complexity of Trimaran, and since we were already familiar with the
MachSuif framework, MachSuif was selected for establishing the efficiency of the proposed
instruction matching algorithm. Afterwards, the algorithm was ported to the Trimaran framework,
with some changes in the data structures used for traversing the data-flow edges. Also, for
evaluating and validating our instruction selection algorithm, we need VLIW compiler support
that can simulate the architecture extended with customized instructions; hence the Trimaran
framework is the natural choice for validating the instruction-selection algorithm. Architectural
constraints (issue width, number of register read/write ports, etc.) can also be easily incorporated
in Trimaran.

4.1 MachSuif
Machine SUIF [1] is a flexible, extensible, and usable infrastructure for constructing compiler
back ends. It is a part of the DARPA and NSF-funded National Compiler Infrastructure project.
It has been used to develop machine-specific optimizations, construct profile-driven optimizations,
and evaluate architectural models. With it we can readily construct and manipulate machine-level
intermediate forms and emit assembly language, binary object, or C code. The system comes with
back ends for the Alpha and x86 architectures. The control-flow and data-flow analysis libraries
provided ease the building of Machine SUIF passes and provide abstractions that aid in coding
certain kinds of optimizations.

4.2 Trimaran
Trimaran [2] is a compiler infrastructure supporting state-of-the-art research in compiling for
Instruction Level Parallel (ILP) architectures. The system is oriented towards EPIC (Explicitly
Parallel Instruction Computing) architectures, and supports compiler research in what is typically
considered to be "back end" techniques such as instruction scheduling, register allocation, and
machine-dependent optimizations. The Trimaran system is based on the HPL-PlayDoh architecture
which is a parametric processor architecture conceived for research in instruction-level parallelism.
The HPL-PD opcode repertoire, at its core, is similar to that of a RISC-like load/store architecture,
with standard integer, floating point (including fused multiply-add type of operations) and memory
operations. We map the core part of our target architecture to the HPL-PD architecture.
The Trimaran compiler infrastructure, as shown in Fig 4.1, consists of a compiler front-end
(IMPACT), a compiler back-end (Elcor) and a simulator generator. The framework is parameterized
using a machine description facility (HMDES). We briefly describe each of these tools.

Figure 4.1: The Trimaran Compiler Infrastructure

The IMPACT compiler system is used by the Trimaran system as its front end. This
front-end performs ANSI C parsing, code profiling, and classical code optimizations, along with
block formation. The High Level Machine Description Facility, or HMDES, is the machine
description language used in the Trimaran system. This language describes a processor architecture
from the compiler's point of view. To this end it specifies the instruction format, resource usages
and reservation tables, latency information, operation information, and some compiler-specific
information. The instruction format conveys what operands are allowed by each type of operation,
resource usages specify how operations use processor resources as they execute, and latency
information specifies how to calculate dependence distances between operations. Finally, operation
information specifies the operations supported by the architecture and describes each of them in
terms of its scheduling alternatives, which include the format, resource usage and latency.
Elcor is Trimaran’s back-end for the HPL-PD architecture. It performs three tasks:


1. code selection and scheduling,

2. register allocation, and

3. machine dependent code optimizations.

Elcor is parameterized by the machine description facility to a large extent. As shown in Fig 4.1, it
takes as input the bridge code produced by front-end along with a HMDES machine specification
and produces an Elcor IR file. The IR is annotated with HPL-PD assembly instructions. The
internal representation of Elcor IR consists of a set of C++ objects. All optimization modules in
the Elcor IR use the interface provided by these objects to carry out optimizations. Optimizations
are simply IR to IR transformations.
The Trimaran framework also consists of a simulator which is used to generate various sta-
tistics such as compute cycles, total number of operations, etc.
The limitations of the Trimaran framework are, firstly, that it is built around the HPL-PD
architectural domain and hence only supports operations that are a subset of the HPL-PD operations.
Secondly, the Trimaran framework does not completely support clustered VLIW architectures: it
has a single register file of each type (e.g., an integer regfile, a floating-point regfile, etc.), and each
integer FU accesses the same integer regfile. Hence, we cannot evaluate performance for clustered
architectures.

4.3 Framework used for validation


The framework for our customized architecture synthesis methodology is shown in Fig 4.2.
Given the mdes description of the complex functional units, we enhance the HMDES (used by
Elcor and the simulator) with the new functional units. The simulator generator is provided with
the new AFU description to generate a simulator that supports the execution of the new operations
with correct functionality. We identify the matches for the customized instructions in three phases
(find partial matches, apply the eligibility criteria, validate the match), as discussed earlier.
This information is provided to the instruction selection unit, which selects the instances of the
customized instructions and replaces the matched subgraphs with the customized instructions.
This selection is done to reduce the execution time on the VLIW processor. Finally, how much
reduction is actually obtained is calculated by simulating the modified application code on the
extended simulator.


Figure 4.2: Framework used for validation

Chapter 5

Analysis and Results

For evaluating the performance of the proposed instruction matching algorithm, we used the bitwise
benchmarks newlife, histogram and bubblesort. The application C code, along with the description
of the CFU library, is given as input to MachSuif. Our CFU-lib consists of 6 patterns (or customized
instructions), as shown in Fig 5.1. Figs 5.2, 5.3 and 5.4 show the DFGs (Gsub) that appear in
these benchmarks.
Table 5.1 shows the number of complete (valid) matches found in a particular basic block of
each application.
The number of partial matches is O(n^p) in the worst case, but experimentally we observed it
to be much smaller. We also analyzed the effect of the heuristics in pruning the search space.
Table 5.2 shows the fraction of eligible matches for different values of p and for different
benchmarks. The eligible matches are those partial matches that satisfy the outdegree constraint,
the CommonSink constraint, etc. The fraction of eligible matches is computed as the number of
eligible matches divided by the number of partial matches. A best case of 0% for a certain
benchmark and p means that for some basic block (DFG) in that benchmark, none of the partial
matches became an eligible candidate after applying the heuristics. Without going through the
complicated task of traversing the data flow edges, we have thus filtered out a lot of unsuitable
partial matches. It is important to note that for p = 3, the number of matches that are considered
for a full match is only 0.1 to 2% of the partial matches identified in stage 1 of our algorithm. This
means that for higher values of p, the eligibility criteria are very effective in pruning the search space.
Applying the eligibility criteria requires the Reachability matrix and the CommonSink matrix to
be computed beforehand. The overhead involved is the time for computing these matrices, but as
this is done only once per basic block and is very helpful in pruning the search space, we can easily
afford this overhead.
Theoretically, the number of comparisons performed while checking the eligibility criteria is
num_partial_matches * (p * (p − 1)/2). We compare this with the number of comparisons actually
performed. On average, for p = 2 the ratio of the actual number of comparisons to
num_partial_matches * (p * (p − 1)/2) is about 2.5, and for p = 3 this ratio is about 1 (Table 5.3).
Comparing the two algorithms for finding partial matches: the two algorithms were described in
Section 3.1.1 (page 16). Table 5.4 shows, for different benchmarks, the comparison between the two
algorithms in terms of the number of partial matches found (those that are to be evaluated for
eligibility and validity).
We observed Algorithm 3.4 to be far more efficient than Algorithm 3.5. The number of par-


(a) CFU 0, m = 2, p = 1   (b) CFU 1, m = 3, p = 2   (c) CFU 2, m = 4, p = 2
(d) CFU 3, m = 5, p = 2   (e) CFU 4, m = 5, p = 2   (f) CFU 5, m = 5, p = 3

Figure 5.1: An example pattern library CFU-lib


(a) Bubblesort
CFUid  BB  valid matches
0      3   2
0      8   2
0      9   2
0      14  2

(b) Histogram
CFUid  BB  valid matches
1      5   1
0      9   1
1      14  1
0      5   2
0      18  2
0      14  3
1      23  3
0      23  7

(c) Newlife
CFUid  BB  valid matches
1      32  1
1      5   2
1      11  2
0      32  2
3      19  3
4      19  3
3      25  3
4      25  3
0      5   4
5      19  4
5      25  4
0      11  5
1      19  10
1      25  10
0      19  20
0      25  20

Table 5.1: Matches found

Benchmark   p   Best Case (%)   Worst Case (%)   Average (%)
Bubblesort  2   0               84               41.67
Bubblesort  3   0               11.11            2.23
Histogram   2   0               67               32.97
Histogram   3   0               1                0.12
Newlife     2   10.74           75               31.5
Newlife     3   0               5.56             0.205

Table 5.2: Results: Effects of Heuristics

Avg Max Avg P=2 Max P=2 Avg P=3 Max P=3
Bubblesort 2.20 3.5 2.581 3.5 0.984 1.285
Histogram 2.07 3.33 2.524 3.33 0.956 1.382
Newlife 2.01 3.5 1.567 3.5 1.29 1.64

Table 5.3: Ratio of Experimental to Theoretical number of comparisons in Eligibility check


(a) Bubblesort
CFUid  BB  matches Alg1  matches Alg2
0      3   2             10
0      8   2             6
0      9   2             4
0      14  2             6

(b) Histogram
CFUid  BB  matches Alg1  matches Alg2
0      9   1             2
0      5   2             6
0      14  3             15
0      18  3             15
1      5   6             18
0      23  7             63
1      14  15            75
1      23  63            567

(c) Newlife
CFUid  BB  matches Alg1  matches Alg2
0      32  2             8
0      5   4             20
0      11  5             30
1      32  8             32
1      5   20            100
0      19  20            680
0      25  20            680
1      11  30            180
1      19  680           23120
3      19  680           10e6
4      19  680           10e6
1      25  680           23120
3      25  680           10e6
4      25  680           10e6
5      19  39304         10e6
5      25  39304         10e6

Table 5.4: Algorithm 1 (3.4) vs. Algorithm 2 (3.5)


Figure 5.2: DFG of basic block 3 in bubblesort

Figure 5.3: DFG of basic block 5 in histogram

tial matches identified by Algorithm 3.5 is much larger than the number of partial matches
identified by our Algorithm 3.4.


Figure 5.4: DFG of basic block 25 in newlife


5.1 Case Study: adpcmdecode bench


We validated the instruction selection algorithm on adpcmdecode. Fig 5.5 shows the DFG of one of
the two CFUs identified in this application. CFU2 performs the following computation:
dest1 = (src1 & src2) | (src3 & src4)
Most of the operations are binary, so the latency of both CFUs is taken to be 1, considering the
chaining of the operations.
The results reveal about a 20% reduction in the cycles required for execution. Though we might
have expected a larger improvement, since the latency of the CFUs considered has been drastically
reduced, the observations can be analyzed as follows:

Figure 5.5: CFU1

Function      Without special FU (cycles)   With special FU, new framework (cycles)

Adpcmdecode   6736368                       5389767

Table 5.5: Predicated AdpcmDecode Results

Since the VLIW compiler itself can exploit spatial computation, some of the instructions considered
to be part of the CFU are originally scheduled in parallel with the other instructions.


No. of ALUs W/o AFUs With AFUs 1 & 2 With AFU2


N cycles ASL ESL N cycles ASL ESL N cycles ASL ESL
1 7966080 54 20 3835520 26 14 4868160 33 16
2 4530560 28 20 2212800 15 14 2507840 17 16
3 3550400 20 20 2212800 15 14 2360320 16 16
4 3550400 20 20 2212800 15 14 2360320 16 16
5 3550400 20 20 2212800 15 14 2360320 16 16

Table 5.6: Predicated AdpcmDecode Results with Varying Number of Integer ALUs

Hence, the reduction is not too drastic. This observation reveals that there is a trade-off between
the amount of ILP present and the granularity of the CFUs considered. Thus, better results can
be obtained if the CFUs have a long critical path and are composed of instructions most of which
are not scheduled in parallel with the other instructions of the basic block.
The above observation also leads us to ask whether a VLIW-based processor whose core is
augmented with a large number of fine-grained FUs, so as to extract the maximum parallelism
available in the application, but which has no special FUs, can perform as well as one with special FUs.
Hence an experiment was performed by varying the number of integer ALUs forming part of
the core of the machine. The machine architecture assumed was the default architecture provided
by the Trimaran framework; only the number of integer ALUs was varied to evaluate the performance.
The experiment compares the performance of the original processor with different numbers of
integer ALUs, a processor augmented with one CFU1 and one CFU2 functional unit, and a
processor augmented with three CFU2 functional units. The results in Table 5.6 are for the basic
block containing the CFU (with issue width = 4).
The tabulated results reveal that:

1. Although increasing the number of integer ALUs improves the processor performance, the
processor augmented with special FUs still outperforms it. This is because, once the VLIW-based
architecture has extracted the maximum parallelism out of the application, increasing the
number of integer ALUs does not in any way enhance its performance. Further performance
enhancement can then be achieved only by shortening the time taken to execute the critical
part of the application. Hence, by mapping a portion of the critical part of the application
onto a dedicated hardware unit, a further reduction in execution time is achieved, as shown
by the results.

2. ASL is the actual scheduled length of the basic block, and ESL is the scheduled length estimated
by the instruction selection algorithm described earlier. The estimate is extremely close to the
actual scheduled length (as observed from the execution statistics reported by the extended
Trimaran simulator) whenever the instructions are allowed to execute in parallel. The scheduled
length is estimated poorly only when the number of ALUs is 1, which is expected, because the
instruction selection algorithm assumes that sufficient parallelism is available; a small
illustrative sketch of this effect follows this list.
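To make the last point concrete, the sketch below is an illustration only, not the estimator used in the thesis: it contrasts a critical-path estimate of the scheduled length, which implicitly assumes that enough ALUs are available, with a simple resource-constrained list schedule. With several ALUs the two agree; with a single ALU the critical-path figure underestimates the schedule, which is exactly the behaviour of ESL in Table 5.6.

def critical_path_length(dag, latency):
    # Estimate that assumes unlimited parallelism: the longest
    # latency-weighted path through the basic-block DFG.
    memo = {}
    def depth(n):
        if n not in memo:
            memo[n] = latency[n] + max((depth(s) for s in dag.get(n, [])), default=0)
        return memo[n]
    return max(depth(n) for n in latency)

def list_schedule_length(dag, latency, num_alus):
    # Greedy resource-constrained list schedule; every op needs one integer ALU.
    preds = {n: set() for n in latency}
    for n, succs in dag.items():
        for s in succs:
            preds[s].add(n)
    finish, done, cycle = {}, set(), 0
    while len(done) < len(latency):
        ready = [n for n in latency
                 if n not in done
                 and all(finish.get(p, float("inf")) <= cycle for p in preds[n])]
        for n in ready[:num_alus]:        # issue at most num_alus ops per cycle
            finish[n] = cycle + latency[n]
            done.add(n)
        cycle += 1
    return max(finish.values())

# Toy DFG: a chain of three dependent ops plus four independent ops, latency 1 each.
dag = {"a": ["b"], "b": ["c"], "c": [], "x1": [], "x2": [], "x3": [], "x4": []}
lat = {n: 1 for n in dag}
print(critical_path_length(dag, lat))      # 3 -> estimate assuming enough ALUs
print(list_schedule_length(dag, lat, 4))   # 3 -> with 4 ALUs the estimate is tight
print(list_schedule_length(dag, lat, 1))   # 7 -> with 1 ALU the estimate is far off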

Chapter 6

Conclusion and Future Work

6.1 Conclusion
We have presented a novel and very efficient algorithm for instruction matching. It successfully
matches even multiple-output complex functional units against subgraphs of the DFG of an application.
We observed that the concept of commonsink (or common descendant) plays a very significant role in
effectively pruning the search space. Matching only the primary input nodes of the graph Gpat
corresponding to the customized instruction, together with the commonsink concept, constitutes the
crux of the algorithm. We evaluated the performance of the matching algorithm on many benchmarks and
compared its efficiency with some existing algorithms.
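As a rough, self-contained illustration of the commonsink idea (the data structures and names below are invented for the example and are not the thesis implementation): assuming that the nodes chosen as the image of Gpat's primary inputs must all converge at some node of the match, any candidate set of application-DFG nodes with no common descendant can be discarded before any detailed matching is attempted.

def descendants(dag, node):
    # All nodes reachable from `node` in a DAG given as {node: [successors]}.
    seen, stack = set(), list(dag.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag.get(n, []))
    return seen

def has_common_sink(dag, candidates):
    # True if every candidate node reaches at least one common node.
    common = None
    for c in candidates:
        reach = descendants(dag, c) | {c}
        common = reach if common is None else common & reach
        if not common:        # prune: these nodes can never feed one pattern
            return False
    return True

dfg = {1: [3], 2: [3], 3: [5], 4: [6], 5: [], 6: []}
print(has_common_sink(dfg, [1, 2]))   # True  -> keep this candidate mapping
print(has_common_sink(dfg, [1, 4]))   # False -> pruned without further matching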
We have successfully implemented the instruction matching and instruction selection algorithms for
complex (possibly multiple-output) functional units in Trimaran. This gives us a framework to quickly
evaluate the performance gain obtained when an architecture is extended with customized instructions.
The framework can be used for extensive design space exploration: the user can experiment by mapping
various compute-intensive parts of the application onto special FUs in hardware and comparing the
relative performance estimates under accurate implementation constraints. This can ease the synthesis
of ASIPs for such sets of applications.
The framework is automated; the user only has to provide the machine description corresponding to
the particular custom operation. Everything else is executed automatically, and the statistics
obtained can easily be compared with those of the original implementation using the Trimaran GUI
(if supported by the Trimaran version used). At present Trimaran 3.7 does not support the GUI, but
it provides the statistical information in DYN STATS.O.
We have used several small benchmarks from the Trimaran benchmark suite, as well as the Mediabench
application adpcmdecode, to validate our approach. The results show that custom FUs in the
architecture yield a large performance gain, which is proportional to the amount of computation
performed by the FU. The gain achieved also depends on the machine architecture assumed and on the
nature of the application. A VLIW processor augmented with coarse-grained FUs is shown to perform
better than a simple VLIW processor with enough resources to extract the maximum parallelism,
provided that the special FUs are capable of reducing the critical path delay of the application as
dictated by the machine architecture.


6.2 Future Work


1. Analysis of the performance-area trade-off

The current modeling does not take into account the relative cost of a special FU. A possible
extension is the introduction of a cost model that evaluates the area of each special FU, so that
the performance-area trade-off can also be explored.

2. Matching the functionality

At present our instruction matching algorithm does not perform arithmetic-logic reduction. We
support commutative operands in the matching algorithm, but not more complex arithmetic-logic
reductions. The algorithm can be extended in the future to match the customized instruction (Gpat)
with a subgraph (Gsub) of the DFG in which the same computation is performed as in Gpat even though
the topologies of Gpat and Gsub differ (a toy illustration of this situation follows this list).

3. Evaluating performance for clustered VLIW architectures

We do not evaluate performance for clustered VLIW architectures, since the Trimaran framework has no
notion of multiple register files or clustered architectures, as explained in an earlier chapter. A
possible extension of this work is the introduction of multiple register files into the Trimaran
infrastructure and, beyond that, the simulation of clustered VLIW architectures.

4. Control flow in AFUs

Identification and modeling of AFUs containing control flow is another interesting research
direction. The question to be answered is whether, once the limits of processor specialization have
been reached by mapping the dataflow sections of the application code, the current framework can be
further adapted to map complex control-flow parts onto dedicated hardware units. An investigative
effort will be required to determine whether ASIPs synthesized with such complex AFUs achieve
significant gains in power and performance.
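As a toy illustration of the second direction above, the two expressions below compute the same function but have different DFG topologies, so a purely structural matcher misses the equivalence, while even a brute-force semantic check over a small bit-width detects it. This is only an illustration of the problem, not a proposed algorithm.

from itertools import product

def g_pat(a, b, c):
    return a & (b | c)           # pattern: one OR feeding one AND

def g_sub(a, b, c):
    return (a & b) | (a & c)     # subgraph: two ANDs feeding one OR

# Exhaustive check over 4-bit operands: same function, different topology.
assert all(g_pat(a, b, c) == g_sub(a, b, c)
           for a, b, c in product(range(16), repeat=3))
print("functionally equivalent despite different DFG topologies")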

