A Thesis
submitted in partial fulfillment
of the requirements for the degrees
of
Bachelor of Technology
and
Master of Technology
by
Puneet Goyal
This is to certify that the thesis titled “Multiple-Output Complex Instruction Matching and Instruction Selection for VLIW Architectures” submitted by Puneet Goyal for the award of Bachelor of Technology and Master of Technology is a record of bona fide work carried out by him under my guidance and supervision at the Department of Computer Science & Engineering, Indian Institute of Technology Delhi. The work presented in this thesis has not been submitted elsewhere, either in part or in full, for the award of any other degree or diploma.
I express my deepest gratitude to all the people who have supported and encouraged me during the course of this project. I am thankful to Prof. Anshul Kumar for extending his guidance, support and invaluable time, without which this project could not have been accomplished. He encouraged me to think in a scientific and methodical manner.
I am thankful to Prof. M. Balakrishnan, Dr. Preeti Ranjan Panda and Dr. Kolin Paul for their
vision and encouragement. As a member of Embedded Systems Group I have learned a lot from
the guidance of our teachers and fellow students. I would also like to thank Nagaraju Potheneni
and Neeraj Goel for providing invaluable insights into the subject wherever possible.
Puneet Goyal
Abstract
List of Figures
List of Algorithms
1 Introduction
1.1 Problem Definition
1.2 Report Organization
Bibliography

List of Figures

List of Tables

List of Algorithms
3.1 Reachability
3.2 find_match(Gsub, Gpat)
3.3 Instruction Matching
3.4 Partial Matching Algorithm 1
3.5 Partial Matching Algorithm 2
3.6 Choosing a Cover
3.7 CoverOutputs Algorithm
Chapter 1
Introduction
In recent years, the markets for cellular phones, digital cameras, network routers, and other high-performance but special-purpose devices have grown explosively. In these systems, application-specific hardware design is used to meet the challenging cost, performance, and power demands. The most popular strategy is to build a system consisting of a number of highly specialized application-specific integrated circuits (ASICs) coupled with a low-cost core processor, such as an ARM [16]. The ASICs are specially designed hardware accelerators that execute the computationally demanding portions of the application that would run too slowly if implemented on the core processor. While this approach is effective, ASICs are costly to design and offer only a hardwired solution that permits almost no post-programmability.
An alternative design strategy is to augment the core processor with special-purpose hardware
to increase its computational capabilities in a cost-effective manner. The instruction set of the
core processor is extended to feature an additional set of operations. Hardware support is added to
execute these operations in the form of new function units or coprocessor subsystems. The Tensilica Xtensa is an example of a commercial effort in this area [7].
In order to deliver high performance, an ASIP must exploit the instruction level parallelism
(ILP) available in the given application. This brings up superscalar and VLIW as the two archi-
tectural choices. Though for general purpose computing superscalar is the architecture of choice
due to code compatibility issues, for the embedded systems and ASIP domain, VLIW architecture
appears more attractive as it offers a better possibility of customization [17].
We consider a VLIW architecture which consists of a core set of FUs augmented with application-specific coarse-grain FUs. Specializing or customizing FUs for operations (or groups of operations)
occurring in a given application can potentially lead to high performance gains because of the
following considerations [8, 11]:
• If the operands of an operation have a limited resolution (bit width), FU hardware can be
simplified and made faster.
• By chaining a sequence of operations in an FU, the computation time can be reduced.
• Concurrent operations within a group can be more easily parallelized than parallelization
across the FUs.
• By mapping a group of operations to an FU, accesses to the register file for the intermediate results are avoided. This reduction in register pressure has a beneficial effect on performance.
• Also, the computationally intensive portions of applications from the same domain (e.g.,
encryption) are often similar in structure. Thus, the customized instructions can often be
generalized in small ways to make their use have applicability across a set of applications.
Hence considerable effort has been put into this area. However, handling multiple-output complex instructions, and making a selection amongst them for better performance on VLIW architectures while respecting the architecture constraints (such as limited issue width and a limited number of register ports), is still to be explored. The present work focuses on that.
This project is part of the automation framework being developed by the IIT Delhi Embedded Systems Group. Corresponding to a given application, once the custom instructions (or complex functional units) are identified, we want:
• to develop techniques for effectively matching these complex functional units in the data-flow
segments of the control-data flow graph (CDFG), and
• to construct data-paths that best exploit the given library (which contains not only basic elements such as adders and multipliers, but also complex functional units), taking into consideration the architecture constraints for VLIW processors.
Instruction Selection: Given a subject graph Gsub with full matches found for all CFUs, find the subset of full matches that, when implemented, minimizes the data-ready time of the slowest subject graph output on the VLIW processor.
Following is an example of a Gsub (Fig 1.1) and a set of instructions (Fig 1.2). In Fig 1.2, the number of cycles corresponds to the latency of the CFU.
Matching and covering algorithms are well-known in the fields of code generation and logic syn-
thesis. Keutzer [6] was the first to recognize the similarity between the software compiler’s task of
generating code and the technology mapping problem in automated VLSI design. Both problems
can be handled with a matching algorithm, to find all possible instantiations of patterns (instruc-
tions or standard cells), followed by a covering algorithm to make a selection of matches that
optimizes some criterion (execution time, code size, VLSI area or latency, etc.).
For code generation, we need to make the selection from the given set of instructions (including the complex ones) in a way that effectively utilizes the VLIW processor and the Application-Specific Functional Units (AFUs). The goal is to minimize the schedule length. The complexity arises because we want the covering to be architecture-driven, and because the I/O time-shape of an AFU can be distributed; exploiting this allows us to use the resources (functional units and registers) efficiently and thereby reduce the execution time. Current methods generally aim to find the optimal cover within each basic block that minimizes the number of selected matches. Fewer matches translate to fewer operations to schedule, and it is expected that the increased scheduling freedom leads to a better (shorter) schedule. This assumption is the basis of most current methods; we will illustrate that it may not always hold true.
Literature Review and Motivation
Instruction Selection
Instruction selection can be formulated as directed acyclic graph (DAG) covering. Conventional methods [3] for instruction selection use heuristics that break up the DAG into a forest of trees, which are then covered optimally but independently. Independent covering of the trees may result in a suboptimal solution for the original DAG. Moreover, trees as a formulation inherently preclude the use of complex instructions in cases where internal nodes are shared.
Paolo Ienne et al. [8] make use of symbolic algebra tools for instruction selection. Polynomial representations of the basic blocks of the software application and the new instruction set of the ASIP are provided to the symbolic mapping algorithm. As opposed to tree-covering-based algorithms, their mapping is performed simultaneously with algebraic manipulations. But, as mentioned earlier, this approach is restricted to arithmetic data-flow segments only.
Alternatively, the DAG covering problem can be formulated as a binate covering problem and solved exactly or heuristically using branch-and-bound methods (Liao et al. [15]). Kastner et al. [13] proposed a greedy approach for solving this problem, but they aim to find the optimal cover within each basic block that minimizes the number of selected matches. Arnold et al. [5] propose an algorithm for covering which directly aims at reducing the data-ready time of the slowest subject graph output, but it does not handle the overlapping cases efficiently. In this project, that algorithm is improved to handle the overlapping cases and is implemented in the Trimaran framework.
Example 2.2. This example illustrates that we can improve the covering algorithm if we know the instruction-level parallelism and make use of this information while doing the covering.
Figure 2.2: Cover selected when purpose is to minimize number of patterns used to cover
Figure 2.3: Cover selected when purpose is to minimize the data ready time
Figure 2.4: Cover selected when instruction level parallelism information is not taken into
consideration
Figure 2.5: Cover selected when instruction level parallelism information is taken into con-
sideration
Let Gsub be a data-flow graph (DFG) within the control-flow graph of a C application, CFU-lib be a library of Complex Functional Units (possibly multiple-output), and Gpat_i be the data-flow graph corresponding to the ith customized instruction (or CFU) in the CFU-lib.
Given Gsub and CFU-lib, we first try to find all matches between pattern graphs Gpat (from the pattern library CFU-lib) and subgraphs of Gsub; the strategy for this is described in Section 3.1.
After all matches are found, we try to find the best cover, or selection of matches, that is, the
set of matches that, when implemented, minimizes the data-ready time of the longest path through
the subject graph. The covering approach is described in Section 3.2.
Note: In Fig 3.2, showing Gpat and Gsub, each node corresponds to an operation and each edge corresponds to a data-flow edge.
Algorithm for Instruction Matching and Selection
compute_reachability( ) {
    for (i = 0; i < num_nodes; i++)
        Reachability[i][i] = 1
    for (j = 1; j < num_nodes; j++)
        for (i = j - 1; i >= 0; i--) {
            Reachability[i][j] = Adj_matrix[i][j]
            if (Reachability[i][j] == 0)
                for (k = j; k > i; k--)
                    Reachability[i][j] += (Adj_matrix[i][k] * Reachability[k][j])
        }
}
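The pseudocode above can be sketched as runnable Python. This is an illustrative sketch, not the thesis implementation; it assumes nodes are numbered in topological order (every edge runs from a lower- to a higher-numbered node), which is what makes the single backward sweep per column sufficient for a DAG.

```python
def compute_reachability(adj):
    """Reachability over a DAG whose nodes are numbered in topological
    order (an edge i -> j implies i < j), mirroring Algorithm 3.1.

    adj[i][j] == 1 iff there is a direct edge from node i to node j.
    Returns reach with reach[i][j] == 1 iff j is reachable from i.
    """
    n = len(adj)
    reach = [[0] * n for _ in range(n)]
    for i in range(n):
        reach[i][i] = 1  # every node reaches itself
    for j in range(1, n):
        for i in range(j - 1, -1, -1):
            reach[i][j] = adj[i][j]
            if reach[i][j] == 0:
                # i reaches j iff some direct successor k of i reaches j;
                # reach[k][j] for k > i is already computed in this column
                for k in range(j, i, -1):
                    if adj[i][k] and reach[k][j]:
                        reach[i][j] = 1
                        break
    return reach
```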
CommonSink_degree_1[ ][ ] = CommonSink[ ][ ], and CommonSink_degree_d[ ][ ] = (CommonSink[ ][ ])^d. For example, in Fig 5.5, CommonSink_degree_2[p1][p3] = 1.
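The matrix power above can be illustrated with a small sketch (names are ours; we assume entries are kept boolean, i.e. products saturate to 0/1, which the text does not state explicitly):

```python
def mat_mult_bool(a, b):
    """0/1 matrix product: result[i][j] = 1 iff some k has a[i][k] and b[k][j]."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]

def commonsink_degree(cs, d):
    """CommonSink_degree_d = (CommonSink)^d with boolean entries.

    cs is the CommonSink matrix: cs[i][j] = 1 iff nodes i and j share a
    common sink. Illustrative sketch of the formula in the text.
    """
    result = cs
    for _ in range(d - 1):
        result = mat_mult_bool(result, cs)
    return result
```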
Below is an informal description of the instruction matching algorithm find_match(Gsub, Gpat). We begin by identifying the matches for primary nodes. If {y1, y2, y3, ..., yp} is a set of p nodes (operations) of Gsub that matches the p primary nodes {x1, x2, x3, ..., xp} of Gpat, then we call the set {yi} a partial match. We first identify all the partial matches for the given Gpat and Gsub. Each partial match then needs to be expanded along the DFG edges to validate it. But before doing this traversal along the DFG edges, we first apply some heuristics/constraints that must be satisfied for the partial match to be an eligible candidate for validation. We check that no two nodes within the partial match are identical or have a data-flow edge between them. We also check the outdegree constraints for each node in the partial match. We can further employ heuristics such as common sink (i.e., every pair of nodes in a partial match must reach some common node, the common sink), depending upon the Gpat graph. For example, in the Gpat shown in Fig 3.2(b), the nodes p1 and p2 have a common sink. The checks for the outdegree constraint and the common sink (heuristic) require us to compute the adjacency matrix, the reachability matrix and the common-sink matrix. As these are computed only once for each basic block and Gpat, we can use them to cut off the traversal along the DFG edges in many cases, effectively pruning the search space and reducing the complexity. The degree associated with CommonSink is a heuristic parameter, selected depending upon the size and topology of Gpat. Generally, even with considerably big patterns, we can effectively prune the search space using degree 2 in the common-sink matrix computation.
Heuristics/Constraints
1. No data dependency among the nodes (y1 , y2 , . . . , yp ) matching with primary input nodes
(x1 , x2 , . . . , xp ).
2. yi = yj only if i = j.
3. yi and yj must have some common descendant. Check this for every pair {(i, j)|i, j < p} by computing the CommonSink matrix. This is a heuristic.
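The three checks above can be sketched as a filter over candidate partial matches. This is a hypothetical helper, not the thesis code: `adj` and `commonsink` stand for the per-basic-block matrices described earlier, and "data dependency" is taken to mean a direct data-flow edge, as in the surrounding text.

```python
def is_eligible(partial_match, adj, commonsink):
    """Filter a candidate partial match (one subject node index per
    primary pattern node) before the expensive DFG traversal.

    adj[i][j] == 1 iff there is a data-flow edge i -> j in Gsub;
    commonsink[i][j] == 1 iff nodes i and j share a common sink.
    """
    p = len(partial_match)
    for a in range(p):
        for b in range(a + 1, p):
            yi, yj = partial_match[a], partial_match[b]
            if yi == yj:                    # check 2: nodes must be distinct
                return False
            if adj[yi][yj] or adj[yj][yi]:  # check 1: no data-flow edge between them
                return False
            if not commonsink[yi][yj]:      # check 3 (heuristic): common sink
                return False
    return True
```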
For each non-primary pattern node j, one of the following cases applies:

1. Node j has two operands, both internal. For example, node p4 in Fig 3.2(b).
2. Node j has two operands, one internal and one constant.
3. Node j has two operands, one external and one internal. For this case, we also need to check for the convexity constraint. For example, node p3 in Fig 3.2(b).
4. Node j has one operand, definitely internal. (Otherwise it would be a primary node.)
In each case, we check whether the Gsub nodes matched with the predecessors of this node j have any successor that can be matched with node j. (We also check the outdegree constraint and the convexity constraint here.) If yes, we visit the next non-primary node, until all the nodes in Gpat are successfully matched or the eligible partial match is invalidated.
Alg 3.3 identifies all the matches and completes our instruction matching algorithm.
For example, in Fig 3.2, Gpat is a subgraph of Gsub. Matched nodes are marked with the same colors. p1 and p2 are primary input nodes of Gpat. One of the partial matches for them is {n1, n2}, and this partial match passes the eligibility criteria. After traversing Gpat, we find that p3 and p4 match with n4 and n5 respectively. Note that if p3 were not an output node, the match would have failed because the outdegree constraint at node n4 would not be satisfied.
insn_matching( ) {
    for each basic block Gsub {
        compute Adjacency, Reachability and CommonSink matrices for Gsub
        for each CFU graph Gpat
            find_match(Gsub, Gpat)
    }
}
We compare the number of partial matches with the partial match identification algorithm (Algorithm 3.5) proposed by Arnold [4].
FindPartialMatches_algorithm1( ) {
    for each pattern Gpat in pattern library
        for each primary pattern node Nprime_pat of Gpat {
            for each node Nsub of Gsub
                // check for opcode and outdegree constraint
                if Node_match(Nsub, Nprime_pat)
                    // create a new match
                    new_Match(Nsub, Nprime_pat)
            // say the ith primary node matched k_i nodes in Gsub
        }
    Merge_matches( )
    // This identifies all partial matches.
    // Num_partial_matches = Π_i k_i, i.e. of O(n^p)
}
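The same idea can be sketched compactly in Python. This is an illustrative sketch under stated assumptions: `node_match` is a hypothetical predicate standing in for the opcode/outdegree check, and the merge step is simply a cartesian product of the per-node candidate lists.

```python
from itertools import product

def find_partial_matches(gsub_nodes, prime_pat_nodes, node_match):
    """Sketch of FindPartialMatches_algorithm1.

    For each primary pattern node, collect the Gsub nodes it can match;
    then merge the candidate lists into partial matches. With k_i
    candidates for the ith primary node this enumerates the product of
    the k_i, i.e. O(n^p) tuples for p primary nodes.
    """
    candidates = [[nsub for nsub in gsub_nodes if node_match(nsub, npat)]
                  for npat in prime_pat_nodes]
    return list(product(*candidates))
```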
We observed that even for the large customized instructions that could be identified in Mediabench and other applications, the value of p is generally at most 4 (for example, in Fig 5.5, p = 3 and m = 13). Also, we know that the number of external operands of a customized instruction is at least 2p. So a higher value of p implies more (external) source operands for the customized instruction, making it more difficult to satisfy the register read/write port constraints. Hence p is considerably smaller than m, and therefore our algorithm for finding partial matches (Algorithm 3.4) is much more efficient than Algorithm 3.5, where the number of partial matches is of O(n^m).
FindPartialMatches_algorithm2( ) {
    for each pattern Gpat in pattern library
        for each node Nsub of Gsub
            for each node Npat of Gpat
                if Node_match(Nsub, Npat)
                    // create a new match
                    new_Match(Nsub, Npat)
                // say the ith pattern node (Npat_i) matched k_i nodes in Gsub
    Merge_matches( )
    // This identifies all partial matches.
    // Num_partial_matches = Π_i k_i, i.e. of O(n^m)
}
Cover ( ) {
EvaluateMatches()
SortGraphOutputs()
CoverOutputs()
}
pattern that is matched) that apply to it. Sort those matches by data-ready time, ascending.
CoverOutputs
This function iterates over the subject graph output operands, which have been sorted by projected
data-ready time, descending, in the previous phase. It calls the recursive function ImplementBest-
Match for each of them. This function implements the match with the lowest possible data-ready
time. Starting at the subject graph output operand with the highest data-ready time for its best
match, it implements the match.
ImplementBestMatch
First we check whether the current node has already been covered in an earlier covering iteration. If this is the case, we return the attained data-ready time from that cover.
The best available matches on the subject sub-graph rooted in node Osub are implemented,
recursively. The best available match is the first one on Osub that would not introduce any depen-
dency cycles into the subject graph.
To guarantee that the critical path to the current subject operand node is covered first, we have
to determine the precedence with which the subject operand nodes corresponding to the match
inputs are covered. This is done by calculating the slack for each one of these nodes. The slack
is defined as the difference between the time a match input should be ready (the current match
output’s data-ready time minus the pattern latency) and the data-ready time of the input’s best
match. The match inputs are sorted by slack, ascending, before covering them in that order.
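The slack ordering just described can be sketched as follows. Names here are illustrative, not the thesis code: `best_match_drt` stands for the previously evaluated data-ready time of each input's best match.

```python
def order_inputs_by_slack(match_inputs, match_output_drt, pattern_latency,
                          best_match_drt):
    """Sort a match's input operand nodes so the most critical input
    (smallest slack) is covered first.

    slack = (match output's data-ready time - pattern latency)
            - (data-ready time of the input's best match)
    """
    required_by = match_output_drt - pattern_latency  # when inputs should be ready
    return sorted(match_inputs,
                  key=lambda node: required_by - best_match_drt[node])
```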
CoverOutputs ( ) {
foreach subject graph output operand Osub (sorted)
ImplementBestMatch(Osub );
}
The attained data-ready time is calculated and marked on the subject operand nodes. All nodes
covered by the selected matches are marked as having been covered by the function SelectMatch.
For example, in Fig 3.3, starting at e, we choose the match that causes the lowest latency, in
this case m2 (latency 2). All matches that overlap m2 have to be invalidated (m1 , m3 , m4 ) since
they can no longer be implemented (not that we would want to in this case). Now we proceed to
cover all operands that are inputs to match m2 , but these are subject graph inputs a, b and d so
we have finished covering along this path. Now we start covering on the last subject graph output:
f . The best available match is m5 , which is still available, so we can implement it. We have now
finished constructing a cover for the subject graph, consisting of matches m2 and m5 .
SelectMatch
The match selected for implementation probably overlaps several other matches. These can never
be implemented anymore, so they must be deleted. This function removes all other matches that
are referenced on the operand nodes in the selected match.
Note that any of the removed matches may have been the best match for another operand node. This means that the best implementation, the one that match evaluations for other operands are based on, is no longer available there. The impact of removing a match may thus propagate throughout the subject graph. Keeping track of these effects is not trivial and would require a lot of extra computational effort (each time a match is implemented, we would have to re-evaluate all matches on all nodes). We therefore keep our old match evaluation results and implement the next best available match, as determined previously, everywhere.
This is where we modify their algorithm, in order to handle the overlapping cases differently and more efficiently. In the modified algorithm, not all of the overlapping matches are deleted.
Let us consider the following case. Matches A and B overlap (Fig 3.4). Let ai and bi be the internal data edges of patterns A and B, respectively, and let n1, n2, n3 be the number of operations in match A, match B, and the overlapping area, respectively. sp1, sp2 and sp3 are the corresponding gain values. We choose to replicate some nodes only if the following criterion is met:
• The gain (sp2 − sp3) should be more than some threshold value.
We will analyze the performance under different replication criteria, to study how the replication criterion influences the performance.
Some further conditions must also be taken into consideration while selecting a match and making the replication decision. Without loss of generality, we can assume that pattern A is definitely selected, and the decision of choosing pattern B is taken depending upon the selection criterion and the following conditions:
- if (α = oj), then we can select pattern B, but it should be scheduled after A and the selection criterion should be satisfied.
For evaluating the efficiency of the instruction matching algorithm, we need to compile the given application code into an intermediate representation in which the C code is reduced to a DFG/CFG representation closer to assembler, although still largely architecture-independent. Data-flow nodes should resemble generic assembler operations. This eases the process of instruction matching, since the algorithm depends entirely on the topological characteristics of the constructed DAG. Either of the two compiler infrastructures, MachSuif [1] and Trimaran [2], could be selected for this purpose. Because of the inherent complexities involved in Trimaran, and being already familiar with the MachSuif framework, MachSuif was selected for establishing the efficiency of the proposed instruction matching algorithm. Afterwards, the algorithm was adopted into the Trimaran framework with some changes in the data structures used for traversing the data-flow edges. Also, for evaluating and validating our instruction selection algorithm, we need VLIW compiler support that can simulate the architecture extended with customized instructions. Hence the Trimaran framework is used as the natural choice for validating the instruction selection algorithm. Architectural constraints (issue width, number of register read/write ports, etc.) can also be easily incorporated in Trimaran.
Evaluation and Validation Framework

4.1 MachSuif
Machine SUIF [1] is a flexible, extensible, and usable infrastructure for constructing compiler back ends. It is part of the DARPA- and NSF-funded National Compiler Infrastructure project. It has been used to develop machine-specific optimizations, construct profile-driven optimizations, and evaluate architectural models. With it, we can readily construct and manipulate machine-level intermediate forms and emit assembly language, binary object, or C code. The system comes with back ends for the Alpha and x86 architectures. The control-flow and data-flow analysis libraries provided ease the building of Machine SUIF passes, and provide abstractions that aid in coding certain kinds of optimizations.
4.2 Trimaran
Trimaran [2] is a compiler infrastructure supporting state-of-the-art research in compiling for Instruction Level Parallel (ILP) architectures. The system is oriented towards EPIC (Explicitly
Parallel Instruction Computing) architectures, and supports compiler research in what is typically
considered to be ”back end” techniques such as instruction scheduling, register allocation, and
machine-dependent optimizations. The Trimaran system is based on the HPL-PlayDoh architecture
which is a parametric processor architecture conceived for research in instruction-level parallelism.
The HPL-PD opcode repertoire, at its core, is similar to that of a RISC-like load/store architecture,
with standard integer, floating point (including fused multiply-add type of operations) and memory
operations. We map the core part of our target architecture to the HPL-PD architecture.
The Trimaran compiler infrastructure, as shown in Fig 4.1, consists of a compiler front-end (IMPACT), a compiler back-end (Elcor), and a simulator generator. The framework is parameterized using a machine description facility (HMDES). We briefly describe each of these tools.
The IMPACT compiler system is used by the Trimaran system as its front end. This front end performs ANSI C parsing, code profiling, and classical code optimizations along with block formation. The High Level Machine Description Facility, or HMDES, is the machine description language used in the Trimaran system. This language describes a processor architecture from the compiler's point of view. To this end it specifies the instruction format, resource usages and reservation tables, latency information, operation information and some compiler-specific information. The instruction format conveys what operands are allowed by each type of operation, resource usages specify how operations use processor resources as they execute, and latency information specifies how to calculate dependence distances between operations. Finally, operation information specifies the operations supported by the architecture and describes each of them in terms of their scheduling alternatives, which include the format, resource usage and latency.
Elcor is Trimaran’s back-end for the HPL-PD architecture. It performs three tasks. Elcor is parameterized by the machine description facility to a large extent. As shown in Fig 4.1, it
takes as input the bridge code produced by front-end along with a HMDES machine specification
and produces an Elcor IR file. The IR is annotated with HPL-PD assembly instructions. The
internal representation of Elcor IR consists of a set of C++ objects. All optimization modules in
the Elcor IR use the interface provided by these objects to carry out optimizations. Optimizations
are simply IR to IR transformations.
The Trimaran framework also includes a simulator, which is used to generate various statistics such as compute cycles, total number of operations, etc.
The limitations of the Trimaran framework are that, firstly, it is built around the HPL-PD architectural domain; hence, it only supports operations which are a subset of the HPL-PD operations. Secondly, the Trimaran framework does not completely support clustered VLIW architectures. It has a single register file of each type (e.g., integer regfile, floating-point regfile). Each integer FU accesses the same integer regfile. Hence, we cannot evaluate performance for clustered architectures.
For evaluating the performance of the proposed instruction matching algorithm, we used the bitwise benchmarks newlife, histogram and bubblesort. The application C code, along with the description of the CFU library, is given as input to MachSuif. Our CFU-lib consists of 6 patterns (or customized instructions), as shown in Fig 5.1. In Fig 5.2, Fig 5.4 and Fig 5.3, we show the DFGs (Gsub) that appear in these benchmarks.
Table 5.1 shows the number of complete (valid) matches found in a particular Basic block of
that application.
The number of partial matches is of O(n^p), but experimentally we observed that it is much less than this bound. We also analyzed the effect of the heuristics in pruning the search space.
Table 5.2 shows the fraction of eligible matches for different values of p and for different benchmarks. The eligible matches are those partial matches that satisfy the outdegree constraint, the CommonSink constraint, etc. The fraction of eligible matches is computed as the number of eligible matches divided by the number of partial matches. Best case = 0% for a certain benchmark and p means that for some basic block (DFG) in that benchmark, after applying the heuristics, none of the partial matches remained an eligible candidate. Without going through the complicated task of traversing the data-flow edges, we have thus filtered out a lot of unsuitable partial matches. It is important to note that for p = 3, the matches considered for full matching are only 0.1 to 2% of the partial matches identified in stage 1 of our algorithm. This means that for higher values of p, the imposed eligibility criteria are very effective in pruning the search space.
Applying the eligibility criteria requires computing the Reachability matrix and the CommonSink matrix beforehand. The overhead involved is the computation time of these matrices, but as this is done only once per basic block and is very helpful in effectively pruning the search space, we can easily choose to pay for this overhead.
Theoretically, the number of comparisons performed while checking the eligibility criteria is num_partial_matches ∗ (p ∗ (p − 1)/2). We compare this with the number of comparisons actually done. On average, for p = 2, we found the ratio of the actual number of comparisons to num_partial_matches ∗ (p ∗ (p − 1)/2) to be about 2.5, and for p = 3 this ratio is about 1 (Table 5.3).
Comparing the two algorithms for finding partial matches: we described in Section 3.1.1 (page 16) the two algorithms for finding partial matches. Table 5.4 shows, for different benchmarks, the comparison between the two algorithms in terms of the number of partial matches (that are to be evaluated for eligibility and validity) found.
We observed Algorithm 3.4 to be much more efficient than Algorithm 3.5.

Analysis and Results

Benchmark    Avg    Max    Avg p=2  Max p=2  Avg p=3  Max p=3
Bubblesort   2.20   3.5    2.581    3.5      0.984    1.285
Histogram    2.07   3.33   2.524    3.33     0.956    1.382
Newlife      2.01   3.5    1.567    3.5      1.29     1.64

(a) Bubblesort
CFUid  BB   matches Alg1  matches Alg2
0      3    2             10
0      8    2             6
0      9    2             4
0      14   2             6

(b) Histogram
CFUid  BB   matches Alg1  matches Alg2
0      9    1             2
0      5    2             6
0      14   3             15
0      18   3             15
1      5    6             18
0      23   7             63
1      14   15            75
1      23   63            567

(c) Newlife
CFUid  BB   matches Alg1  matches Alg2
0      32   2             8
0      5    4             20
0      11   5             30
1      32   8             32
1      5    20            100
0      19   20            680
0      25   20            680
1      11   30            180
1      19   680           23120
3      19   680           10e6
4      19   680           10e6
1      25   680           23120
3      25   680           10e6
4      25   680           10e6
5      19   39304         10e6
5      25   39304         10e6

The number of partial matches identified by Algorithm 3.5 is much larger than the number of partial matches identified by our Algorithm 3.4.
Since the VLIW compiler itself can exploit spatial computation, some of the instructions considered to be part of the CFU are originally scheduled in parallel with the other instructions.

Table 5.6: Predicated AdpcmDecode Results with Varying Number of Integer ALUs

Hence, the reduction is not too drastic. This observation reveals that there is a trade-off between the amount of ILP present and the granularity of the CFUs considered. Thus, better results can be obtained if CFUs have a long critical path and are composed of instructions most of which are not scheduled in parallel with the other instructions of the rest of the basic block.
The above observation also leads us to ask whether a VLIW-based processor augmented with a large number of fine-grained FUs forming part of its core, to extract the maximum parallelism available in the application, but with no special FUs, can perform as well as one with special FUs.
Hence an experiment was performed by varying the number of integer ALUs forming a part of
the core of the machine. The machine architecture assumed was the default architecture provided
by the Trimaran framework; only the number of integer ALUs was varied to evaluate the performance.
The experiment let us compare the performance of the original processor with different
numbers of integer ALUs, of a processor augmented with one CFU1 and one CFU2 functional unit, and
of a processor augmented with three CFU2 functional units. The following results are for the basic
block containing the CFU (assuming an issue width of 4).
The tabulated results above reveal the following:
1. Although increasing the number of integer ALUs improves processor performance, the
processor augmented with special FUs still outperforms it. Once the VLIW-based architecture
has extracted the maximum parallelism out of the application, increasing the number of
integer ALUs does not in any way enhance its performance. Further performance enhancement
can then be achieved only by shortening the time taken to execute the critical part of the
application. Hence, by mapping a portion of that critical part onto a dedicated hardware
unit, a further reduction in execution time is achieved, as shown by the results.
2. ASL corresponds to the actual scheduled length of the basic block and ESL to the scheduled
length estimated by the instruction selection algorithm mentioned earlier. The estimate is
found to be extremely close to the actual scheduled length (observed from the execution
statistics given by the extended Trimaran simulator) whenever the instructions are allowed
to execute in parallel. The scheduled length is estimated poorly only when the number of
ALUs is 1, as expected, because the instruction selection algorithm assumes that sufficient
parallelism is available.
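The estimator's behaviour can be understood through a standard schedule-length lower bound (a sketch under our own assumptions, not the thesis's exact formula): the schedule length is bounded below both by the dependence (critical-path) bound and by the resource bound ceil(ops / ALUs). An estimator that relies only on the critical path is accurate when parallelism is ample and degrades when a single ALU serializes everything.

```python
import math

def schedule_length_bound(num_ops, critical_path, num_alus):
    """Classic lower bound on schedule length for unit-latency ops:
    the dependence bound and the resource bound, whichever is larger."""
    return max(critical_path, math.ceil(num_ops / num_alus))

# 12 single-cycle ops with a critical path of 5 cycles:
print(schedule_length_bound(12, 5, 4))  # 5  -> critical path dominates
print(schedule_length_bound(12, 5, 1))  # 12 -> a lone ALU serializes
```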
6.1 Conclusion
We have presented a novel and very efficient algorithm for instruction matching. It successfully
matches even multi-output complex functional units with a subgraph in the DFG of an application.
We observed that the concept of a commonsink (or common descendant) plays a very significant role
in effectively pruning the search space. Matching only the primary input nodes of the graph Gpat
corresponding to the customized instruction, together with the commonsink concept, constitutes the
crux of the algorithm. We evaluated the performance of the matching algorithm on many benchmarks
and compared its efficiency with some existing algorithms.
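The commonsink pruning idea can be sketched as follows (a simplified illustration with hypothetical names, not the thesis's exact Algorithm 3.1 or 3.3): a tentative binding of the pattern's primary inputs to DFG nodes is kept only if the bound nodes share at least one common descendant, which could then serve as the match's sink.

```python
def reachable(dfg_succ, node):
    """All nodes reachable from `node` (including itself) in the DFG,
    given a successor map {node: [successors]}."""
    seen, stack = set(), [node]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(dfg_succ.get(v, ()))
    return seen

def has_common_sink(dfg_succ, bound_nodes):
    """Keep a partial match only if every bound input node can reach
    at least one common descendant (a candidate commonsink)."""
    common = None
    for n in bound_nodes:
        r = reachable(dfg_succ, n)
        common = r if common is None else common & r
        if not common:
            return False  # prune: no node lies below all inputs
    return True

# a and b both feed c, so {a, b} has commonsink c; d is unrelated.
dfg = {"a": ["c"], "b": ["c"], "c": [], "d": []}
print(has_common_sink(dfg, ["a", "b"]))  # True
print(has_common_sink(dfg, ["a", "d"]))  # False -> pruned
```

Bindings failing this test are discarded before any structural matching is attempted, which is where the bulk of the pruning comes from.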
We have successfully implemented the instruction matching and instruction selection algorithms
for complex (possibly multiple-output) functional units in Trimaran, and this gives us a
framework to quickly evaluate the performance gain obtained when an architecture is extended with
customized instructions. The framework can be used for extensive design space exploration: the
user can experiment by mapping various compute-intensive parts of the application to special
FUs in hardware and comparing the relative performance estimates under accurate implementation
constraints. This can ease the synthesis of ASIPs for such sets of applications.
We have automated the framework; the user only has to provide the machine description
corresponding to the particular operation. Everything else is executed automatically, and
the statistics obtained can easily be compared with those of the original implementation with the
help of the Trimaran GUI (if supported by the Trimaran version used). At present, Trimaran 3.7
does not support the GUI but provides the statistical information in DYN STATS.O.
We have taken many small benchmarks which form part of the Trimaran benchmark suite, along
with the MediaBench application adpcmdecode, to validate our approach. The results obtained
show that custom FUs in the architecture yield a large performance gain. The gain is
proportional to the amount of computation performed by the FU, and the performance gain achieved
depends on the machine architecture assumed and on the nature of the application. A VLIW
processor augmented with coarse-grained FUs is shown to perform better than a simple VLIW
processor having enough resources to extract maximum parallelism, provided the special FUs are
capable of reducing the critical path delay of the application as dictated by the machine architecture.
36 Conclusion and Future Work
[3] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley,
1986.
[4] Marnix Arnold. Matching and covering with multiple output patterns. Technical Report
1-68340-44(1999)-01, Delft University of Technology, January 1999.
[5] Marnix Arnold and Henk Corporaal. Designing domain-specific processors. ACM, 2001.
[6] K. Keutzer. DAGON: Technology binding and local optimization by DAG matching. In Proceedings
of the Design Automation Conference, pages 617–623, May 1987.
[7] R. E. Gonzalez. Xtensa: A configurable and extensible processor. IEEE Micro, 2000.
[8] Paolo Ienne, Laura Pozzi, and M. Vuletic. On the limits of processor specialisation by mapping
data-flow sections on ad-hoc functional units. Technical Report CS 01/376,
LAP, EPFL, Lausanne, December 2001.
[10] Giovanni De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, 1994.
[11] Nathan T. Clark, Hongtao Zhong, and Scott A. Mahlke. Automated custom instruction generation
for domain-specific processor acceleration. IEEE Transactions on Computers, October 2005.
[12] Armita Peymandoust and Giovanni De Micheli. Symbolic algebra and timing driven data-flow
synthesis. In Proceedings of the International Conference on Computer-Aided Design, 2001.
[13] R. Kastner and S. Mazid. Instruction generation for hybrid reconfigurable systems. In Proceedings
of the International Conference on Computer-Aided Design, November 2001.
[14] R. Rudell. Logic synthesis for VLSI design. Technical Report UCB/ERL M89/49, University
of California at Berkeley, April 1989.
[15] S. Liao, S. Devadas, K. Keutzer, and S. Tjiang. Instruction selection using binate covering for code
size optimization. In Proceedings of the International Conference on Computer-Aided Design, 1995.
[17] Shail Aditya, B. R. Rau, and V. Kathail. Automatic architectural synthesis of VLIW and EPIC
processors. In Proceedings of the 12th ISSS, 1999.