

TERM PAPER CSE318

HARDWARE COMPILER
System Software

Submitted to: Ms. Ramanpreet Kaur Lamba
Submitted by: Rahul Abhishek, B.Tech (CSE) + MBA, Roll No. RK2r05A21

Regd. no 11006646

ACKNOWLEDGMENT
I have the honor to express my gratitude to Ms. Ramanpreet Kaur Lamba for her timely help, support, encouragement, valuable comments, suggestions, and many innovative ideas in carrying out this project. It has been a great opportunity for me to gain from her experience. I also owe an overwhelming debt to my classmates for their constant encouragement and support throughout the execution of this project. Above all, I consider it a blessing of the Almighty God that I have been able to complete this project. His contribution towards every aspect of my life will always remain supreme and unmatched.

Rahul Abhishek

ABSTRACT

This term paper presents the co-development of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner.

The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt.

CONTENTS

1. Introduction
2. Software and Hardware Compilers Compared
3. The Structure of the Compiler
4. Syntax Directed Translation
5. Structure of Hardware Compiler
6. Global Optimization
7. Silicon Code Generation
   7.1 Compiling the Tree Matcher
   7.2 Code Generation
   7.3 Local Optimizations for Speed
8. Peephole Optimization
9. Work-in-Progress: Global Speed Optimization
10. Evaluation of the Silicon Instruction Set
11. Working of the Hardware Compiler
12. Conclusion
13. References

1. Introduction
The direct generation of hardware from a program, that is, the automatic translation of a program specified in a programming language into a digital circuit, has been a topic of interest for a long time. So far it appeared to be a rather uneconomical method, of largely academic but hardly practical interest. It has therefore not been pursued with vigour and consequently remained an idealist's dream. It is only through the recent advances of semiconductor technology that new interest in the topic has re-emerged, as it has become possible to be rather wasteful with abundantly available resources. In particular, the advent of programmable components, devices representing circuits that are easily and rapidly reconfigurable, has brought the idea closer to practical realization by a large measure.

Hardware and software have traditionally been considered as distinct entities with little in common in their design process. Focusing on the logical design phase, perhaps the most significant difference is that programs are predominantly regarded as ordered sets of statements to be interpreted sequentially, one after the other, whereas circuits, again simplifying the matter, are viewed as sets of subcircuits operating concurrently ("in parallel"), the same activities recurring in each clock cycle forever. Programs "run", circuits are static. Another reason for considering hardware and software design as different is that in the latter case compilation ends the design process (if one is willing to ignore "debugging"), whereas in the former compilation merely produces an abstract circuit. This is to be followed by the difficult, costly, and tedious phase of mapping the abstract circuit onto a physical medium, taking into account the physical properties of the selected parts, propagation delays, fan-out, and other details.

With the progressing introduction of hardware description languages, however, it has become evident that the two subjects have several traits in common. Hardware description languages let specifications of circuits assume textual form like programs, replacing the traditional circuit diagrams by texts. This development may be regarded as the analogy of the replacement of flowcharts by program texts several decades ago. Its main promoter has been the rapidly growing complexity of programs, exhibiting the limitations of diagrams spreading over many pages. In the light of textual specifications, variables in programs have their counterparts in circuits in the form of clocked registers. The counterparts of expressions are combinational circuits of gates. The fact that programs mostly operate on numbers, whereas circuits feature binary signals, is of no further significance. It is well understood how to represent numbers in terms of arrays of binary signals (bits), and how to implement arithmetic operations by combinational circuits.

Direct compilation of hardware is actually fairly straightforward, at least if one is willing to ignore aspects of economy (circuit simplification). We follow in the footsteps of earlier work on this idea and formulate this essay in the form of a tutorial. As a result, we recognize some principal limitations of the translation process, beyond which it may still be applicable but no longer realistic. Thereby we perceive more clearly what is better left to software. We also recognize an area where implementation in hardware is beneficial, even necessary: parallelism. Hopefully we end up with a better understanding of the fact that hardware and software design share several important aspects, such as structuring and modularization, that may well be expressed in a common notation. Subsequently we will proceed step by step, in each step introducing another programming concept expressed by a programming language construct.

2. Software and Hardware Compilers Compared


Hardware compilers produce a hardware implementation from some specification of the hardware. There are, however, a number of types of hardware compilers which start from different sources and compile into different targets.

Architectural synthesis tools create a microarchitecture from a program-like description of the machine's function. These systems jointly design the microcode to implement the given program and the datapath on which the microcode is to be executed; adding a bus to the datapath, for example, can allow the microcode to be redesigned to reduce the number of cycles required to execute a critical operation. In general, architectural compilers that accept a fully general program input synthesize relatively conventional machine architectures. For this reason, architectural compilers display a number of relatively direct applications of programming-language compilation techniques. Architectural compilers design a machine's structure, paying relatively little attention to the physical design of transistors and wires on the chip.

Silicon compilers concentrate on the chip's physical design. While these programs may take a high-level description of the chip's function, they map that description onto a restricted architecture and do little architectural optimization. Silicon compilers assemble a chip from blocks that implement the architecture's functional units. They must create custom versions of the function blocks, place the blocks, and route wires between them to implement the architecture. Perhaps the most ambitious silicon compiler project is Cathedral, a family of compilers that create digital filter chips directly from the filter specifications (pass band, ripple, etc.).

Register-transfer compilers, in contrast, implement an architecture: they take as input a description of the machine's memory elements and the data computation and transfer between those memory elements, and they produce an optimized version of the hardware that implements the architecture. The target of a register-transfer compiler is a set of logic gates and memory elements; its job is to compile the most efficient hardware for the architecture. A register-transfer compiler can be used as the back end of an architectural compiler; furthermore, many interesting chips (protocols, microcontrollers, etc.) are best described directly as register-transfer machines. Even though register-transfer descriptions bear less resemblance to programming languages than do the source languages of architectural compilers, we can still apply a number of programming-language compiler techniques to register-transfer compilation.

3. The Structure of the Compiler

Our hardware compiler performs a series of translations to transform the FSM description into a hardware implementation. Figure 3 shows the compiler's structure. The compiler represents the design as a graph; successive compilation steps transform the graph to increase the level of detail in the design and to perform optimizations. The first step, syntax-directed translation, translates the source description into two directed acyclic graphs (DAGs): Gc for the combinational functions, and Gm for the connections between the combinational functions and the memory elements. Global optimization transforms Gc into a new graph Gc' that computes the same functions but is smaller and faster than Gc. Gc' and Gm are composed into a single intermediate form Gi, analogous to the intermediate code generated by a programming-language compiler. The silicon code generator transforms the intermediate form into hardware implemented in the available logic gates and registers. This implementation is cleaned up by a peephole optimizer; the result can be given to a layout program to create a final chip design.
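To make this representation concrete, the following small Python sketch shows one plausible way to hold the two graphs in memory. The class names and fields are illustrative assumptions for this paper, not the compiler's actual data structures.

from dataclasses import dataclass, field

# Illustrative sketch of the two design graphs: Gc holds combinational function
# nodes, Gm records which combinational output drives which memory element.

@dataclass
class FuncNode:                       # a vertex of Gc: a Boolean function of its inputs
    op: str                           # "AND", "OR", "NOT", or a signal name for leaves
    inputs: list = field(default_factory=list)

@dataclass
class MemNode:                        # a vertex of Gm: a clocked memory element
    name: str
    d_input: FuncNode                 # combinational function driving the register input

# Register y is loaded with a AND (NOT b) on each clock tick.
gc_root = FuncNode("AND", [FuncNode("a"), FuncNode("NOT", [FuncNode("b")])])
gm = [MemNode("y", gc_root)]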

4. Syntax Directed Translation


Translation of the source into Boolean gates plus registers is syntax-directed, much as in programming-language compilers, but with some differences because the target is a purely structural description of the hardware. The source description is parsed into a pure abstract syntax tree (AST). Leaf nodes in the tree represent either signals or memory elements, which consist of an input signal and an output signal. Subtrees describe the combinational logic functions. Translation turns the AST into the graphs Gc and Gm. Simple Boolean functions of signals are directly translatable; if statements (and by implication, switch statements) also represent Boolean functions: if (cond) z = a; else z = b; becomes the function z = (cond & a) | (!cond & b).

Syntax-directed translation is a good implementation strategy because global optimization methods for combinational logic are good enough to wipe out a wide variety of inefficiencies introduced by a naive translation of the source into Boolean functions. The simplest way to translate the source into logic is to reproduce the AST directly in the structure of the combinational function graph Gc, creating functions that are too fragmented for good global optimization. We use syntax-directed partitioning after translation to simplify the logic into fewer and larger functions: collapsing all subexpression trees into one large function, for example, takes much less CPU time than using general global optimization functions to simplify the subexpressions. This strategy is akin to simple programming-language compiler optimizations like in-line function expansion.
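As a small illustration of this translation rule (a sketch for this paper, not the compiler's code), the following Python fragment builds the multiplexer-style Boolean function for an if statement and checks it on two input combinations.

def translate_if(cond, then_expr, else_expr):
    """Syntax-directed rule: if (cond) z = a; else z = b;  ->  (cond & a) | (!cond & b)."""
    return ("OR",
            ("AND", cond, then_expr),
            ("AND", ("NOT", cond), else_expr))

def evaluate(expr, env):
    """Evaluate an expression tree over 0/1 signal values (used only to check the rule)."""
    if isinstance(expr, str):
        return env[expr]
    op, *args = expr
    v = [evaluate(a, env) for a in args]
    if op == "AND": return v[0] & v[1]
    if op == "OR":  return v[0] | v[1]
    if op == "NOT": return 1 - v[0]

z = translate_if("cond", "a", "b")
print(evaluate(z, {"cond": 1, "a": 1, "b": 0}))   # cond = 1 selects a -> 1
print(evaluate(z, {"cond": 0, "a": 1, "b": 0}))   # cond = 0 selects b -> 0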

5. Structure of Hardware Compiler

Figure 3: How to compile register-transfer descriptions into layout (FSM model → syntax-directed translation → combinational logic and memory elements → global optimization → intermediate form → silicon code generation → peephole optimization → automatic layout → mask for manufacture).

6. Global Optimization

As in programming-language compilation, global optimization tries to reduce the resources required to implement the machine. The resources in this case are the combinational logic functions: we want to compute the machine's outputs and next state using the smallest, fastest possible functions. Combinational logic optimization combines minimization, which eliminates obvious redundancies in the combinational logic (analogous to dead-code elimination) and simplifies the logic (analogous to strength reduction and constant folding), with common subexpression elimination, which reduces the size of the logic functions. Compiler-oriented methods have been used with success by the IBM Logic Synthesis System to solve both the minimization and common subexpression elimination problems, but these methods can become stuck in local minima. We use other global optimization techniques, very different from programming-language compiler methods, that take advantage of the restrictions of the Boolean domain. We reduce the combinational logic to a canonical sum-of-products form and use modern extensions [8] to the classical Quine-McCluskey optimization procedures. We eliminate common subexpressions using a technique [9] tuned to the canonical sum-of-products representation. Because Boolean logic is a relatively simple theory, our global optimizer can deduce many legal transformations of the combinational functions and explore a large fraction of the total design space. Global optimization commonly reduces total circuit area by 15%.

Global logic optimization is similar to machine-independent code optimization in programming-language compilers in that it is performed on an intermediate form, in this case networks of Boolean functions in sum-of-products form. As in programming-language compilation, many optimizations can be performed on this implementation-independent representation; but, also as in programming languages, some optimizations cannot be identified until the target (here, the hardware gates and registers available) is known. For example, if the available hardware includes a register that provides its output in both true and complemented form, we can use that complemented form directly rather than compute it with an external inverter. Global optimization algorithms for Boolean functions are not good at finding these implementation-dependent optimizations; the code-generation phase transforms the intermediate form into efficient hardware.
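To make the minimization step concrete, the following self-contained Python sketch implements the prime-implicant merging phase of the Quine-McCluskey procedure mentioned above. It is an illustrative toy, not the compiler's optimizer, and it omits the subsequent cover-selection step.

from itertools import combinations

def combine(a, b):
    """Merge two implicants (strings over '0', '1', '-') that differ in exactly one bit."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and a[diff[0]] != '-' and b[diff[0]] != '-':
        return a[:diff[0]] + '-' + a[diff[0] + 1:]
    return None

def prime_implicants(minterms, nbits):
    """Repeatedly merge implicants; those that merge no further are prime."""
    terms = {format(m, f'0{nbits}b') for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a, b in combinations(terms, 2):
            c = combine(a, b)
            if c:
                merged.add(c)
                used.update((a, b))
        primes |= terms - used
        terms = merged
    return primes

# f(a, b, c) with minterms 0, 1, 2, 5, 6, 7: prints its six prime implicants.
print(prime_implicants({0, 1, 2, 5, 6, 7}, 3))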


7. Silicon Code Generation


The portion of our system that makes the most innovative use of compiler methods is the silicon code generator. Our circuits are implemented using a library of registers and gates that implement a subset of the Boolean functions: NAND, NOR, AND-OR-INVERT (AOI), and so on. This library forms the silicon instruction set; it is an exact analog of a machine's instruction set. The silicon code generator's job is to translate the intermediate form produced by the global optimizer into the best silicon code: networks of the gates and registers available in the library. The first approach to silicon code generation was ad hoc, reminiscent of the "do what you have to do" era of code generation. Another approach that attracted interest used a rule-based expert system [19]. But silicon code generation is so similar to tree-matching methods for retargetable code generation that we use twig [2][30], a tree manipulator used for constructing code generators for programming-language compilers, to create the DAGON silicon code generator [21]. DAGON can implement hardware optimized for speed, area, or some function of both.

Figure 4: The silicon code generation process.

Figure 4 shows how silicon code generation works. For each new set of silicon instructions we must generate a tree-matching program. Twig compiles a set of patterns and actions that describe the code transformations into a tree matcher for that library. The code-generation phase of a hardware compilation run starts from the intermediate form Gi, which includes both the Boolean functions for the combinational logic and the memory elements. That graph is parsed and broken into a forest of trees which are fed to the code generator.


As we will describe later, code generation can be performed iteratively to optimize the silicon code for speed. At the end of code generation the intermediate form has been compiled into an equivalent piece of hardware built from the available library components.

7.1 Compiling the Tree Matcher

Code generation finds the best mapping from the intermediate form onto the object code by tree matching. We want an efficient implementation of the matching program, but we also want to be able to generate the matching automaton from the code patterns and the actions. The patterns supplied to twig either describe direct mappings of the intermediate form onto the target code or describe a mapping that contains a simple local optimization. Local optimizations may be thought of as clever code sequences that are not natural matches of the instruction set.

A twig pattern statement has three parts. The first is the pattern itself, written in a form like that used by parser generators. A pattern can be used to describe a class of simple pattern instances by parameterizing the pattern. For instance, five patterns for AOIxxx actually describe sixty-four pattern instances, from an AOI444 to an AOI211. Many of these patterns are symmetric: an AOI114 is equivalent to, and would be output as, an AOI411. The second part of a pattern statement is a formula that defines the cost of applying the pattern. Separately we supply twig with functions for adding, computing, and comparing costs. The tree matcher computes the cost for a tree using the user-supplied routines and uses these costs to choose, by dynamic programming, the minimum-cost match over the entire tree. The third part of a pattern statement is either a rewrite command or an action routine. Some matches cause the input circuit to be transformed and reprocessed to find new matches, while others cause immediate actions. DAGON's action is always to write out the code chosen by the match. The patterns, actions, and cost functions are compiled by twig into a tree-matching automaton.
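The following short Python sketch illustrates how a few parameterized patterns expand into many instances. It assumes, for illustration only, an AOI family with three AND groups of one to four inputs each (4^3 = 64 instances) and folds symmetric instances onto a canonical name.

from itertools import product

def aoi_instances(max_width=4):
    """Enumerate AOI (AND-OR-INVERT) instances with three AND groups of 1..max_width
    inputs, folding symmetric instances (e.g. AOI114 and AOI411) onto one canonical name."""
    seen = {}
    for widths in product(range(1, max_width + 1), repeat=3):
        name = "AOI" + "".join(str(w) for w in sorted(widths, reverse=True))
        seen.setdefault(name, []).append(widths)
    return seen

instances = aoi_instances()
print(len(instances), "canonical gates cover",
      sum(len(v) for v in instances.values()), "pattern instances")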


Figure 5: Partitioning the intermediate form DAG into trees

The automaton finds the optimum match for a tree in two phases: first, finding all the patterns that match subtrees; then, finding the set of matches that covers the tree at the lowest cost. We will describe these steps in detail below. We have written 60 parameterized patterns to describe the AT&T Cell Library. These patterns expand into over seven hundred pattern instances that represent permutations of our hardware library of approximately 165 gates and a few thousand unique local optimizations. The modularity afforded by describing the code generator as patterns, actions, and cost functions makes it easy to port our code generator to a new hardware library, just as code generators can be ported to new instruction sets. To date we have created code generators for six different hardware libraries.


Figure 6: An intermediate form tree and all its possible matches.

7.2 Code Generation

The intermediate form Gi is a DAG, so it must be broken into a forest of trees before being processed by the tree matcher produced by twig. Figure 5 shows an intermediate form DAG partitioned into three trees; the Xs show where the DAG has been cut. The partitioned DAG is sent one tree at a time to the tree matcher. It first determines all possible matches for each vertex. The matching automaton finds all possible matches using the Aho-Corasick algorithm [1], which works in time O(TREESIZE × MAX_MATCHES): the size of the tree times the maximum number of matches possible at any node. Figure 6 shows a tree and all its matches. Each node matched a pattern corresponding to either a two-input NAND gate (ND2) or an inverter (INV); in addition, the subtree formed by the top three ranks of nodes matched a larger pattern corresponding to an AOI21 gate.

The optimum code is generated by finding the least-cost match of patterns that covers the tree from among all the possible matches. The tree matcher generated by twig finds the optimum match by dynamic programming [4]. Optimum tree matching can be solved by dynamic programming because the best match for the root of a tree is a function of the best matches for its subtrees. Figure 7 shows how the pattern-matching automaton finds the optimum match for our example tree. The tree matcher starts from the leaf nodes and successively matches larger subtrees. The nodes at the bottom of the tree each have only one match; the number shown with the name of the best match is the total cost of that subtree. The root node has two matches: the INV match gives a total tree cost of 13; the AOI21 match (which covers several previously matched nodes) gives a tree of cost 7. The AOI21 match gives the smallest total cost, so the automaton labels the node with AOI21. This example is made simple because we tried to match only a few instructions; matching even this simple DAG against our full instruction set would give four to five matches per node.
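The following self-contained Python sketch illustrates the dynamic-programming covering step on a toy library. The pattern templates and costs are invented for this example; they are not DAGON's actual library or cost model.

# Toy "silicon instructions": template over primitive ops with '*' marking inputs, plus a cost.
PATTERNS = {
    "INV":   (("NOT", "*"), 2),
    "ND2":   (("NOT", ("AND", "*", "*")), 3),
    "AOI21": (("NOT", ("OR", ("AND", "*", "*"), "*")), 4),
}

def match(template, tree, leaves):
    """If template matches the top of tree, collect the subject subtrees at '*' positions."""
    if template == "*":
        leaves.append(tree)
        return True
    if isinstance(tree, str) or tree[0] != template[0] or len(tree) != len(template):
        return False
    return all(match(t, s, leaves) for t, s in zip(template[1:], tree[1:]))

def best_cover(tree, memo=None):
    """Least total cost of covering the tree with library patterns, by dynamic programming."""
    memo = {} if memo is None else memo
    if isinstance(tree, str):                 # a leaf signal costs nothing
        return 0
    if tree not in memo:
        best = float("inf")
        for template, cost in PATTERNS.values():
            leaves = []
            if match(template, tree, leaves):
                best = min(best, cost + sum(best_cover(l, memo) for l in leaves))
        memo[tree] = best
    return memo[tree]

# NOT(OR(AND(a, b), c)): one AOI21 (cost 4) covers the whole tree; matching only an INV
# at the root would leave an OR node that no pattern in this toy library covers.
print(best_cover(("NOT", ("OR", ("AND", "a", "b"), "c"))))   # -> 4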


Pure tree matching would not be able to handle two important cases: matching non-tree patterns, like exclusive-or gates, and matching patterns across tree boundaries. We handle these cases by associating attributes that describe global information (fan-out, inverse, and signal) with each vertex in the tree. Inverse tells whether the inverted sense of a signal is available somewhere in the circuit. This is very useful because a number of area-reducing optimizations can often be made if the inverted sense of a signal can be used in place of the signal itself. Signal gives the name of a signal, which can be used with inverse to represent a non-tree pattern such as an exclusive-or. Fan-out is used to determine timing costs. Because patterns are matched against a forest of trees rather than the original DAG, DAGON may not find the globally optimal match. Finding an optimal DAG matching (DAG covering) is an NP-complete problem [31]. However, the combination of tree matching and peephole optimizations has proven to be an effective method for silicon code generation. The MIS system tried extending tree matching to DAG covering [17], but the results were disappointing. MIS now uses tree matching for code generation.


Figure 7: Finding an optimal match.

7.3 Local Optimizations for Speed

While not all industrial designs have to run at high speed, there are always circuits that have to run very fast. Because path delay is a global property of the hardware, it can be hard for designers to optimize large machines for speed. Furthermore, reducing the delay of a piece of hardware almost always costs additional area. We want to meet the designer's speed requirement at a minimum area cost. As Figure 4 shows, code generation can be applied iteratively to jointly optimize for speed and area. The first pass through the tree matcher produces hardware strictly optimized for area. A static delay analyzer finds parts of the hardware that must be sped up and produces speed constraints. The code generated by the previous pass is translated back into the intermediate form and sent through the tree matcher along with the constraints. The constraints cause the matcher to abort matches that cannot run at the required speed. Because many Boolean functions are represented by both low-speed/low-power and high-speed/high-power gates in the library, the matcher often need only substitute the high-speed version of a gate to meet the speed constraint, but a constraint can force the entire tree to be rematched. By this procedure we have been able to routinely double the speed of circuits with an area penalty of only 10%. This is such a significant improvement that DAGON is being used for timing optimization of manually designed hardware. Speed optimization is sufficiently hard that DAGON has been able to similarly improve even these hand-crafted designs.
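As a rough, self-contained illustration of the substitution idea (the library, delays, areas, and the greedy heuristic below are invented for this example; DAGON actually rematches whole trees under constraints), consider the following Python sketch.

GATE_VARIANTS = {                 # gate name: [(delay, area) slow variant, (delay, area) fast variant]
    "ND2": [(2.0, 1.0), (1.2, 1.6)],
    "INV": [(1.0, 0.5), (0.6, 0.9)],
}

def speed_up_path(path, required_delay):
    """Upgrade gates on a register-to-register path to their fast variants until the
    delay constraint is met, greedily picking the best delay gain per unit of extra area."""
    choice = [0] * len(path)                  # 0 = slow/small variant, 1 = fast/large variant
    delay = lambda: sum(GATE_VARIANTS[g][choice[i]][0] for i, g in enumerate(path))
    area = lambda: sum(GATE_VARIANTS[g][choice[i]][1] for i, g in enumerate(path))
    def gain(i):
        slow, fast = GATE_VARIANTS[path[i]]
        return (slow[0] - fast[0]) / (fast[1] - slow[1])
    while delay() > required_delay:
        candidates = [i for i in range(len(path)) if choice[i] == 0]
        if not candidates:
            break                             # the constraint cannot be met by substitution alone
        choice[max(candidates, key=gain)] = 1
    return delay(), area()

print(speed_up_path(["ND2", "INV", "ND2", "INV"], required_delay=4.5))   # -> (4.4, 4.2)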

8. Peephole Optimization
Inverters are the bane of a silicon code generator. Because trees are matched independently, it is common for a signal to be inverted, due to a local optimization, in two or more different trees when a single inverter would do the job. We clean up these and other problems with a peephole optimization phase that minimizes the number of inverters used. As in programming-language compilers, the peephole optimizer looks at the circuit through a small window; but this window reveals a portion of the circuit graph, not a portion of code text as in an object-code peephole optimizer. Peephole optimization reduces combinational logic area by as much as 10%.
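A minimal Python sketch of one such peephole rule is shown below; the netlist format and the rule itself are illustrative assumptions, not the optimizer's actual implementation.

def merge_duplicate_inverters(netlist):
    """netlist: list of (gate_type, output_name, [input_names]). If the same signal is
    inverted by several INV gates, keep one inverter and rewire the other consumers."""
    surviving_inv = {}          # inverted signal -> output of the inverter we keep
    rename = {}                 # removed inverter output -> surviving inverter output
    kept = []
    for gate, out, ins in netlist:
        if gate == "INV":
            src = ins[0]
            if src in surviving_inv:
                rename[out] = surviving_inv[src]   # drop this duplicate inverter
                continue
            surviving_inv[src] = out
        kept.append((gate, out, ins))
    return [(g, o, [rename.get(i, i) for i in ins]) for g, o, ins in kept]

net = [("INV", "na1", ["a"]), ("INV", "na2", ["a"]),
       ("ND2", "y", ["na1", "b"]), ("ND2", "z", ["na2", "c"])]
print(merge_duplicate_inverters(net))          # na2 disappears; z now reads na1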

9. Work-In-Progress: Global Speed Optimization


While circuit timing problems related to very local circuit properties, such as high capacitive loading, are well solved by the techniques described in the previous sections, not all timing problems in digital hardware can be solved by a single application of local optimizations. Some timing problems are due to having too many gates in a particular path through the hardware. To solve this problem the hardware must often be drastically restructured. As local optimizations using tree matching are relatively inexpensive (linear time), our current approach is to iteratively apply the local optimizations to achieve more global transformations. This procedure is both aesthetically and practically unsatisfying.


We would prefer to make hardware transformations that more quickly result in hardware which meets timing requirements. Tree height reduction algorithms [10], developed in the context of reducing the height of arithmetic expression trees to speed their execution on a parallel processor, suggest an approach to this problem. While this method, and its successors, has promise, we have found two limitations in its application to hardware. First, when compiling to a fixed architecture, resources are free up to the point of saturation and infinitely expensive after that; thus it makes sense to reduce tree height until all processors are saturated. When designing hardware, however, we are interested in speeding up the hardware to some limit specified by the user. We have no fixed resource limit in silicon code generation to provide a lower bound on tree height, but maximally reducing all trees would create hardware much larger than necessary. Secondly, we are concerned with minimizing the height of DAGs, not trees. We are presently investigating techniques for height minimization of a DAG.
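For reference, the basic balancing idea behind tree-height reduction for an associative operator can be sketched in a few lines of Python (an illustrative toy, not the restructuring algorithm under investigation).

def balance(op, operands):
    """Restructure a chain op(op(op(a, b), c), d) of an associative operator into a
    balanced tree, reducing its height from len(operands) - 1 to ceil(log2(len(operands)))."""
    nodes = list(operands)
    while len(nodes) > 1:
        nodes = [(op, nodes[i], nodes[i + 1]) if i + 1 < len(nodes) else nodes[i]
                 for i in range(0, len(nodes), 2)]
    return nodes[0]

# A 4-input AND chain of depth 3 becomes a tree of depth 2.
print(balance("AND", ["a", "b", "c", "d"]))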

10. Evaluation of the Silicon Instruction Set


Highly portable compilers for fixed architectures encouraged the evaluation of instruction sets to determine the most cost-effective set of instructions: instruction sets that gave the fastest compiled code at the minimum cost in hardware. A larger silicon instruction set does not have the unpleasant side effect of increasing the complexity of the implementation, but it does require more human resources to develop and maintain. The cost of developing and maintaining the instruction set must be weighed against the improvement in the speed and area of hardware that can be compiled. The availability of a highly portable silicon code generator made possible a long overdue experiment: evaluating the impact of the silicon instruction set on the quality of the final hardware. We evaluated five different silicon instruction sets ranging in complexity from 2 to 125 gates. Our results on the twenty standard benchmark examples show a monotonic, though exponentially decreasing, reduction in chip area, ranging from 26% (7 instructions) to 45% (125 instructions). These results were further verified against a number of industrial designs. If we compare chip area with the size of object code, these results simply show that with the addition of more and more powerful instructions the size of the object code is reduced. What was less predictable was the knee of the curve of area reduction per number of instructions. Beyond an instruction set size of 37 instructions, which gave an area reduction of 40%, adding instructions gave little area reduction: an additional 88 instructions only reduced area by an additional 2%. A more surprising result was that all silicon instruction sets showed comparable speed.

11. Working of the Hardware Compiler


11.1. Variable Declarations

All variables are introduced by their declaration. We use the notation

VAR x, y, z: Type

In the corresponding circuit, the variables are represented by registers holding their values. The output carries the name of the variable represented by the register, and the enable signal determines when a register is to accept a new input.


We postulate that all registers be clocked by the same clock signal, that is, all derived circuits will be synchronous circuits. Therefore, clock signals will subsequently not be drawn and are assumed to be implicit.

11.2. Assignment Statements

An assignment is expressed by the statement

y := f(x)

where y is a variable, x is (a set of) variables, and f is an expression in x. The assignment statement is translated into the circuit according to Fig. 1, with the function f resulting in a combinational circuit (i.e. a circuit containing neither feedback loops nor registers). e stands for enable; this signal is active when the assignment is to occur. Given a time t when the register(s) have received a value, the combinational circuit f yields the new value f(x) after a time span pd, called the propagation delay of f. This requires the next clock tick to occur no earlier than time t + pd. In other words, the propagation delay determines the clock frequency: f < 1/pd. The concept of the synchronous circuit dictates that the frequency is chosen according to the circuit f with the largest propagation delay among all function circuits.

The simplest types of variables have only two possible values, say 0 and 1. They are said to be of type BIT (or Boolean). However, we may just as well also consider composite types, that is, variables consisting of several bits. A type Byte, for example, might consist of 8 bits, and the same holds for a type Integer. The translation of arithmetic functions (addition, subtraction, comparison, multiplication) into corresponding combinational circuits is well understood. Although they may be of considerable complexity, they are still purely combinational circuits.
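The behaviour of such a register with an enable, and the translation of y := f(x) onto it, can be sketched in a few lines of Python. This is a behavioural toy with an arbitrary example function f, not generated hardware.

class Reg:
    """A clocked register with an enable: it accepts its input only on clock ticks
    where the enable signal is active."""
    def __init__(self, init=0):
        self.q = init            # current output value
        self._d = init           # value waiting at the register input
    def set_input(self, value):
        self._d = value
    def tick(self, enable):
        if enable:
            self.q = self._d     # the register accepts a new value on this clock edge

def f(x):                        # the combinational circuit for the assignment y := f(x)
    return (x + 1) & 0xFF        # example: an 8-bit incrementer

x, y = Reg(init=7), Reg()
for cycle in range(3):
    y.set_input(f(x.q))          # the combinational logic settles within one clock period
    y.tick(enable=1)             # e is active, so the assignment takes effect
    print(f"cycle {cycle}: y = {y.q}")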

11.3. Parallel Composition

We denote parallel composition of two statements S0 and S1 by

S0, S1

The meaning is that the two component statements are executed concurrently. The characteristic of parallel composition is that the affected registers feature the same enable signal. Translation into a circuit is straightforward, as shown in Fig. 1 (right side) for the example

x := f(x, y), y := g(x, y)

18

11.4. Sequential Composition

Traditionally, sequential composition of two statements S0 and S1 is denoted by

S0; S1

The sequencing operator ";" signifies that S1 is to be executed after completion of S0. This necessitates a sequencing mechanism. The example of the three assignments

y := f(x); z := g(y); x := h(z)

is translated into the circuit shown in Fig. 2, with enable signals e corresponding to the statements.

The upper line contains the combinational circuits and registers corresponding to the assignments. The lower line contains the sequencing machinery assuring the proper sequential execution of the three assignments. With each statement is associated an individual enable signal e. It determines when the assignment is to occur, that is, when the register holding the respective variable is to be enabled. The sequencing part is a "one-hot" state machine. The following table illustrates the signal values before and after each clock cycle.

e0 e1 e2 e3   x            y      z
1  0  0  0    x0           ?      ?
0  1  0  0    x0           f(x0)  ?
0  0  1  0    x0           f(x0)  g(f(x0))
0  0  0  1    h(g(f(x0)))  f(x0)  g(f(x0))

The preceding example is a special case in the sense that each variable is assigned only a single value, namely y is assigned f(x) in cycle 0, z is assigned g(y) in cycle 1, and x is assigned h(z) in cycle 2. The generalized case is reflected in the following short example:


x := f(x); x := g(x)

which is transformed into the circuit shown in Fig. 3. Since x receives two values, namely f(x) in cycle 0 and g(x) in cycle 1, the register holding x needs to be preceded by a multiplexer.
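Returning to the three-assignment example, the behaviour of the one-hot sequencing machinery and the resulting signal table can be reproduced with a short Python sketch (illustrative only; the lambdas stand in for arbitrary combinational circuits f, g and h).

def run_sequence(x0, f, g, h):
    """Behavioural sketch of the one-hot sequencer for y := f(x); z := g(y); x := h(z)."""
    x, y, z = x0, "?", "?"       # '?' marks registers that have not been written yet
    e = [1, 0, 0, 0]             # one-hot enable vector e0..e3
    for _ in range(4):
        print(e, x, y, z)        # signal values at the start of the clock cycle
        if e[0]: y = f(x)        # cycle 0: y := f(x)
        if e[1]: z = g(y)        # cycle 1: z := g(y)
        if e[2]: x = h(z)        # cycle 2: x := h(z)
        e = [0] + e[:-1]         # the single 1 moves on to the next state register

run_sequence("x0", lambda v: f"f({v})", lambda v: f"g({v})", lambda v: f"h({v})")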

11.5. Conditional Composition

This is expressed by the statement


IF b THEN S END

where b is a Boolean variable and S a statement. The circuit shown in Fig. 4 on the left side is derived from it, with S standing for y := f(x). The only difference from Fig. 1 is the derivation of the associated sequencing machinery yielding the enable signal for y.

It now emerges clearly that the sequencing machinery, and it alone, directly reflects the control statements, whereas the data parts of circuits reflect assignments and expressions. The conditional statement is easily generalized to the following form:

IF b0 THEN S0 ELSE S1 END

In its corresponding circuit on the right side of Fig. 4 we show the sequencing part only, with the enable signals corresponding to the various statements. At any time, at most one of the statements S is active after being triggered by e0.

11.6. Repetitive Composition

We consider repetitive constructs traditionally expressed as repeat and while statements. Since the control structure of a program statement is reflected by the sequencing machinery only, we may omit showing associated data assignments, and instead let each statement be represented by its associated enable signal only. We consider the statements

REPEAT S UNTIL b    and    WHILE b DO S END

where b again stands for a Boolean variable. They translate into the circuits shown in Fig. 5, where e stands for the enable signal of statement S.


11.7. Selective Composition

The selection of one statement out of several is expressed by the case statement, whose corresponding sequencing circuitry, containing a decoder, is shown in Fig. 6.

CASE k OF 0: S0 | 1: S1 | 2: S2 | ... | n: Sn END
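The decoder that drives the statement enables can be sketched in a single Python function (a behavioural illustration, not generated hardware).

def decode(k, n):
    """One-hot decoder: input k activates exactly one of the n statement enables e0..e(n-1)."""
    return [1 if i == k else 0 for i in range(n)]

print(decode(2, 4))   # only S2 is enabled: [0, 0, 1, 0]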

11.8. Preliminary Discussion


At this point we realize that arbitrary programs consisting of assignments, conditional statements, and repeated statements can be transformed into circuits according to fixed rules, and therefore automatically. However, it is also to be feared that the complexity of the resulting circuits may quickly surpass reasonable bounds. After all, every expression occurring in a program results in an individual combinational circuit; every add symbol yields an adder, every multiply symbol turns into a combinational multiplier, potentially with hundreds or thousands of gates. We agree that for the present time this scheme is hardly practical except for toy examples. Cynics will remark, however, that soon the major problem will no longer be the economical use of components, but rather to find good use of the hundreds of millions of transistors on a chip.

In spite of such warnings, we propose to look at ways to reduce the projected hardware explosion. The best solution is to share subcircuits among various parts. This, of course, is equivalent to the idea of the subroutine in software. We must therefore find a way to translate subroutines and subroutine calls. We emphasize that the driving motivation is the sharing of circuits, and not the reduction of program text. Therefore, a facility for declaring subroutines and textually substituting their calls by their bodies, that is, handling them like macros, must be rejected. This is not to deny the usefulness of such a feature, as it is for example provided in the language Lola, where the calls may be parameterized [2, 3]. But with its expansion of the circuit for each call it contributes to the circuit explosion rather than being a facility to avoid it.

11.9. Subroutines

The circuits obtained so far are basically state machines. Let every subroutine translate into such a state machine. A subroutine call then corresponds to the suspension of the calling machine, the activation of the called machine, and, upon its completion, a resumption of the suspended activity of the caller. A first measure to implement subroutines is to provide every register in the sequencing part with a common enable signal. This allows a given machine to be suspended and resumed. The second measure is to provide a stack (a first-in last-out store) of identifications of suspended machines, typically numbers from 0 to some limit n.

We propose the following implementation, which is to be regarded as a fixed base part of all circuits generated, analogous to a runtime subroutine package in software. The extended circuit is a pushdown machine. The stack of machine numbers (analogous to return addresses) consists of a (static) RAM of m words, each of n bits, and an up/down counter generating the RAM addresses. Each of the n bits in a word corresponds to one of the circuits representing a subroutine. One of them has the value 1, identifying the circuit to be reactivated. Hence, the (latched) readout directly specifies the values of the enable signals of the n circuits. This is shown in Fig. 7. The stack is operated by the push signal, incrementing the counter and thereafter writing the applied input to the RAM, and the pop signal, decrementing the counter.


The push signal is activated whenever a state register belonging to a call statement becomes active. Since a subroutine may contain several calls itself, these register outputs must be ORed; a better solution would be a bus. The pop signal is activated when the last state of a circuit representing a subroutine is reached. An obvious enhancement, with the purpose of enlarging the number of possible subcircuits without undue expansion of the RAM's width, is to place an encoder at the input and a decoder at the output of the RAM. A width of n bits then allows for 2^n subroutines.

The scheme presented so far excludes recursive procedures. The reason is that every subcircuit must have at most one suspension point, that is, it can have been activated at most once. If recursive procedures are to be included, a scheme allowing for several suspension points must be found. This implies that the stack entries must not only contain an identification of the suspended subcircuit, but also of the suspension point of the call in question. This can be achieved in a manner similar to storing the circuit identification, either as a bit set or an encoded value. The respective state re-enable signals must then be properly qualified, further complicating the circuit. We shall not pursue this topic any further.
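The stack mechanism can be illustrated with a small behavioural Python sketch; the word width, depth and the direct use of one-hot identifiers are illustrative assumptions.

class CallStack:
    """Sketch of the subroutine mechanism: a RAM-backed stack of one-hot machine
    identifiers plus an up/down counter. push suspends the caller; pop re-enables
    the machine recorded on top of the stack."""
    def __init__(self, depth):
        self.ram = [0] * depth          # each word holds a one-hot machine identifier
        self.counter = 0                # up/down counter generating the RAM address
    def push(self, caller_onehot):
        self.counter += 1               # increment, then write the suspended caller
        self.ram[self.counter] = caller_onehot
    def pop(self):
        enables = self.ram[self.counter]   # latched readout drives the enable signals
        self.counter -= 1
        return enables

stack = CallStack(depth=8)
stack.push(0b0001)        # machine 0 calls a subroutine: machine 0 is suspended
stack.push(0b0010)        # machine 1 (the subroutine) itself calls another one
print(bin(stack.pop()))   # 0b10: machine 1 resumes first
print(bin(stack.pop()))   # 0b1: then machine 0 resumes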


12. Conclusion

We have found that a number of techniques and algorithms developed for software compilers for fixed architectures are useful in hardware compilation. We add the caveat that while programming-language compilation techniques for fixed architectures can often give us a good start on many problems in hardware compilation, it is important to remember that language-to-hardware compilations are harder precisely because the hardware architecture is not fixed. A good software compiler must optimize its code for the target of its code. A good hardware compiler must be able to optimize its code for a target architecture, but must also explore resource tradeoffs across a potentially wide range of such targets. Nevertheless, we hope that hardware and programming-language compilation techniques become even more closely related as we discover more general ways to compile software descriptions into custom hardware. For the moment we continue to experiment with and extend our hardware compiler technology to make it more general, while still retaining its ability to create industrial-quality designs.

It is now evident that subroutines, and more so procedures, introduce a significant degree of complexity into a circuit. It is indeed highly questionable whether it is worthwhile considering their implementation in hardware. It comes as no surprise that state machines, implementing assignments, conditional and repetitive statements, but not subroutines, play such a dominant role in sequential circuit design. The principal concern in this area is to make optimal use of the implemented facilities. This is achieved if most of the subcircuits yield values contributing to the process most of the time. The simplest way to achieve this goal is to introduce as few sequential steps as possible. It is the underlying principle in computer architectures, where (almost) the same steps are performed for each instruction. The steps are typically the subcycles of an instruction interpretation. Hardware acts as what is known in software as an interpretive system.

The strength of hardware is the possibility of concurrent operation of subcircuits. This is also a topic in software design. But we must be aware of the fact that concurrency is only possible by having concurrently operating circuits to support this concept. Much work on parallel computing in software actually ends up implementing quasi-concurrency, that is, pretending concurrent execution and, in fact, hiding the underlying sequentiality. This leads us to contend that any scheme of direct hardware compilation may well omit the concept of subroutines, but must include the facility of specifying concurrent, parallel statements. Such a hardware programming language may indeed be the best way to let the programmer specify parallel statements. We call this fine-grained parallelism. Coarse-grained concurrency may well be left to conventional programming languages, where parallel processes interact infrequently, and where they are generated and deleted at arbitrary but distant intervals. This claim is amply supported by the fact that fine-grained parallelism has mostly been left to be introduced by compilers, under the hood, so to say. The reason for this is that compilers may be tuned to particular architectures, and they can therefore make use of their target computer's specific characteristics. In this light, the consideration of a common language for hardware and software specification has a certain merit. It may reveal the inherent difference in the designers' goals.

As C. Thacker expressed it succinctly: "Programming (software) is about finding the best sequential algorithm to solve a problem, and implementing it efficiently. The hardware designer, on the other hand, tries to bring as much parallelism to bear on the problem as possible, in order to improve performance." In other words, a good circuit is one where most gates contribute to the result in every clock cycle. Exploiting parallelism is not an optional luxury, but a necessity. We end by repeating that hardware compilation has gained interest in practice primarily because of the recent advent of large-scale programmable devices.


These can be configured on the fly, and hence be used to directly represent circuits generated by a hardware compiler. It is therefore quite conceivable that parts of a program are compiled into instruction sequences for a conventional processor, and other parts into circuits to be loaded onto programmable gate arrays. Although specified in the same language, the engineer will observe different criteria for good design in the two areas.


13. References

1. A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333-340, June 1975.
2. A. Aho and M. Ganapathi. Efficient tree pattern matching: an aid to code generation. In Conference Record of the Twelfth Annual ACM Symposium on Principles of Programming Languages, pages 334-340, January 1985.
3. A. Aho and S. C. Johnson. Code generation for a one-register machine. Journal of the ACM, 23(3):502-510, 1976.
4. A. Aho and S. C. Johnson. Optimal code generation for expression trees. Journal of the ACM, 23(3):488-501, 1976.
5. The Yorktown Silicon Compiler. In Proceedings, International Conference on Circuits and Systems, pages 391-394, IEEE Circuits and Systems Society, June 1985.
6. R. Brent, D. Kuck, and K. Maruyama. The parallel evaluation of arithmetic expressions without division. IEEE Transactions on Computers, pages 532-534, May.
7. J. Darringer, W. Joyner, C. L. Berman, and L. Trevillyan. Logic synthesis through local transformations. IBM Journal of Research and Development, 25(4):272-280, 1981.
8. H. De Man, J. Rabaey, P. Six, and L. Claesen. Cathedral-II: a silicon compiler for digital signal processing. IEEE Design & Test, 3(6):13-25, 1984.
9. I. Page. Constructing hardware-software systems from a single description. Journal of VLSI Signal Processing, 12:87-107, 1996.
10. N. Wirth. Digital Circuit Design. Springer-Verlag, Heidelberg, 1995.
11. www.Lola.ethz.ch
12. Y. N. Patt, et al. One Billion Transistors.
