
Helvio P. Peixoto and Margarida F. Jacome, Algorithm and Architecture-level Design Space Exploration Using Hierarchical Data Flows, 11th International Conference on Application-specific Systems, Architectures and Processors, ASAP-97, Zurich, 1997

Algorithm and Architecture-level Design Space Exploration Using Hierarchical Data Flows
Helvio P. Peixoto and Margarida F. Jacome
ECE Dept., The University of Texas at Austin, Austin, TX 78712
{peixoto|jacome}@ece.utexas.edu

Abstract

Incorporating algorithm and architecture level design space exploration in the early phases of the design process can have a dramatic impact on the area, speed, and power consumption of the resulting systems. This paper proposes a framework for supporting system-level design space exploration and discusses the three fundamental issues involved in effectively supporting such an early design space exploration: definition of an adequate level of abstraction; definition of good fidelity system-level metrics; and definition of mechanisms for automating the exploration process. The first issue, the definition of an adequate level of abstraction, is then addressed in detail. Specifically, an algorithm-level model, an architecture-level model, and a set of operations on these models are proposed, aiming at efficiently supporting an early, aggressive system-level design space exploration. A discussion on work in progress in the other two topics, metrics and automation, concludes the paper.

1 Introduction

The ability to perform a thorough algorithm and architecture level design space exploration in the early phases of the design process can have a dramatic impact (orders of magnitude!) on the area, speed, and power consumption of a system.[1][2] Gains in performance achieved by parallelizing specific code segments, while taking into consideration the communication overhead introduced by such parallelism, are among the area-speed trade-offs that can be explored at the algorithmic level. Taking advantage of different execution speeds of hardware and software, and/or pipelining computations, are among the strategies that can be pursued at the architectural level in order to find solutions that can cost-effectively meet the system's constraints.

For the growing portable consumer electronics market, power consumption has also become a critical design concern.[1][2] As with area and speed, much can be done at the algorithmic and architectural levels to minimize the system's average power consumption, as discussed in [1]-[4]. For instance, in a typical signal processing application, the ratio of the critical path to the sampling rate indicates the potential for power reduction by lowering the voltage while still meeting the throughput requirements. Trade-offs between area and power can be achieved by using parallelism for voltage scaling. By identifying spatially local sub-structures, i.e., isolated, strongly connected clusters of operations within an algorithm, one can do partitioning so as to minimize global buses.

Unfortunately, as the system size increases, so does the design space. As a consequence, algorithm and architecture-level design space exploration has been, so far, quite limited in practice.[5][12] This paper describes a model that supports efficient algorithm and architecture level design space exploration of complex systems subject to timing, area, and power constraints.
The goal of this early exploration is to identify, in a reasonable amount of time, one or more regions of the design space that potentially hold a cost-effective solution, given an algorithmic level functional description and a set of constraints and optimization goals. A design space region is defined by a broad architectural specification, which consists of: (1) a partition of the system's algorithmic description into a set of algorithmic segments; and (2) the definition of a set of hardware and/or software architectural components and interfaces, for implementing these algorithmic segments. During this early, aggressive design space exploration, only system level design issues are considered -- the resulting broad architectural specification(s) should thus be seen as inputs to the next steps of the top-down design process.

Figure 1 summarizes our approach to early, system-level design space exploration. The search for a cost-effective system-level architecture is an iterative process where algorithm and architecture level design space exploration are synergistically combined, as briefly discussed in the sequel. Two types of models are maintained: (1) algorithmic models that represent the system behavior; and (2) architectural models that represent possible architectures for implementing the system behavior.

At the algorithmic level, timing, area, and power budgets are assigned to the various computations defined in the model. Such assignments are based on: (1) the constraints and optimization goals stated in the system specification; and (2) actual measures on the relative complexity of these computations (considering a particular implementation). Behavior preserving algorithmic transformations, such as parallelization of code segments or loop unrolling, are then applied during the algorithmic exploration, and their impact on the budgets is assessed.

At the architectural level, different partitions and mappings of the system behavior into architectural components (hardware or software) are explored. A library of components, properly characterized for this purpose, defines the alternatives that are viable for implementing the particular system. Rough estimates on performance, area, and power of a particular architectural solution are then compared with the algorithmic budgets, in order to assess the solution's potential for meeting the specified constraints.
Figure 1 System-level design space exploration (block diagram relating the System Specification, Budgets, Algorithmic Model, Library of Algorithms, Algorithm-Level and Architecture-Level Design Space Exploration, Estimates, Architectural Model, Library of Components, Promising Design Space Regions, and the resulting Architectural Specification)

The remainder of this paper is organized as follows. Sections 2 and 3 describe the proposed algorithmic-level and architecture-level models, respectively, and identify the set of basic operations that should be implemented on these models in order to effectively support system-level design space exploration. Section 4 discusses the importance of deriving good fidelity system-level metrics for individual and comparative assessment of competing architectural solutions, in order to adequately guide the design space exploration. Some conclusions and discussion of work in progress are given in Section 5.

2 Algorithm Level of Abstraction

2.1 Algorithmic Representation

In our approach, the specification of a system comprises: (1) an algorithm-level behavioral description (written in C or VHDL), annotated with probabilistic bounds on data dependent loop indices, and with transition probabilities on alternative control paths (defined by conditional constructs); (2) a set of timing constraints; and (3) a set of optimization goals on area and/or power. The algorithmic model is constructed by compiling the input specification into a set of hierarchical data flow graphs,[13]-[16] where nodes represent computations and edges represent data dependencies.

The resulting model is hierarchical since basic blocks, defined by alternative (flow of) control paths, are first represented as single complex nodes. Such complex nodes are then recursively decomposed (i.e., expanded) into sub-graphs, containing less complex nodes, until the lowest level of granularity (i.e., the single operation level) is reached. These last nodes are designated atomic nodes. A complex node is thus a contraction of nodes of lower complexity (i.e., smaller granularity). Complex nodes contain key information about their hierarchically dependent sub-graphs, such as their critical path.

Edges contain information on the data (tokens) produced and consumed by the various nodes in the graph, including the specification of their type and size. Such information is essential for determining the specific physical resources required by each atomic node. (For example, we may have some nodes performing floating-point multiplications while other nodes perform just integer multiplications.)

The graphs G(V, E) in the model are polar, i.e., have the beginning and the end of their execution synchronized by a source and a sink node, respectively.[15] Sink and source nodes have no computational cost; they are used simply to assert that graphs contracted into complex nodes represent basic blocks defined by alternative control flow paths. In our proposed algorithmic model, hierarchy is represented by containment -- in other words, each graph may be considered at its higher level of abstraction or may have some of its complex nodes expanded into their corresponding subgraphs.
Thereby, the level of granularity of the model (and thus its size and complexity) can be dynamically adjusted, according to the needs of the design space exploration.

Type of Node      Granularity   Firing Rule
Source            Atomic        input-conjunctive/output-conjunctive
Sink              Atomic        input-conjunctive/output-conjunctive
Merge             Atomic        input-disjunctive/output-conjunctive
Basic Operation   Atomic        input-conjunctive/output-conjunctive
Basic Block       Complex       input-conjunctive/output-conjunctive
Select            Complex       input-conjunctive/output-disjunctive

Table 1 Classification of Nodes Supported in the Algorithmic Model

Table 1 summarizes the various types of nodes supported in the algorithmic model and the firing rules defined for these node types.[13][14] Three types of firing rules are defined: input-conjunctive/output-conjunctive, input-conjunctive/output-disjunctive, and input-disjunctive/output-conjunctive. Nodes with input-conjunctive/output-conjunctive firing rules become enabled for execution only when all incoming edges have data available (i.e., have tokens present), and after executing, they produce data (tokens) in all output edges. The source, sink, basic operation, and basic block nodes shown in Table 1 have input-conjunctive/output-conjunctive firing rules. Nodes with input-conjunctive/output-disjunctive firing rules, the second type, become enabled for execution only when all incoming edges have data available (i.e., tokens) but, unlike the previous case, they produce data only in a single output edge. Nodes with execution semantics based on this firing rule are denoted select since, as a result of their execution, a specific (alternative) control flow path is selected. Finally, nodes with input-disjunctive/output-conjunctive firing rules become enabled for execution as soon as data is available in one of their input arcs, and then produce data in all output edges, as in the first case. Nodes with execution semantics based on this firing rule are denoted merge since they mark the end of a set of alternative control flow paths defined by a previous select node. Merge and select nodes are thus used, in combination, to model the loops and conditionals that may be present in a behavioral description, as illustrated in the example that follows.

Consider the small segment of the QRS behavioral description [19] and its corresponding algorithmic model, shown in Figure 2 (footnote 1). To facilitate the discussion, the hierarchy of basic blocks defined in procedure p is identified by the dotted boxes shown in the figure. Procedure p is thus composed of a single assignment instruction followed by a basic block, the latter abstracting the entire while statement. The corresponding top level graph, denoted G in the figure, is thus composed of four nodes: a basic operation node (for the assignment), a basic block node (abstracting the while statement) and the required pair of sink and source nodes. The only complex node in G (the while basic block node) is shown expanded in graph G1 -- note that the condition and the body of the while statement are represented by the select and the basic block nodes shown in G1. Their expansion results in graphs G2 and G3, respectively. Finally, graph G4 represents the if statement marked by the innermost dotted box in Figure 2. The select node, representing the if clause, is shown expanded in graph G5, and the basic block nodes defined by the then and the else parts of the statement are shown expanded in G6 and G7, respectively.

As mentioned above, the algorithmic description provided in the system specification is required to be annotated with profiling information providing: (1) upper bounds and mean values for data dependent loop indices; and (2) relative frequencies of execution for basic blocks contained in alternative control flow paths (defined by if-then-else or switch conditional statements).
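The three firing rules of Table 1 can be illustrated with a small executable sketch. The code below is only an illustration under stated assumptions: the `Node` class, the `is_enabled` and `fire` helpers, and the edge names are hypothetical and are not taken from the paper; tokens are modeled simply as the set of edges that currently hold data.

```python
from dataclasses import dataclass, field

# Firing-rule names follow Table 1; everything else here is illustrative.
CONJ_CONJ = "input-conjunctive/output-conjunctive"   # source, sink, basic operation, basic block
CONJ_DISJ = "input-conjunctive/output-disjunctive"   # select
DISJ_CONJ = "input-disjunctive/output-conjunctive"   # merge


@dataclass
class Node:
    name: str
    rule: str
    inputs: list = field(default_factory=list)    # names of incoming edges
    outputs: list = field(default_factory=list)   # names of outgoing edges


def is_enabled(node, tokens):
    """Return True if the node may fire, given the set of edges holding tokens."""
    present = [edge in tokens for edge in node.inputs]
    if node.rule == DISJ_CONJ:
        return any(present)       # merge: a token on one input is enough
    return all(present)           # conjunctive input: every input needs a token


def fire(node, tokens, chosen=None):
    """Consume the input tokens and return the token set after firing."""
    if not is_enabled(node, tokens):
        raise ValueError(f"{node.name} is not enabled")
    remaining = set(tokens) - set(node.inputs)
    if node.rule == CONJ_DISJ:
        # select: exactly one output edge receives a token
        if chosen not in node.outputs:
            raise ValueError("a select node must pick one of its output edges")
        return remaining | {chosen}
    return remaining | set(node.outputs)


# A select/merge pair as used to model a conditional (edge names are made up).
select = Node("if", CONJ_DISJ, inputs=["c"], outputs=["u", "v"])
merge = Node("end-if", DISJ_CONJ, inputs=["p", "q"], outputs=["z"])

tokens = fire(select, {"c"}, chosen="u")      # only edge u receives a token
print(tokens, is_enabled(merge, {"p"}))
```

In this sketch the merge node becomes enabled with a token on either input, which is what allows it to close whichever alternative path the preceding select node chose.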
This profiling information is important when, in the presence of data dependent constructs, one still wants to be able to reason about (average and worst case) execution delays and power consumption (footnote 2). Accordingly, based on the profiling information referred to above, a function is defined which associates a transition probability with every edge sourcing from a select node or sinking into a merge node -- such a function indicates the probability that the particular node will consume/produce tokens from/to the specific edge. Obviously, the summation of the transition probabilities of all arcs sourcing from a select node (or sinking into a merge node) has to be one. The function is used to estimate the average and the worst case execution delays and power consumption for all nodes in the model hierarchy.

    procedure p() {
        i := 1;
        while (i <= indx) {
            ft := ftm1 + (ecg1 - ecgm1) - ((ecg1 - ecgm1)/256);
            ysi := ft - ftm2;
            if (ysi > ymax)
                ymax := ysi;
            i++;
        }
    }

Figure 2 Example: Algorithmic description and corresponding Algorithmic Model (graphs G and G1 to G7)

Footnote 1: QRS is an algorithm used for monitoring heart-rate in ECG applications.

In the example that follows, we illustrate how average and worst case execution delays are estimated. Consider the graph G4 shown in Figure 2. Assume that the execution delay of all atomic nodes is 1 reference operation (footnote 3), except for the sink, source, merge, and no-op nodes, which are assumed to have zero execution delay. If the transition probabilities for edges u and v are, for example, 0.8 and 0.2, respectively, then the average execution delay for G4 (and thus the average execution delay of the corresponding contracted node, shown in G3) is 1.8 reference operations. The critical path (or worst case execution delay) is defined, instead, as the longest path from source to sink. Since G4 contains two alternative control flow paths, its critical path will thus be the longest path from source to sink either through edge u or through edge v. (Considering the execution delays provided in this example, the actual critical path of G4 is 2 reference operations, and is defined through edge u.) In the case of graphs directly expanding loop statements, such as G1, transition probabilities are also defined (for the corresponding merge and select nodes) based on the upper bounds and on the mean values provided for the corresponding loop indices -- these transition probabilities are then similarly used to determine the worst case and the average execution delays for the particular loop nodes.
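The G4 example can be reproduced numerically. The following is a minimal sketch, assuming that G4 consists of a select node followed by either an assignment (through edge u) or a no-op (through edge v) and then a merge node; the function names and the dictionary layout are illustrative and not part of the paper.

```python
def path_delay(path, delay):
    """Sum the execution delays (in reference operations) of the nodes on a path."""
    return sum(delay[node] for node in path)


def average_delay(prefix, branches, suffix, delay):
    """Delay of the common nodes plus the probability-weighted branch delays."""
    if abs(sum(p for p, _ in branches) - 1.0) > 1e-9:
        raise ValueError("transition probabilities must sum to one")
    common = path_delay(prefix, delay) + path_delay(suffix, delay)
    return common + sum(p * path_delay(nodes, delay) for p, nodes in branches)


def worst_case_delay(prefix, branches, suffix, delay):
    """Critical path: common nodes plus the slowest alternative branch."""
    common = path_delay(prefix, delay) + path_delay(suffix, delay)
    return common + max(path_delay(nodes, delay) for _, nodes in branches)


# Delays from the example: 1 reference operation for atomic nodes,
# 0 for source, sink, merge, and no-op nodes.
delay = {"source": 0, "select": 1, "assign": 1, "noop": 0, "merge": 0, "sink": 0}

# Edge u (probability 0.8) leads to the assignment, edge v (0.2) to the no-op.
branches = [(0.8, ["assign"]), (0.2, ["noop"])]

avg = average_delay(["source", "select"], branches, ["merge", "sink"], delay)
worst = worst_case_delay(["source", "select"], branches, ["merge", "sink"], delay)
print(avg, worst)   # 1.8 and 2 reference operations, as in the text
```

The same two functions can be applied bottom-up: the values computed for G4 become the average and worst case delays of its contracted node in G3.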
During architecture level design space exploration, the specific resources that will execute each atomic node defined in the algorithmic model are selected from the components library. This design decision provides basic information on execution delay and power consumption for all of the atomic nodes in the model. For example, a specific multiplier may be defined as having an execution delay of 3 reference operations (normalized time measure) and as consuming 4 power units per reference operation (normalized power measure), while another resource, say, an adder, may only take 1 reference operation and consume a single power unit per reference operation. (Note that, for atomic nodes, the worst case and the average for delay and also for power consumption are assumed to be the same (footnote 4).) Starting from those normalized measures for the atomic nodes, the hierarchy is then traversed, bottom-up, and the worst case and average execution delay and power consumption for all complex nodes in the model are computed, as discussed in the previous example. This permits a quick identification of (potential) constraint violations at any level of the hierarchy, by comparing these values with individual node budgets, derived from the system specification, as discussed in the sequel.

The area and power constraints (given in the system specification) are translated into delay and power budgets for the various nodes in the algorithmic model. These budgets are defined top-down, considering the relative time complexity and the relative power requirements of the various nodes in the model, i.e., considering the execution delay and the power consumption values computed for these nodes during the previous step (recall that all measures were normalized to a reference operation). Hierarchy is explored during the budgeting process, since the budgets for a complex node are distributed through the nodes contained in its expansion graph. Budgets thus establish upper bounds on worst case execution delays, average power consumption, etc., for each node in the hierarchy. Feasibility is assessed by comparing the node budgets with the actual values required by the current node implementation, which can be determined by resolving the reference operation. Nodes that violate constraints at any level of the hierarchy can thus be quickly identified, and then locally optimized.
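The top-down budgeting and the feasibility check described above can be sketched as follows. The paper does not give the distribution rule, so this sketch assumes, purely for illustration, that a complex node's delay budget is split among sequentially executed child nodes in proportion to their normalized delays. The function names, the 40 ns budget, and the "actual" delays are hypothetical values.

```python
def distribute_budget(parent_budget, child_delays):
    """Split a parent's delay budget among its children, proportionally to
    their normalized delays (an assumed rule, not the paper's)."""
    total = sum(child_delays.values())
    if total == 0:
        return {name: 0.0 for name in child_delays}
    return {name: parent_budget * d / total for name, d in child_delays.items()}


def violations(budgets, actual):
    """Names of the nodes whose actual value exceeds their budget."""
    return sorted(name for name in budgets if actual[name] > budgets[name])


# Normalized delays from the text: multiplier 3, adder 1 reference operations.
child_delays = {"mult": 3, "add": 1}

# Hypothetical 40 ns budget for the complex node containing both operations.
budgets = distribute_budget(40.0, child_delays)

# Hypothetical delays (in ns) of the currently selected implementations.
actual = {"mult": 28.0, "add": 12.0}

print(budgets)                      # mult receives 30.0 ns, add receives 10.0 ns
print(violations(budgets, actual))  # only the adder exceeds its budget
```

Because the check is local to each node, a violating node can be reported without expanding the rest of the hierarchy.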

Footnote 2: Analysis of probabilistic constraints can also be performed using the proposed algorithmic model (in addition to the worst case analysis to be discussed next). For a detailed discussion on algorithms for performing probabilistic constraint analysis using the algorithmic model discussed in this paper, see [21].

Footnote 3: The reference operation is a normalized measure used in the components library.

Footnote 4: At system-level, it would be prohibitively expensive to consider otherwise.

2.2 Algorithmic Model: Basic Operations

Transformations at the algorithmic level are modifications to the graph structure that change the number, type and/or precedence between nodes within the graph, while preserving its overall input/output behavior. Examples are algorithm substitution, algorithm parallelization, loop unrolling, common sub-expression elimination/replication, constant propagation, operation strength reduction, function in-lining, etc. Due to the hierarchical nature of the algorithmic model, it is possible to apply these transformations to functions of any size and complexity, or to any other types of basic blocks, or even to individual operations, as in the case of operation strength reduction. For example, at the functional level, one may try a completely different implementation of a given algorithm or, instead, just try to parallelize a given block of operations, in order to reduce the critical path. In the example given in Figure 2, one could, for example, easily substitute the entire graph G1, i.e., the node that contracts it, with another node, derived by applying loop unrolling and sub-expression elimination to the original code segment. Note that graph G would remain the same, as well as all of the other graphs located above G1 in the hierarchy.
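As an illustration of behavior preserving transformations, the sketch below applies common sub-expression elimination and operation strength reduction to the loop body of Figure 2 and checks that the input/output behavior is unchanged. It assumes integer data and floor division, which the paper does not state; the function names are illustrative.

```python
def body_original(ftm1, ftm2, ecg1, ecgm1):
    """Loop body of Figure 2, assuming integer arithmetic with floor division."""
    ft = ftm1 + (ecg1 - ecgm1) - ((ecg1 - ecgm1) // 256)
    ysi = ft - ftm2
    return ft, ysi


def body_transformed(ftm1, ftm2, ecg1, ecgm1):
    """Same computation after two transformations."""
    diff = ecg1 - ecgm1              # common sub-expression computed once
    ft = ftm1 + diff - (diff >> 8)   # division by 256 replaced by a shift
    ysi = ft - ftm2
    return ft, ysi


# Check input/output equivalence over a range of samples, including negatives.
samples = range(-600, 601, 37)
equivalent = all(
    body_original(10, 3, a, b) == body_transformed(10, 3, a, b)
    for a in samples
    for b in samples
)
print(equivalent)
```

The equivalence holds here because Python's `//` and `>>` both round toward negative infinity for integers. In C, integer division truncates toward zero, so the same substitution would only be safe for non-negative operands.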

3 Architecture Level of Abstraction

3.1 Architectural Representation and Component Libraries

As mentioned above, the design decisions made during architecture-level exploration are compiled in a broad architectural specification, which comprises: (1) a first-cut partitioning of the algorithmic model into a set of algorithmic segments, each of which is to be implemented by an individual architectural component; and (2) a set of fundamental design decisions on the implementation of these architectural components and their interfaces. Architectural components are single ICs -- they thus define a space for resource sharing and they also define an independent clock. Architectural components can be implemented in hardware, meaning that the corresponding architectural component will be an ASIC or an FPGA, or can alternatively be implemented in software, meaning that the corresponding architectural component will be realized by an off-the-shelf micro-controller, DSP, or general purpose processor. The decision to implement a given system component in hardware or in software is made by selecting a target component library that properly describes the required implementation technology. So far we have mostly concentrated on defining hardware libraries tailored to ASIC designs. In these libraries, components, such as multipliers and adders, are characterized in terms of their delay, area, power consumption, voltage supply, etc. Unless otherwise stated, the discussion from this point on focuses on hardware architectural components only.

The basic elements of the proposed architectural model are: architectural components, modules, and physical resources. In order to efficiently support architectural exploration, three additional elements were added to the algorithmic model: algorithmic segments, clusters of nodes, and pipeline stages. In this section we focus on precisely defining those various model elements and in Section 4 we discuss their use during the design space exploration.
As referred to above, the original algorithmic model is partitioned into a set of algorithmic segments -- an algorithmic segment thus comprises one or more (sub)graphs belonging to an algorithmic model (footnote 5). Figure 3(a), for example, shows two algorithmic segments defined on a simple algorithmic model that contains only one top-level graph with five basic block nodes. Architectural components are ICs that implement algorithmic segments using a given target library, i.e., a given universe of components (see Figure 4). The implementation of an algorithmic segment (by an architectural component) is defined as the implementation of the various clusters of nodes defined for the particular segment, as discussed next.

A cluster of nodes is a set of one or more connected atomic nodes belonging to a graph within an algorithmic segment -- Figure 3(b), for example, shows three clusters defined within a particular graph. In order to completely define an architectural component, every atomic node within an algorithmic segment must thus belong to one and only one cluster. Clusters of nodes are implemented by modules -- a module consists of one or more physical resources (specifically, functional units), optimally interconnected for executing a specific cluster of nodes (footnote 6). If no scheduling conflicts exist, modules can be shared among isomorphic clusters. For example, Module 1 in Figure 3(b) implements a multiplication and is used by only one single cluster, while Module 2 implements an addition followed by a subtraction and is shared by two isomorphic clusters.

Figure 3 Example of (a) Algorithmic Segments; (b) Clusters and Modules; and (c) Pipeline Stages

Footnote 5: Some transformations may need to be performed in nodes at the interface of a partition but, for simplicity, we omit them here.

We conclude the discussion by introducing the notion of pipeline stage. As mentioned in Section 2, the graphs that comprise an algorithmic description are polar, i.e., have a source and a sink node that synchronize the beginning and the end of the execution of the graph. By default, each of such graphs is fully enclosed in what we call a pipeline stage -- specifically, this means that the source node can only be fired after the sink node has completed its execution. This default scheduling policy can be modified, though, by increasing the number of pipeline stages defined within a graph. Figure 3(c), for example, shows a graph where two pipeline stages were defined -- the nodes enclosed in each of the two stages shown in the figure can now execute concurrently with respect to each other.

3.2 Architectural Model: Basic Operations

Operations at the architectural level are aimed at changing the structure or type of architectural components, clusters, modules and pipeline stages. These basic operations include: (1) splitting and merging of algorithmic segments and/or modification of the target component library for a segment; (2) migration of node clusters across algorithmic segments, checking for invalid conditions due to resource sharing; (3) merging and splitting of clusters, checking for invalid conditions, and automatically generating new modules, when applicable; (4) definition of pipeline stages within a graph; (5) algorithm substitution (footnote 7); and (6) substitution of individual physical resources within modules.
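Two of the checks implied by these operations can be sketched in a few lines: that every atomic node belongs to one and only one cluster, and that clusters which might share a module are grouped together. The grouping below uses a simple fingerprint (operation types plus internal edges), which is a necessary but not sufficient condition for graph isomorphism; it is an assumed simplification, scheduling conflicts are not checked, and all names are illustrative.

```python
from collections import Counter, defaultdict


def check_partition(atomic_nodes, clusters):
    """True if every atomic node belongs to one and only one cluster."""
    counts = Counter(n for members in clusters.values() for n in members)
    missing = set(atomic_nodes) - set(counts)
    shared = {n for n, c in counts.items() if c > 1}
    unknown = set(counts) - set(atomic_nodes)
    return not (missing or shared or unknown)


def signature(members, op, edges):
    """Order-independent fingerprint: operation types and internal edges."""
    ops = tuple(sorted(op[n] for n in members))
    inner = tuple(sorted(
        (op[a], op[b]) for a, b in edges if a in members and b in members
    ))
    return ops, inner


def sharing_candidates(clusters, op, edges):
    """Group clusters with identical fingerprints (candidates for one module)."""
    groups = defaultdict(list)
    for name, members in clusters.items():
        groups[signature(members, op, edges)].append(name)
    return sorted(sorted(group) for group in groups.values())


# A graph in the spirit of Figure 3(b): one multiplication and two
# "addition followed by subtraction" clusters (node names are made up).
op = {"n1": "*", "n2": "+", "n3": "-", "n4": "+", "n5": "-"}
edges = [("n1", "n2"), ("n2", "n3"), ("n4", "n5")]
clusters = {"c1": {"n1"}, "c2": {"n2", "n3"}, "c3": {"n4", "n5"}}

print(check_partition(op.keys(), clusters))
print(sharing_candidates(clusters, op, edges))
```

A production implementation would follow the fingerprint match with a proper isomorphism test and a scheduling check before letting two clusters share a module.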

4 Metrics and Design Space Exploration

Sections 2 and 3 discussed the level of abstraction at which the proposed system level design space exploration is to be conducted -- properly defining this level of abstraction is of paramount importance in order to enable an aggressive, early exploration of design spaces of such huge dimensions. A second, equally fundamental issue in supporting early design space exploration concerns the ability to evaluate the absolute and relative goodness of competing solutions, i.e., broad architectural specifications, with respect to the area, speed and power constraints defined in the specification. One of the fundamental difficulties involved in accurately estimating performance, area, and power consumption at any level of design abstraction other than the physical level is the ability to precisely account for the physical resources that are abstracted, i.e., not yet defined at the particular level.

Footnote 6: As will be discussed in Section 4, clusters and modules are used to quantify the potential for implementation overhead minimization.

Footnote 7: Blocks of statements can be substituted by different, functionally equivalent blocks. This allows us, for instance, to experiment with different loop factorizations.

As suggested in [3][4], resources can be broadly classified into two categories: algorithm inherent and implementation overhead. Algorithm inherent resources are the functional units (multipliers, adders, etc.) needed to implement the operations defined in the system's algorithmic description -- such resources are tangible and can be reasoned upon at the system-level of abstraction. Implementation overhead resources account for all of the other resources needed to correctly execute such operations, including control logic, steering logic (e.g., buses and MUXES), registers, and wiring. Implementation overhead resources are not yet defined at the system-level of abstraction. Unfortunately, implementation overhead can significantly contribute to the delay, power consumption and area of a design. Thereby, ignoring implementation overhead, by performing trade-offs only in terms of functional units, may lead to sub-optimal design solutions. For example, at the architecture level of abstraction, one may attempt to trade off speed for area by increasing the level of resource sharing in a given design. However, due to the corresponding increase in control complexity and to a possible increase in number and/or size of buses, etc., the overall area of the design may actually increase!
Similarly, at the algorithm level of abstraction, one may attempt to reduce power consumption using, for example, operation strength reduction. A recent case study illustrates the results of applying one such transformation -- multiplications substituted by adds and shifts -- to fourteen Discrete Cosine Transform (DCT) algorithms.[17] Interestingly, in all cases the overall power consumption actually increased as a result of the transformation -- in half of the cases it increased by more than 40%! This was so because the power consumption in buses, registers and control circuitry increased dramatically, in some cases by more than 400%, totally overshadowing the power savings in the functional units.

The central issue thus becomes how to account for the relative impact of implementation overhead on the performance, area, and power consumption of competing alternative solutions without having hard information about these resources. In order to tackle this problem, we are working on an approach that combines:

(1) A set of metrics designed to rank competing solutions based on their potential to take advantage of specific properties that correlate to minimal implementation overhead. Examples of such properties are given below.

(2) A methodology for statistical characterization of implementation overhead area, delay, and power consumption, taking into consideration the previous measures (i.e., considering the degree to which the properties referred to above are taken advantage of by a particular solution).

This second set of metrics, together with metrics that account for area, delay, and power consumption in algorithm inherent resources (footnote 8), are used for feasibility analysis, i.e., for performing a rough assessment of the ability of a solution to meet its area, delay, and power budgets. The properties referred to in point (1) constitute the essential mechanism in our approach to identify good quality system level solutions, i.e., broad architectural specifications that have potential to minimize implementation overhead. In what follows, we briefly discuss three such properties (footnote 9):

Footnote 8: Note that accounting for area, power consumption and delay in algorithm inherent resources does not pose major difficulties -- all of the necessary information is provided in the corresponding target component library, as briefly discussed in Section 3.

Footnote 9: The metrics to be discussed below and the precise definition of the statistical parameters to be used in (2) are still work in progress.

Locality of Computations

A group of computations within an algorithm is said to have a high degree of locality if the algorithm level representation of those computations corresponds to an isolated, strongly connected (sub)graph.[18] More informally, the larger the volume of data being transferred among the nodes belonging to the cluster, in comparison with the volume of data entering and exiting the cluster, the higher the degree of locality (footnote 10). By considering such strongly connected sub-graphs as single clusters of nodes, and thus implementing them using modules optimized for performing the specific set of computations and data transfers, it is possible to minimize the implementation overhead needed to execute the corresponding part of the algorithm. Namely, the required physical resources will be located in the proximity of each other, thus minimizing the length of interconnect and/or buses. So, if an algorithm exhibits a good degree of locality of computations, i.e., has a number of isolated, strongly connected clusters of computation, such clusters define a clear way of organizing the functional units into modules so as to minimize the number of global buses, and possibly other implementation overhead resources, such as control and/or steering logic.

Regularity

If a given algorithm, or algorithmic segment, exhibits a high degree of regularity, i.e., requires the repeated computation of certain patterns of operations, it is possible to take advantage of such regularity for minimizing implementation overhead. Consider for example the Fast Fourier Transform (FFT) algorithm shown in Figure 4.[20] For the sake of the discussion, assume that the FFT is to be computed in two clock cycles, one for each stage -- two multipliers, two adders, and two subtracters are thus needed. Observe further that, as shown in Figure 4, the clusters 1 and 3 are isomorphic sub-graphs, and the clusters 2 and 4 are also isomorphic sub-graphs. It is thus possible to define two modules (each containing one multiplier, one adder, and one subtracter), and then use a single module for implementing all clusters within a given isomorphic group, as illustrated in Figure 4. Modules can thus be optimized for executing not only one, but a number of isomorphic clusters, thus leading to solutions that minimize the cost of resource sharing in terms of implementation overhead. We measure the degree of regularity of an algorithm by the number of isomorphic sub-graphs that can be identified in its corresponding algorithmic model.
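The informal definition of locality given earlier in this section can be turned into a number in several ways. Since the paper states that its metrics are still work in progress, the sketch below uses an assumed formula, not the authors': the fraction of the data volume touching a cluster that stays inside it. Edge volumes are taken from the edge annotations (type and size of tokens); the function name and the sample graph are hypothetical.

```python
def locality(cluster, edges):
    """Fraction of the data volume touching the cluster that stays inside it.

    edges: iterable of (source, destination, volume) triples, where volume is
    the amount of data carried by the edge (for example, in bits).
    Returns a value between 0.0 (no internal traffic) and 1.0 (fully isolated).
    """
    internal = 0
    boundary = 0
    for src, dst, volume in edges:
        ends_inside = (src in cluster) + (dst in cluster)
        if ends_inside == 2:
            internal += volume       # transferred among nodes of the cluster
        elif ends_inside == 1:
            boundary += volume       # entering or exiting the cluster
    total = internal + boundary
    return internal / total if total else 0.0


# Hypothetical graph: a -> b -> c exchange 32-bit values internally and
# communicate with the rest of the graph through 16-bit edges.
edges = [
    ("in", "a", 16), ("a", "b", 32), ("b", "c", 32), ("c", "out", 16),
    ("x", "y", 8),   # unrelated edge, ignored for this cluster
]

print(locality({"a", "b", "c"}, edges))   # 64 of 96 units stay inside
print(locality({"a"}, edges))             # no internal traffic
```

Under this assumed measure, a candidate clustering with values close to 1.0 would be preferred, since little data has to travel over buses shared with other clusters.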
[Figure 4 graphic. Labels: stage 1, stage 2; clusters 1 through 4; Module 1, Module 2; Architecture Component; Component Libraries.]
Figure 4 Clusters, Modules and Architecture Components.

Sharing Distance

This metric considers the distance between clusters of nodes that share the same module. The idea is to favor solutions that maintain a certain degree of locality in their module sharing policy, thus minimizing the need for global buses. Note that since the distances referred to above are measured in time (i.e., in number of reference operations between two consecutive executions of the same module), this metric also favors solutions that concentrate the use of modules in specific time intervals, thus creating opportunities for clock gating.

Measures involving the properties referred to above should be carefully considered during design space exploration, as briefly discussed next. Locality of computations and regularity are intrinsic algorithmic properties -- they can thus be used to initially drive the algorithm-level design space exploration process. Taking maximum advantage of these properties may require an aggressive architecture-level design space exploration, though. For example, an obvious goal is to maintain strongly connected clusters of computation within the same architectural component. A less trivial task, though, is to attempt to define algorithmic segments (i.e., partitions) that maximize the degree of regularity within each component. Such a strategy can unquestionably lead to orders of magnitude savings in implementation overhead and clearly illustrates the benefits of an aggressive, early design space exploration.¹¹ In addition, within a given component, trade-offs between larger modules, optimizing large chunks of computations, vs. smaller modules, shared by a number of smaller isomorphic clusters, should be carefully considered.

Obviously, the timing, power, and area budgets for the several parts of the algorithmic description must also be considered during the design space exploration, through feasibility analysis. For example, if a given computationally expensive loop has stringent timing constraints, its parallelization may need to be considered, in an attempt to increase execution speed, even if locality will most probably be compromised by such a parallelization. Due to the complex interplay between all of the possible transformations and architectural alternatives that can be considered, stochastic optimization will be initially used for the design space exploration.

¹⁰ The volume of data can be determined using the information provided in the edges.
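The sharing distance metric described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's definition: the schedule, the cluster and module names, and the use of a mean over all gaps as the aggregate are hypothetical. Time is counted in reference operations, as in the text.

```python
from collections import defaultdict

# Hypothetical schedule: (start time in reference operations, cluster, module).
schedule = [
    (0, "cluster1", "module1"),
    (0, "cluster2", "module2"),
    (1, "cluster3", "module1"),
    (1, "cluster4", "module2"),
    (9, "cluster5", "module1"),
]

def sharing_distances(schedule):
    """Map each module to the gaps between its consecutive executions."""
    times = defaultdict(list)
    for start, _cluster, module in schedule:
        times[module].append(start)
    gaps = {}
    for module, starts in times.items():
        starts.sort()
        gaps[module] = [b - a for a, b in zip(starts, starts[1:])]
    return gaps

def mean_sharing_distance(schedule):
    """Assumed aggregate: average gap over all modules (0.0 if nothing is shared)."""
    all_gaps = [g for gaps in sharing_distances(schedule).values() for g in gaps]
    return sum(all_gaps) / len(all_gaps) if all_gaps else 0.0

print(sharing_distances(schedule))
print(mean_sharing_distance(schedule))
```

Small gaps indicate that a module's uses are concentrated in a short time interval. In the hypothetical schedule above, the long gap before cluster5 marks a period in which module1 is idle and could, in principle, be clock gated.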
We finish by contrasting the system-level exploration of architectural components discussed in this paper with the detailed design of fixed components performed during behavioral synthesis. In our approach, only candidate partitions that exhibit the highest (comparative) ranks in terms of maximizing overall degrees of locality, regularity, etc., are selected for further architectural exploration. The decisions of assigning modules to components during architectural exploration are also very different in nature from the allocation and binding of physical resources performed by a behavioral synthesis tool. Module allocations are done exclusively with the purpose of: (1) identifying heuristically good points within the selected promising design space region -- this is done by exploring local trade-offs in terms of locality, regularity, sharing distance, etc.; (2) quickly characterizing those alternative design points, i.e., estimating their area, performance, and power, based on the statistics referred to in the beginning of this section -- those results are then used to decide on the adequacy of the overall design space region with respect to the specification.

¹¹ The search for alternative sub-graph groupings is too daunting a task to be manually performed by a system-level designer.

Conclusions and Work in Progress

This paper proposes a framework, outlined in Figure 1, for supporting efficient algorithm and architecture level design space exploration of complex systems subject to timing, area, and power constraints. The fundamental issues involved in effectively supporting such an early design space exploration are: (1) defining the correct level of abstraction at which the exploration should be performed, i.e., identifying the truly system-level design issues; (2) being able to assess the quality of competing solutions, using good fidelity system-level metrics; and (3) being able to automatically explore the design space. Issues (2) and (3) were briefly discussed in Section 4. The main focus of this paper was on the first issue, though: precisely defining the abstraction level at which early system-level design space exploration should be undertaken. Such a level of abstraction, defined by the models discussed in Sections 2 and 3, is of paramount importance if one wants to support an aggressive and effective early exploration of the design space.

Sky, an environment for early design space exploration, is currently being prototyped in Java. Sky implements the algorithm-level and architecture-level models and operations discussed in Sections 2 and 3, and the (hardware) component libraries referred to in Section 3. The automatic computation and update of the timing/area/power budgets and the metrics are also being incorporated in Sky. A companion paper will soon be submitted, focusing mostly on the second issue identified above, i.e., on system-level metrics, and reporting some of our preliminary results.

Bibliography

[1] J. Rabaey and M. Pedram (eds.), Low Power Design Methodologies, Kluwer Academic Publishers, 1996.
[2] M. Pedram, Power Minimization in IC Design: Principles and Applications, In ACM Transactions on Design Automation of Electronic Systems, ACM Press, 1996.
[3] J. Rabaey, L. Guerra, and R. Mehra, Design Guidance in the Power Dimension, In Proceedings of the ICASSP 95, May 1995.
[4] R. Mehra and J. Rabaey, Behavioral Level Power Estimation and Exploration, In Proceedings of the First International Workshop on Low Power Design, ACM Press, 1994.
[5] G. De Micheli and M. Sami (eds.), Hardware/Software Codesign, Kluwer Academic Publishers, 1996.
[6] D. Gajski, F. Vahid, S. Narayan, and J. Gong, Specification and Design of Embedded Systems, PTR Prentice Hall, 1994.
[7] R. Gupta and G. De Micheli, Hardware-Software Cosynthesis for Digital Systems, IEEE Design & Test of Computers, v10(3), 1993.
[8] R. Gupta, Co-Synthesis of Hardware and Software for Digital Embedded Systems, Kluwer Academic Publishers, 1995.
[9] R. Ernst, J. Henkel, and T. Benner, Hardware-Software Cosynthesis for Microcontrollers, IEEE Design & Test of Computers, v10(3), 1993.
[10] D. Gajski and F. Vahid, Specification and Design of Embedded Hardware-Software Systems, IEEE Design & Test of Computers, v12(1), 1995.
[11] P. Knudsen and J. Madsen, PACE: A Dynamic Programming Algorithm for Hardware/Software Partitioning, In Proceedings of the Fourth International Workshop on Hardware/Software Codesign, ACM Press, 1996.
[12] B. Lin, S. Vercauteren, and Hugo De Man, Embedded Architecture Co-Synthesis and System Integration, In Proceedings of the Fourth International Workshop on Hardware/Software Codesign, ACM Press, 1996.
[13] K. M. Kavi, B. P. Buckles, and U. N. Bhat, A Formal Definition of Data Flow Graph Models, IEEE Transactions on Computers, C-35 (11), 1986.
[14] P. C. Treleaven, D. R. Brownbridge, and R. P. Hopkins, Data driven and demand driven computer architecture, ACM Comput. Surv., Mar. 1982.
[15] D. Ku and G. De Micheli, High-level Synthesis of ASICs under Timing and Synchronization Constraints, Kluwer Academic Publishers, 1992.
[16] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.
[17] M. Potkonjak, K. Kim, and R. Karri, Methodology for Behavioral Synthesis-based Algorithm-level Design Space Exploration: DCT Case Study, to appear in Proceedings of the 35th Design Automation Conference, ACM Press, June 1997.
[18] G. Schmidt and T. Strohlein, Relations and Graphs - Discrete Mathematics for Computer Scientists, Springer-Verlag, 1993.
[19] P. R. Panda and N. Dutt, 1995 High Level Synthesis Design Repository, Technical Report #95-04, University of California - Irvine, Feb. 1995.
[20] E. C. Ifeachor and B. W. Jervis, Digital Signal Processing, A Practical Approach, Addison-Wesley Publishers Ltd., 1993.
[21] G. Veciana and M. Jacome, Hierarchical Algorithms for Assessing Probabilistic Constraints on System Performance, Technical Report, The University of Texas at Austin, Mar 1997. (also submitted to a conference)