
Memory-Optimal Evaluation of Expression Trees Involving Large Objects

CHI-CHUNG LAM

The Ohio State University

THOMAS RAUBER

Martin-Luther University Halle-Wittenberg

GERALD BAUMGARTNER, DANIEL COCIORVA, and P. SADAYAPPAN

The Ohio State University

The need to evaluate expression trees involving large objects arises in scientific computing applications such as electronic structure calculations. Often, the tree node objects are so large that only a subset of them can fit into memory at a time. This paper addresses the problem of finding an evaluation order of the nodes in a given expression tree that uses the least amount of memory. We present an algorithm that finds an optimal evaluation order in Θ(n log² n) time for an n-node expression tree and prove its correctness.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—Code generation; Optimization; I.1.3 [Algebraic Manipulation]: Languages and Systems—Evaluation strategies; I.2.1 [Artificial Intelligence]: Automatic Programming—Program synthesis; J.2 [Computer Applications]: Physical Sciences and Engineering—Chemistry; Physics

General Terms: Algorithms, Languages

Additional Key Words and Phrases: Expression tree, evaluation order, memory minimization

1. INTRODUCTION

This paper addresses the problem of finding an evaluation order of the nodes in a given expression tree that minimizes memory usage. The expression tree must be evaluated in some bottom-up order, i.e., the evaluation of a node cannot precede the evaluation of any of its children. The nodes of the expression tree are large data objects whose sizes are given. If the total size of the data objects is so large that they cannot all fit into memory at the same time, space for the data objects has to be allocated and deallocated dynamically. Due to the parent-child dependence relation, a data object cannot be deallocated until its parent

This work was supported in part by the National Science Foundation under grants DMR-9520319 and CHE-0121676.

Authors' address: C. Lam, G. Baumgartner, D. Cociorva, and P. Sadayappan, Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210; email: {clam; gb; cociorva; saday}@cis.ohio-state.edu.

Author's address: T. Rauber, Martin-Luther-University Halle-Wittenberg, Fachbereich Mathematik und Informatik, Von-Seckendorff-Platz 1, D-06120 Halle/Saale, Germany; email: rauber@informatik.uni-halle.de.

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© 2002 ACM 0164-0925/02/0100-TBD $00.02

ACM Transactions on Programming Languages and Systems, Vol. TBD, No. TBD, TBD TBD, Pages 1–15.


node data object has been evaluated. The objective is to minimize the maximum memory usage during the evaluation of the entire expression tree.

This problem arises, for example, in optimizing a class of loop calculations implementing multi-dimensional integrals of the products of several large input arrays to compute the electronic properties of semiconductors and metals [Hybertsen and Louie 1986; Rojas et al. 1995]. One such application is the calculation of electronic properties of MX materials with the inclusion of many-body effects [Aulbur 1996]. MX materials are linear chain compounds with alternating transition-metal atoms (M = Ni, Pd, or Pt) and halogen atoms (X = Cl, Br, or I). The following multi-dimensional integral computes the susceptibility in momentum space for the determination of the self-energy of MX compounds in real space.

$$\chi_{G,G'}(k, i\tau) = i \sum_{RL,\,R'L'} D^{k,G}_{RL,R'L'}(i\tau)\, D^{*k,G'}_{RL,R'L'}(-i\tau)$$

where

$$D^{k,G}_{RL,R'L'}(i\tau) = \frac{1}{\sqrt{\Omega}} \int_{\Omega} dr\; e^{-i(k+G)r} \sum_{R''L''} \theta_{R''L'',R'L'}(r)\, G_{RL,R''L''}(i\tau)$$

and

$$\theta_{RL,R'L'}(r) = \Psi_{RL}(r - R)\, \Psi^{*}_{R'L'}(r - R')$$

In the above equations, Ψ is the localized basis function, G is the orbital projected Green function for electrons and holes, and D is an intermediate array computed using a fast Fourier transform (FFT). The interpretation of the variables and their ranges is given in Table I. The computation can be expressed in a canonical form as the following multi-dimensional summation of the product of several arrays, where Ψ is written as a two-dimensional array Y.

$$\sum_{r,\,r_1,\,RL,\,RL_1,\,RL_2,\,RL_3} Y[r, RL]\; Y[r, RL_2]\; Y[r_1, RL_3]\; Y[r_1, RL_1]\; G[RL_1, RL, t]\; G[RL_2, RL_3, t]\; \exp[k, r]\; \exp[G, r]\; \exp[k, r_1]\; \exp[G_1, r_1]$$

When expressed in this form, as a multi-dimensional summation of the primary inputs to the computation, no intermediate quantities such as D are explicitly computed. Although it requires less memory than the previous form, it is not computationally attractive since the number of arithmetic operations is enormous (multiplying the extents of the various summation indices gives O(10^22) operations). The number of operations can be considerably reduced by applying algebraic properties to factor out terms that are independent of some of the summation indices, thereby computing some intermediate results that can be stored and reused instead of being recomputed many times. There are many different ways of applying the algebraic laws of distributivity and associativity, resulting in different alternative computational structures. Using simple examples, we briefly touch upon some of the optimization issues that arise, and then proceed to discuss in detail the problem of memory-optimal evaluation of expression trees.

The problem of minimizing the number of arithmetic operations by applying the algebraic laws of commutativity, associativity, and distributivity has been previously addressed in [Lam et al. 1997a; 1997b]. Consider, for example, the multi-dimensional integral shown


$$W[k] = \sum_{i,j,l} A[i,j] \times B[j,k,l] \times C[k,l]$$

(a) A multi-dimensional integral

$$f_1[j] = \sum_i A[i,j] \qquad f_2[j,k,l] = B[j,k,l] \times C[k,l] \qquad f_3[j,k] = \sum_l f_2[j,k,l]$$
$$f_4[j,k] = f_1[j] \times f_3[j,k] \qquad W[k] = f_5[k] = \sum_j f_4[j,k]$$

(b) A formula sequence for computing (a)

(c) An expression tree representation of (b): the leaves are the input arrays A[i,j], B[j,k,l], and C[k,l]; f2 is the product of B and C, f3 sums f2 over l, f1 sums A over i, f4 is the product of f1 and f3, and the root f5 sums f4 over j.

Fig. 1. An example multi-dimensional integral and two representations of a computation.

in Figure 1(a). If implemented directly as expressed (i.e., as a single set of perfectly-nested loops), the computation would require 2 N_i N_j N_k N_l arithmetic operations. However, assuming that associative reordering of the operations and use of the distributive law of multiplication over addition are satisfactory for the floating-point computations, the above computation can be rewritten in various ways. One equivalent form that requires only 2 N_j N_k N_l + 2 N_j N_k + N_i N_j operations is given in Figure 1(b). It expresses the sequence of steps in computing the multi-dimensional integral as a sequence of formulae. Each formula computes some intermediate result, and the last formula gives the final result. A sequence of formulae can also be represented as an expression tree in which the leaf nodes are input arrays, the internal nodes are intermediate results, and the root node is the final integral. For instance, Figure 1(c) shows the expression tree corresponding to the example formula sequence. The operation minimization problem has been proved to be NP-complete, and a pruning search algorithm for it was proposed.
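The savings of the factored form can be checked numerically on small arrays. The following sketch is illustrative only (the array contents and extents are made up, not from the paper); it contrasts the direct form of Figure 1(a) with the formula sequence of Figure 1(b).

```python
# Sketch (not code from the paper): direct vs. factored evaluation of
# W[k] = sum_{i,j,l} A[i,j] * B[j,k,l] * C[k,l] from Figure 1.
Ni, Nj, Nk, Nl = 4, 5, 6, 7   # hypothetical small extents

# Hypothetical deterministic input arrays.
A = [[(i + 2 * j) % 5 for j in range(Nj)] for i in range(Ni)]
B = [[[(j + k + l) % 7 for l in range(Nl)] for k in range(Nk)] for j in range(Nj)]
C = [[(k * l) % 3 for l in range(Nl)] for k in range(Nk)]

# Direct form: one perfectly-nested loop, ~2 Ni Nj Nk Nl operations.
W_direct = [sum(A[i][j] * B[j][k][l] * C[k][l]
                for i in range(Ni) for j in range(Nj) for l in range(Nl))
            for k in range(Nk)]

# Factored form of Figure 1(b): ~2 Nj Nk Nl + 2 Nj Nk + Ni Nj operations.
f1 = [sum(A[i][j] for i in range(Ni)) for j in range(Nj)]            # f1[j] = sum_i A[i,j]
f2 = [[[B[j][k][l] * C[k][l] for l in range(Nl)]
       for k in range(Nk)] for j in range(Nj)]                       # f2 = B * C
f3 = [[sum(f2[j][k]) for k in range(Nk)] for j in range(Nj)]         # f3[j,k] = sum_l f2[j,k,l]
f4 = [[f1[j] * f3[j][k] for k in range(Nk)] for j in range(Nj)]      # f4 = f1 * f3
W_factored = [sum(f4[j][k] for j in range(Nj)) for k in range(Nk)]   # f5[k]

assert W_direct == W_factored
```

Both forms compute the same result; only the operation count (and, as discussed next, the memory behavior of the intermediates) differs.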

Once the operation-minimal form is determined, the next step is to implement it as some loop structure. A straightforward way is to generate a sequence of perfectly nested loops, each evaluating an intermediate result. However, in practice, the input arrays and the intermediate arrays are often so large that they cannot all fit into available memory. The minimization of memory usage is thus desirable. By fusing loops it is possible to


              I:16
             /    \
          F:15    H:5
          /  \       \
       B:3    E:16   G:25
        |      |
      A:20    D:9
               |
             C:30

Fig. 2. An example expression tree

reduce the dimensionality of some of the intermediate arrays [Lam et al. 1999; Lam 1999]. Furthermore, it is not necessary to keep all arrays allocated for the duration of the entire computation. Space allocated for an intermediate array can be deallocated as soon as the operation using the array has been performed. When allocating and deallocating arrays dynamically, the evaluation order affects the maximum memory requirement. In this paper, we focus on the problem of finding an evaluation order of the nodes in a given expression tree that minimizes dynamic memory usage. A solution to this problem would allow the automatic generation of efficient code for evaluating expression trees, e.g., for computing the electronic properties of MX materials. We believe that the solution we develop here may have applicability to other areas such as database query optimization and data mining.

As an example of the memory usage optimization problem, consider the expression tree shown in Fig. 2. The size of each data object is shown alongside the corresponding node label. Before evaluating a data object, space for it must be allocated. This space can be deallocated only after the evaluation of its parent is complete. There are many allowable evaluation orders of the nodes. One of them is the post-order traversal ⟨A, B, C, D, E, F, G, H, I⟩ of the expression tree. It has a maximum memory usage of 45 units, which occurs during the evaluation of H, when F, G, and H are in memory. Other evaluation orders may use more or less memory. Finding the optimal order ⟨C, D, G, H, A, B, E, F, I⟩, which uses 39 units of memory, is not trivial.

A simpler problem related to the memory usage minimization problem is the register allocation problem for binary expression trees in which the sizes of all nodes are unity. It has been addressed in [Nakata 1967; Sethi and Ullman 1970] and can be solved in O(n) time, where n is the number of nodes in the expression tree. But if the expression tree is replaced by a directed acyclic graph (in which all nodes are still of unit size), the problem becomes NP-complete [Sethi 1975]. The algorithm in [Sethi and Ullman 1970] for expression trees of unit-sized nodes does not extend directly to expression trees having nodes of different sizes. Appel and Supowit [Appel and Supowit 1987] generalized the register allocation problem to higher-degree expression trees of arbitrarily-sized nodes. However, they restricted their attention to solutions that evaluate subtrees contiguously, which can result in non-optimal solutions (as we show through an example in Section 2). The generalized register allocation problem for expression trees was addressed in the context of vector machines in [Rauber 1990a; 1990b].

The rest of this paper is organized as follows. In Section 2, we formally define the memory usage optimization problem and make some observations about it. Section 3 presents an efficient algorithm for finding the optimal evaluation order for an expression


Node   himem           deallocate   lomem
A      0 + 20 = 20     -            20 − 0 = 20
B      20 + 3 = 23     A            23 − 20 = 3
C      3 + 30 = 33     -            33 − 0 = 33
D      33 + 9 = 42     C            42 − 30 = 12
E      12 + 16 = 28    D            28 − 9 = 19
F      19 + 15 = 34    B, E         34 − 3 − 16 = 15
G      15 + 25 = 40    -            40 − 0 = 40
H      40 + 5 = 45     G            45 − 25 = 20
I      20 + 16 = 36    F, H         36 − 15 − 5 = 16
max    45

Fig. 3. Memory usage of a post-order traversal of the expression tree in Fig. 2

tree. In Section 4, we show that the algorithm finds a solution in Θ(n log² n) time for an n-node expression tree. The correctness of the algorithm is proved in Section 5. Section 6 provides conclusions.

2. PROBLEM STATEMENT

We address the problem of optimizing the memory usage in the evaluation of a given expression tree whose nodes correspond to large data objects of various sizes. Each data object depends on all its children (if any), and thus can be evaluated only after all its children have been evaluated. We assume that the evaluation of each node in the expression tree is atomic. Space for each data object is dynamically allocated/deallocated in its entirety. Leaf node objects are created or read in as needed; internal node objects must be allocated before their evaluation begins; and each object must remain in memory until the evaluation of its parent is completed. The goal is to find an evaluation order of the nodes that uses the least amount of memory. Since an evaluation order is also a traversal of the nodes, we will use these two terms interchangeably.

We define the problem formally as follows:

Given a tree T and a size v.size for each node v ∈ T, find a computation of T that uses the least memory, i.e., an ordering P = ⟨v_1, v_2, ..., v_n⟩ of the nodes in T, where n is the number of nodes in T, such that

(1) for all v_i, v_j, if v_i is the parent of v_j, then i > j; and

(2) max_{v_i ∈ P} {himem(v_i, P)} is minimized, where

    himem(v_i, P) = lomem(v_{i−1}, P) + v_i.size

    lomem(v_i, P) = himem(v_i, P) − Σ_{child v_j of v_i} v_j.size  if i > 0, and lomem(v_0, P) = 0.

Here, himem(v_i, P) is the memory usage during the evaluation of v_i in the traversal P, and lomem(v_i, P) is the memory usage upon completion of the same evaluation. These definitions reflect that we need to allocate space for v_i before its evaluation, and that after the evaluation of v_i, the space allocated to all its children may be released. For instance, consider the post-order traversal P = ⟨A, B, C, D, E, F, G, H, I⟩ of the expression tree shown in Fig. 2. During and after the evaluation of A, A is in memory. So, himem(A, P) = lomem(A, P) = A.size = 20. For evaluating B, we need to allocate space for B, thus


(a) A better traversal          (b) The optimal traversal

Node   himem   lomem            Node   himem   lomem
G      25      25               C      30      30
H      30      5                D      39      9
C      35      35               G      34      34
D      44      14               H      39      14
E      30      21               A      34      34
A      41      41               B      37      17
B      44      24               E      33      24
F      39      20               F      39      20
I      36      16               I      36      16
max    44                       max    39

Fig. 4. Memory usage of two different traversals of the expression tree in Fig. 2

himem(B, P) = lomem(A, P) + B.size = 23. After B is obtained, A can be deallocated, giving lomem(B, P) = himem(B, P) − A.size = 3. The memory usage for the rest of the nodes is determined similarly and is shown in Fig. 3.
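The himem/lomem recurrences translate directly into a small cost calculator. The following Python sketch is ours, not the paper's (the dictionary encoding of the Fig. 2 tree is an assumption of this sketch); it reproduces the memory usage of the post-order traversal from Fig. 3 and of the optimal traversal from Fig. 4(b).

```python
# Sketch (not code from the paper): evaluate the maximum memory usage of a
# bottom-up traversal of the expression tree in Fig. 2, using the Section 2
# recurrences: himem(v_i) = lomem(v_{i-1}) + v_i.size, and lomem(v_i)
# subtracts the sizes of v_i's children once v_i has been evaluated.
size = {'A': 20, 'B': 3, 'C': 30, 'D': 9, 'E': 16,
        'F': 15, 'G': 25, 'H': 5, 'I': 16}
children = {'B': ['A'], 'D': ['C'], 'E': ['D'], 'F': ['B', 'E'],
            'H': ['G'], 'I': ['F', 'H']}

def max_memory(order):
    """Maximum memory usage (highest himem) over the traversal `order`."""
    lo, peak = 0, 0
    for v in order:
        hi = lo + size[v]                                    # allocate v
        peak = max(peak, hi)
        lo = hi - sum(size[c] for c in children.get(v, []))  # free v's children
    return peak

post_order = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
optimal    = ['C', 'D', 'G', 'H', 'A', 'B', 'E', 'F', 'I']
assert max_memory(post_order) == 45   # matches Fig. 3
assert max_memory(optimal) == 39      # matches Fig. 4(b)
```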

The post-order traversal of the given expression tree, however, is not optimal in memory usage. In this example, none of the traversals that visit all nodes in one subtree before visiting another subtree is optimal. There are four such traversals: ⟨A, B, C, D, E, F, G, H, I⟩, ⟨C, D, E, A, B, F, G, H, I⟩, ⟨G, H, A, B, C, D, E, F, I⟩, and ⟨G, H, C, D, E, A, B, F, I⟩. If we follow the traditional wisdom of visiting the subtree with the higher memory usage first, as in the Sethi-Ullman algorithm [Sethi and Ullman 1970], we obtain the best of these four traversals, which is ⟨G, H, C, D, E, A, B, F, I⟩. Its overall memory usage is 44 units, as shown in Fig. 4(a), and is not optimal. The optimal traversal, which uses only 39 units of memory, is ⟨C, D, G, H, A, B, E, F, I⟩ (see Fig. 4(b)). Notice that it 'jumps' back and forth between the subtrees. Therefore, any algorithm that only considers traversals that visit subtrees contiguously may not produce an optimal solution.

The memory usage optimization problem has an interesting property: an expression tree or a subtree may have more than one optimal traversal. For example, for the subtree rooted at F, the traversals ⟨C, D, E, A, B, F⟩ and ⟨C, D, A, B, E, F⟩ both use the least memory space of 39 units. One might attempt to take two optimal subtree traversals, one from each child of a node X, merge them together optimally, and then append X to form a traversal for X. But this resulting traversal may not be optimal for X. Continuing with the above example, if we merge together ⟨C, D, E, A, B, F⟩ and ⟨G, H⟩ (which are optimal for the subtrees rooted at F and H, respectively) and then append I, the best we can get is a sub-optimal traversal ⟨G, H, C, D, E, A, B, F, I⟩ that uses 44 units of memory (see Fig. 4(a)). However, the other optimal traversal ⟨C, D, A, B, E, F⟩ for the subtree rooted at F can be merged with ⟨G, H⟩ to form ⟨C, D, G, H, A, B, E, F, I⟩ (with I appended), which is an optimal traversal of the entire expression tree. Thus, locally optimal traversals may not be globally optimal. In the next section, we present an efficient algorithm that finds traversals which are not only locally optimal but also globally optimal.


3. AN EFFICIENT ALGORITHM

We now present an efficient divide-and-conquer algorithm that, given an expression tree whose nodes are large data objects, finds an evaluation order of the tree that minimizes the memory usage. For each node in the expression tree, it computes an optimal traversal for the subtree rooted at that node. The optimal subtree traversal that it computes has a special property: it is not only locally optimal for the subtree, but also globally optimal in the sense that it can be merged with globally optimal traversals for other subtrees to form an optimal traversal for a larger tree. As we have seen in Section 2, not all locally optimal traversals for a subtree can be used to form an optimal traversal for a larger tree.

The algorithm stores a traversal not as an ordered list of nodes, but as an ordered list of indivisible units or elements. Each element contains an ordered list of nodes with the property that there necessarily exists some globally optimal traversal of the entire tree in which this sequence appears undivided. Therefore, as we show later, inserting any node in between the nodes of an element does not lower the total memory usage. An element initially contains a single node, but as the algorithm goes up the tree merging traversals together and appending new nodes to them, elements may be appended together to form new elements containing a larger number of nodes. Moreover, the order of indivisible units in a traversal stays invariant, i.e., the indivisible units must appear in the same order in some optimal traversal of the entire expression tree. This means that indivisible units can be treated as a whole, and we only need to consider the relative order of indivisible units from different subtrees.

Each element (or indivisible unit) in a traversal is a (nodelist, hi, lo) triple, where nodelist is an ordered list of nodes, hi is the maximum memory usage during the evaluation of the nodes in nodelist, and lo is the memory usage after those nodes are evaluated. Using the terminology from Section 2, hi is the highest himem among the nodes in nodelist, and lo is the lomem of the last node in nodelist. The algorithm always maintains the elements of a traversal in decreasing hi and increasing lo order, which implies an order of decreasing hi-lo difference. In Section 5, we prove that arranging the indivisible units in this order minimizes memory usage.

We can also visualize the elements in a traversal as follows. The first element in a traversal contains the nodes from the beginning of the traversal up to and including the node that has the lowest lomem after the node that has the highest himem. The highest himem and the lowest lomem become the hi and the lo of the first element, respectively. The same rules are recursively applied to the remainder of the traversal to form the second element, and so on.
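This recursive splitting rule can be expressed as a short procedure. The sketch below is ours, not the paper's (the function name and the dictionary encoding are assumptions); applied to the traversal ⟨C, D, E⟩ of the subtree rooted at E in Fig. 2, with the himem/lomem values from Fig. 3's recurrences, it reproduces the two elements of E.seq shown later in Fig. 5.

```python
# Sketch (not code from the paper): split a traversal into indivisible units.
# Each unit runs up to the last node with the lowest lomem following the last
# node with the highest himem; the rule is then reapplied to the remainder.

def split_into_elements(nodes, himem, lomem):
    """Return (nodelist, hi, lo) triples for the traversal `nodes`."""
    elements = []
    start = 0
    while start < len(nodes):
        rest = range(start, len(nodes))
        # last index with the highest himem (ties broken toward the end)
        m = max(rest, key=lambda i: (himem[nodes[i]], i))
        # last index at or after m with the lowest lomem
        end = max((i for i in rest if i >= m),
                  key=lambda i: (-lomem[nodes[i]], i))
        elements.append((''.join(nodes[start:end + 1]),
                         himem[nodes[m]], lomem[nodes[end]]))
        start = end + 1
    return elements

# himem/lomem of the traversal <C, D, E> of the subtree rooted at E in Fig. 2.
himem = {'C': 30, 'D': 39, 'E': 25}
lomem = {'C': 30, 'D': 9, 'E': 16}
assert split_into_elements(['C', 'D', 'E'], himem, lomem) == \
    [('CD', 39, 9), ('E', 25, 16)]
```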

Before formally describing the algorithm, we illustrate how it works with an example. Consider the expression tree shown in Fig. 2. We visit the nodes in a bottom-up order. Since A has no children, the optimal traversal for the subtree rooted at A, denoted by A.seq, is ⟨(A, 20, 20)⟩, meaning that 20 units of memory are needed during the evaluation of A and immediately afterwards. To form B.seq, we take A.seq, append a new element (B, 3 + 20, 3) (the hi of which is adjusted by the lo of the preceding element), and get ⟨(A, 20, 20), (B, 23, 3)⟩. Whenever two adjacent elements are not in decreasing hi and increasing lo order, we combine them into one element by concatenating the nodelists and taking the highest hi and the second lo. Thus, B.seq becomes ⟨(AB, 23, 3)⟩. (For clarity, we write nodelists in a sequence as strings.) Here, A and B form an indivisible


Node v   Subtree traversals merged in decreasing hi-lo order    Optimal traversal v.seq
A        (A, 20, 20)                                            (A, 20, 20)
B        (A, 20, 20), (B, 23, 3)                                (AB, 23, 3)
C        (C, 30, 30)                                            (C, 30, 30)
D        (C, 30, 30), (D, 39, 9)                                (CD, 39, 9)
E        (CD, 39, 9), (E, 25, 16)                               (CD, 39, 9), (E, 25, 16)
F        (CD, 39, 9), (AB, 32, 12), (E, 28, 19), (F, 34, 15)    (CD, 39, 9), (ABEF, 34, 15)
G        (G, 25, 25)                                            (G, 25, 25)
H        (G, 25, 25), (H, 30, 5)                                (GH, 30, 5)
I        (CD, 39, 9), (GH, 39, 14), (ABEF, 39, 20), (I, 36, 16) (CDGHABEFI, 39, 16)

Fig. 5. Optimal traversals for the subtrees in the expression tree in Fig. 2

unit, implying that B must follow A in some optimal traversal of the entire expression tree. The maximum memory usage during the evaluation of this indivisible unit is 23 units, and the result of the evaluation occupies 3 units of memory. Similarly, we get E.seq = ⟨(CD, 39, 9), (E, 25, 16)⟩. Note that these two adjacent elements cannot be combined because they are already in decreasing hi and increasing lo order. For node F, which has two children B and E, we merge B.seq and E.seq in the order of decreasing hi-lo difference. The merged elements are, in order, (CD, 39, 9), (AB, 23 + 9, 3 + 9), and finally (E, 25 + 3, 16 + 3), with the hi and lo values adjusted by the lo value of the last merged element from the other subtree. These are the three elements in F.seq after the merge, as no elements are combined so far. Then, we append to F.seq the new element (F, 15 + 19, 15) for the root of the subtree. The new element is combined with the last two elements in F.seq to ensure that the elements are in decreasing hi-lo difference order. Hence, F.seq becomes ⟨(CD, 39, 9), (ABEF, 34, 15)⟩, a sequence of only two indivisible units. The optimal traversals for the other nodes are computed in the same way and are shown in Fig. 5. At the end, the algorithm returns the optimal traversal ⟨C, D, G, H, A, B, E, F, I⟩ for the entire expression tree (see Fig. 4(b)).

Fig. 6 shows the algorithm. The input to the algorithm (the MinMemTraversal procedure) is an expression tree T, in which each node v has a field v.size denoting the size of its data object. The procedure performs a bottom-up traversal of the tree and, for each node v, computes an optimal traversal v.seq for the subtree rooted at v. The optimal traversal v.seq is obtained by optimally merging together the optimal traversals u.seq from each child u of v, and then appending v. At the end, the procedure returns a concatenation of all the nodelists in T.root.seq as the optimal traversal for the given expression tree. The memory usage of the optimal traversal is T.root.seq[1].hi.

The MergeSeq procedure optimally merges two given traversals S1 and S2 and returns the merged result S. S1 and S2 are subtree traversals of two children of the same parent node. The optimal merge is performed in a fashion similar to merge-sort. Elements from S1 and S2 are scanned sequentially and appended to S in the order of decreasing hi-lo difference. This order guarantees that the indivisible units are arranged to minimize memory usage. Since S1 and S2 are formed independently, the hi and lo values in the elements from S1 and S2 must be adjusted before they can be appended to S. The amount of adjustment for an element from S1 (S2) equals the lo value of the last merged element


Variable   Interpretation                             Range
RL         Orbital                                    10^3
r          Discretized points in real space           10^5
τ          Time step                                  10^2
k          K-point in an irreducible Brillouin zone   10
G          Reciprocal lattice vector                  10^3

Table I. Variables in an example physics computation.

MinMemTraversal (T):
    foreach node v in some bottom-up traversal of T
        v.seq = ⟨⟩
        foreach child u of v
            v.seq = MergeSeq(v.seq, u.seq)
        if |v.seq| > 0 then              // |x| is the length of x
            base = v.seq[|v.seq|].lo
        else
            base = 0
        AppendSeq(v.seq, v, v.size + base, v.size)
    nodelist = ⟨⟩
    for i = 1 to |T.root.seq|
        nodelist = nodelist + T.root.seq[i].nodelist   // + is concatenation
    return nodelist                      // memory usage is T.root.seq[1].hi

MergeSeq (S1, S2):
    S = ⟨⟩
    i = j = 1
    base1 = base2 = 0
    while i ≤ |S1| or j ≤ |S2|
        if j > |S2| or (i ≤ |S1| and S1[i].hi − S1[i].lo > S2[j].hi − S2[j].lo) then
            AppendSeq(S, S1[i].nodelist, S1[i].hi + base1, S1[i].lo + base1)
            base2 = S1[i].lo
            i++
        else
            AppendSeq(S, S2[j].nodelist, S2[j].hi + base2, S2[j].lo + base2)
            base1 = S2[j].lo
            j++
    end while
    return S

AppendSeq (S, nodelist, hi, lo):
    E = (nodelist, hi, lo)               // new element to append to S
    i = |S|
    while i ≥ 1 and (E.hi ≥ S[i].hi or E.lo ≤ S[i].lo)
        // combine S[i] with E
        E = (S[i].nodelist + E.nodelist, max(S[i].hi, E.hi), E.lo)
        remove S[i] from S
        i−−
    end while
    S = S + E                            // |S| is now i + 1

Fig. 6. Procedure for finding a memory-optimal traversal of an expression tree


from S2 (S1), which is kept in the variable base1 (base2).

The AppendSeq procedure appends the new element specified by the triple (nodelist, hi, lo) to the given traversal S. Before the new element E is appended to S, it is combined with any elements at the end of S whose hi is not higher than E.hi or whose lo is not lower than E.lo. The combined element has the concatenated nodelist, the highest hi, and the original E.lo.
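As a concrete rendering of the procedures in Fig. 6, here is a list-based Python sketch of MinMemTraversal, MergeSeq, and AppendSeq. The code is ours, not the paper's, and the function and variable names are our own; it uses plain Python lists rather than the red-black-tree representation of Section 4, so it does not achieve the Θ(n log² n) bound, but on the tree of Fig. 2 it reproduces the optimal traversal and its memory usage of 39 units.

```python
# Sketch (not code from the paper) of the Fig. 6 algorithm, with traversals
# represented as lists of (nodelist, hi, lo) triples.

def append_seq(S, nodelist, hi, lo):
    E = (nodelist, hi, lo)
    # Combine trailing elements whose hi is not higher or lo not lower than E's.
    while S and (E[1] >= S[-1][1] or E[2] <= S[-1][2]):
        last = S.pop()
        E = (last[0] + E[0], max(last[1], E[1]), E[2])
    S.append(E)

def merge_seq(S1, S2):
    S, i, j, base1, base2 = [], 0, 0, 0, 0
    while i < len(S1) or j < len(S2):
        # Take the element with the larger hi-lo difference next.
        if j == len(S2) or (i < len(S1) and
                            S1[i][1] - S1[i][2] > S2[j][1] - S2[j][2]):
            append_seq(S, S1[i][0], S1[i][1] + base1, S1[i][2] + base1)
            base2 = S1[i][2]; i += 1
        else:
            append_seq(S, S2[j][0], S2[j][1] + base2, S2[j][2] + base2)
            base1 = S2[j][2]; j += 1
    return S

def min_mem_traversal(root, size, children):
    def solve(v):                      # returns v.seq, computed bottom-up
        seq = []
        for u in children.get(v, []):
            seq = merge_seq(seq, solve(u))
        base = seq[-1][2] if seq else 0
        append_seq(seq, [v], size[v] + base, size[v])
        return seq
    seq = solve(root)
    order = [v for elem in seq for v in elem[0]]
    return order, seq[0][1]            # traversal and its memory usage

# The expression tree of Fig. 2.
size = {'A': 20, 'B': 3, 'C': 30, 'D': 9, 'E': 16,
        'F': 15, 'G': 25, 'H': 5, 'I': 16}
children = {'B': ['A'], 'D': ['C'], 'E': ['D'], 'F': ['B', 'E'],
            'H': ['G'], 'I': ['F', 'H']}
order, usage = min_mem_traversal('I', size, children)
assert order == ['C', 'D', 'G', 'H', 'A', 'B', 'E', 'F', 'I']
assert usage == 39
```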

This algorithm has the property that the traversal it finds for a subtree T′ is not only optimal for T′ but must also appear as a subsequence in some optimal traversal for any larger tree that contains T′ as a subtree. For example, E.seq is a subsequence in F.seq, which is in turn a subsequence in I.seq (see Fig. 5).

4. COMPLEXITY OF THE ALGORITHM

The time complexity of our algorithm is Θ(n log² n) for an n-node expression tree. We represent a sequence of indivisible units as a red-black tree with the indivisible units at the leaves. The tree is sorted by decreasing hi-lo difference. In addition, the leaves are linked in sorted order in a doubly-linked list, and a count of the number of indivisible units in the sequence is maintained.

The cost of constructing the final evaluation order consists of the cost of building sequences of indivisible units and the cost of combining indivisible units into larger indivisible units. To find an upper bound on the cost of the algorithm, the worst-case cost of building sequences can be analyzed separately from the worst-case cost of combining indivisible units.

The sequence for a leaf node of the expression tree can be constructed in constant time. For a unary interior node, we simply append the node to the sequence of its subtree, which costs O(log n) time. For an m-ary interior node, we merge the sequences of the subtrees by inserting the nodes from the smaller sequences into the largest sequence. Inserting a node into a sequence represented as a red-black tree costs O(log n) time. Since we always insert the nodes of the smaller sequences into the largest one, every time a given node of the expression tree gets inserted into a sequence, the size of the sequence containing this node at least doubles. Each node, therefore, can be inserted into a sequence at most O(log n) times, with each insertion costing O(log n) time. The cost of building the traversal for the entire expression tree is, therefore, O(n log² n).

Two individual indivisible units can be combined in constant time. When combining two adjacent indivisible units within a sequence, one of them must be deleted from the sequence and the red-black tree must be re-balanced, which costs O(log n) time. Since there can be at most n − 1 of these combine operations, the total cost is O(n log n). The cost of the whole algorithm is, therefore, dominated by the cost of building sequences, which is O(n log² n).

Combining indivisible units into larger ones reduces the number of elements in the sequences and, therefore, the time required for merging and combining sequences. In the best case, indivisible units are always combined such that each sequence contains a single element; in this case, the algorithm takes only linear time.

In the worst case, a degenerate expression tree consists of small nodes and large nodes such that every small node has as its only child a large node. A pair of such nodes forms an indivisible unit with hi equal to the size of the large node and lo equal to the size of the small node. Such a tree can be constructed so that these indivisible units do not further combine into larger indivisible units. In this case, the algorithm will result


in a sequence of n/2 indivisible units containing two nodes each. If such a degenerate expression tree is also unbalanced, the algorithm requires Ω(n log² n) time for computing the optimal traversal.

5. CORRECTNESS OF THE ALGORITHM

We now show the correctness of the algorithm. The proof proceeds as follows. Lemma 1 characterizes the indivisible units in a subtree traversal by providing some invariants. Lemma 2 shows that, once formed, each indivisible unit can be considered as a whole. Lemma 3 deals with the optimal ordering of indivisible units from different subtrees. Finally, using the three lemmas, we prove the correctness of the algorithm by arguing that any traversal of an expression tree can be transformed in a series of steps into the optimal traversal found by the algorithm without increasing memory usage.

The first lemma establishes some important invariants about the indivisible units in an ordered list v.seq that represents a traversal. The hi value of an indivisible unit is the highest himem of the nodes in the indivisible unit. The lo value of an indivisible unit is the lomem of the last node in the indivisible unit. Given a sequence of nodes, an indivisible unit extends to the last node with the lowest lomem following the last node with the highest himem. In addition, the indivisible units in a sequence are in decreasing hi and increasing lo order.

LEMMA 5.1. Let v be any node in an expression tree, S = v.seq, and P be the traversal represented by S of the subtree rooted at v, i.e., P = S[1].nodelist ++ S[2].nodelist ++ ... ++ S[|S|].nodelist. The algorithm maintains the following invariants:

For all 1 ≤ i ≤ |S|, let S[i].nodelist = ⟨v_1, v_2, ..., v_n⟩ and let v_m be the last node in S[i].nodelist that has the maximum himem value, i.e., for all k < m, himem(v_k, P) ≤ himem(v_m, P) and for all k > m, himem(v_k, P) < himem(v_m, P). Then, we have:

(1) S[i].hi = himem(v_m, P),
(2) S[i].lo = lomem(v_n, P),
(3) for all m ≤ k ≤ n, lomem(v_k, P) ≥ lomem(v_n, P),
(4) for all 1 ≤ j < i,
    (a) for all 1 ≤ k ≤ n, S[j].hi > himem(v_k, P),
    (b) for all 1 ≤ k ≤ n, S[j].lo < lomem(v_k, P),
    (c) S[j].hi > S[i].hi, and
    (d) S[j].lo < S[i].lo.

PROOF. The above invariants are true by construction. □
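To make this characterization concrete, the following Python sketch (our illustration, not the paper's pseudocode; the tuple representation is an assumption) splits a node sequence, given each node's himem and lomem values, into indivisible units and reports each unit's hi and lo. It assumes the input is a valid subtree traversal, so the resulting units come out in decreasing hi and increasing lo order.

```python
def split_into_units(nodes):
    """nodes: list of (name, himem, lomem) tuples for a subtree traversal.

    Returns a list of (unit_names, hi, lo) triples. Each unit extends to the
    last node with the lowest lomem following the last node with the highest
    himem in the remaining sequence."""
    units = []
    i = 0
    while i < len(nodes):
        rest = nodes[i:]
        hi = max(h for _, h, _ in rest)
        # index of the last node with the highest himem
        m = max(k for k, (_, h, _) in enumerate(rest) if h == hi)
        lo = min(l for _, _, l in rest[m:])
        # index of the last node with the lowest lomem at or after position m
        end = max(k for k in range(m, len(rest)) if rest[k][2] == lo)
        units.append(([name for name, _, _ in rest[:end + 1]], hi, lo))
        i += end + 1
    return units
```

For instance, a two-node sequence with (himem, lomem) values (20, 20) and (23, 3) forms a single unit with hi = 23 and lo = 3, while a sequence (39, 9), (25, 16) splits into two units that are already in decreasing hi and increasing lo order.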

The second lemma asserts the ‘indivisibility’ of an indivisible unit by showing that un-

related nodes inserted in between the nodes of an indivisible unit can always be moved to

the beginning or the end of the indivisible unit without increasing memory usage. Thus,

once an indivisible unit is formed, we do not need to consider breaking it up later. This

lemma allows us to treat each traversal as a sequence of indivisible units (each containing

one or more nodes) instead of a list of the individual nodes.

LEMMA 5.2. Let v be a node in an expression tree T, S = v.seq, and P be a traversal of T in which the nodes from S[i].nodelist appear in the same order as they are in S[i].nodelist, but not contiguously. Then, any nodes that are in between the nodes in S[i].nodelist can always be moved to the beginning or the end of S[i].nodelist without increasing memory usage, provided that none of the nodes that are in between the nodes in S[i].nodelist are ancestors or descendants of any nodes in S[i].nodelist.

PROOF. Let S[i].nodelist = ⟨v_1, ..., v_n⟩, let v_0 be the node before v_1 in S, and let v_m be the node in S[i].nodelist such that for all k < m, himem(v_k, P) ≤ himem(v_m, P) and for all k > m, himem(v_m, S) > himem(v_k, S). Let v'_1, ..., v'_b be the 'foreign' nodes, i.e., the nodes that are in between the nodes in S[i].nodelist in P, with v'_1, ..., v'_a (not necessarily contiguously) before v_m and v'_{a+1}, ..., v'_b (not necessarily contiguously) after v_m in P. Let Q be the traversal obtained from P by removing the nodes in S[i].nodelist.

We construct another traversal P' of T from P by moving v'_1, ..., v'_a to the beginning of S[i].nodelist and v'_{a+1}, ..., v'_b to the end of S[i].nodelist. In other words, we replace ⟨v_1, ..., v'_1, ..., v'_a, ..., v_m, ..., v'_{a+1}, ..., v'_b, ..., v_n⟩ in P with ⟨v'_1, ..., v'_a, v_1, ..., v_m, ..., v_n, v'_{a+1}, ..., v'_b⟩ to form P'.

Traversals P and P' differ in memory usage only at the nodes {v_1, ..., v_n, v'_1, ..., v'_b}. P' does not use more memory than P because:

(1) The memory usage for P' at v_m is the same as the memory usage for P at v_m, since himem(v_m, P') = himem(v_m, S) + lomem(v'_a, Q) = himem(v_m, P).

(2) For all 1 ≤ k ≤ n, the memory usage for P' at v_k is no higher than the memory usage for P at v_k, since himem(v_k, S) ≤ himem(v_m, S) implies that himem(v_k, P') = himem(v_k, S) + lomem(v'_a, Q) ≤ himem(v_m, S) + lomem(v'_a, Q) = himem(v_m, P).

(3) For all 1 ≤ j ≤ a, the memory usage for P' at v'_j is no higher than the memory usage for P at v'_j, since for all 1 ≤ k ≤ m, lomem(v_0, S) < lomem(v_k, S) (by invariant 4(b) in Lemma 5.1) implies himem(v'_j, P') = himem(v'_j, Q) + lomem(v_0, S) ≤ himem(v'_j, P).

(4) For all a < j ≤ b, the memory usage for P' at v'_j is no higher than the memory usage for P at v'_j, since for all m ≤ k ≤ n, lomem(v_k, S) ≥ lomem(v_n, S) (by invariant 3 in Lemma 5.1) implies himem(v'_j, P') = himem(v'_j, Q) + lomem(v_n, S) ≤ himem(v'_j, P).

Since the memory usage of any node in v_1, ..., v_n after moving the foreign nodes cannot exceed that of v_m, which remains unchanged, and the memory usage of the foreign nodes does not increase as a result of moving them, the overall maximum memory usage cannot increase. □
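Arguments like the one above can be checked mechanically on small cases using the himem/lomem recurrences of Section 2. The sketch below (our illustration; the tree shape and sizes are hypothetical) computes the peak memory of a candidate traversal, so different placements of 'foreign' nodes can be compared directly:

```python
def peak_memory(order, size, children):
    """Peak memory of evaluating `order` (a bottom-up traversal).

    Uses the recurrences himem(v_i) = lomem(v_{i-1}) + size(v_i) and
    lomem(v_i) = himem(v_i) - sum of the sizes of v_i's children,
    since a node's children can be freed once the node is evaluated."""
    lo = 0
    peak = 0
    for v in order:
        hi = lo + size[v]
        peak = max(peak, hi)
        lo = hi - sum(size[c] for c in children.get(v, ()))
    return peak

# Hypothetical tree: r has children a and b; a has child x.
size = {"x": 4, "a": 1, "b": 3, "r": 2}
children = {"a": ["x"], "r": ["a", "b"]}

peak_memory(["x", "a", "b", "r"], size, children)  # -> 6
peak_memory(["b", "x", "a", "r"], size, children)  # -> 8
```

Both orders are valid bottom-up traversals, but evaluating b before the unit x, a raises the peak from 6 to 8 units; this is the kind of effect Lemma 5.2 shows can always be avoided by moving foreign nodes out of an indivisible unit.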

The next lemma deals with the ordering of indivisible units. It shows that arranging in-

divisible units from different subtrees in the order of decreasing hi-lo difference minimizes

memory usage, since two indivisible units that are not in that order can be interchanged in

the merged traversal without increasing memory usage.

LEMMA 5.3. Let v and v' be two nodes in an expression tree that are siblings of each other, S = v.seq, and S' = v'.seq. Then, among all possible merges of S and S', the merge that arranges the elements from S and S' in the order of decreasing hi-lo difference uses the least memory.

PROOF. Let M be a merge of S and S' that is not in the order of decreasing hi-lo difference. Then there exists an adjacent pair of elements, one from each of S and S', that are not in that order. Without loss of generality, we assume the first element is S'[j] from S' and the second one is S[i] from S. Consider the merge M' obtained from M by interchanging S'[j] and S[i]. To simplify the notation, let H_r = S[r].hi, L_r = S[r].lo, H'_r = S'[r].hi, and L'_r = S'[r].lo. The memory usage of M and M' differs only at S'[j] and S[i] and is compared in Fig. 7.

(a) Sequence M:

    element   himem             lomem
    ...       ...               ...
    S'[j]     H'_j + L_{i-1}    L'_j + L_{i-1}
    S[i]      H_i + L'_j        L_i + L'_j
    ...       ...               ...

(b) Sequence M':

    element   himem             lomem
    ...       ...               ...
    S[i]      H_i + L'_{j-1}    L_i + L'_{j-1}
    S'[j]     H'_j + L_i        L'_j + L_i
    ...       ...               ...

Fig. 7. Memory usage comparison of two traversals in Lemma 5.3.

The memory usage of M at the two elements is max(H'_j + L_{i-1}, H_i + L'_j) while the memory usage of M' at the same two elements is max(H'_j + L_i, H_i + L'_{j-1}). Since the two elements are out of order, the hi-lo difference of S'[j] must be less than that of S[i], i.e., H'_j − L'_j < H_i − L_i. This implies H_i + L'_j > H'_j + L_i. Invariant 4 in Lemma 5.1 gives us L'_j > L'_{j-1}, which implies H_i + L'_j > H_i + L'_{j-1}. Thus, max(H'_j + L_{i-1}, H_i + L'_j) ≥ H_i + L'_j > max(H'_j + L_i, H_i + L'_{j-1}). Therefore, M' cannot use more memory than M.

By switching all adjacent pairs in M that are out of order until no such pair exists, we get an optimal order without increasing memory usage. □
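This ordering criterion is exactly what the merge step of the algorithm implements. Below is a minimal Python sketch of that merge (our reconstruction of the MergeSeq procedure described in Section 3, not the paper's exact pseudocode). Elements are (nodelist, hi, lo) triples; an element taken from one sequence is offset by the unadjusted lo of the last element taken from the other, since that much memory from the sibling subtree remains live:

```python
def merge_seq(s1, s2):
    """Merge two sibling subtree traversals, each a list of (nodes, hi, lo)
    elements, into one list in decreasing hi - lo difference order."""
    merged = []
    i = j = 0
    last_lo1 = last_lo2 = 0  # unadjusted lo of the last element taken from s1 / s2
    while i < len(s1) or j < len(s2):
        take_s1 = j == len(s2) or (
            i < len(s1) and s1[i][1] - s1[i][2] >= s2[j][1] - s2[j][2])
        if take_s1:
            nodes, hi, lo = s1[i]
            i += 1
            merged.append((nodes, hi + last_lo2, lo + last_lo2))
            last_lo1 = lo
        else:
            nodes, hi, lo = s2[j]
            j += 1
            merged.append((nodes, hi + last_lo1, lo + last_lo1))
            last_lo2 = lo
    return merged
```

Merging ⟨(AB, 23, 3)⟩ with ⟨(CD, 39, 9), (E, 25, 16)⟩ yields ⟨(CD, 39, 9), (AB, 32, 12), (E, 28, 19)⟩: the hi-lo differences 30, 20, 9 are decreasing, as Lemma 5.3 requires. (Combining adjacent elements that fall out of decreasing hi / increasing lo order is a separate step of the algorithm, omitted here.)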

THEOREM 5.4. Given an expression tree, the algorithm presented in Section 3 com-

putes a traversal that uses the least memory.

PROOF. We prove the correctness of the algorithm by describing a procedure that transforms any given traversal into the traversal found by the algorithm without an increase in memory usage in any transformation step. Given a traversal P for an expression tree T, we visit the nodes in T in a bottom-up manner and, for each non-leaf node v in T, we perform the following steps:

(1) Let T' be the subtree rooted at v and P' be the minimal substring of P that contains all the nodes from T' − {v}. In the following steps, we will rearrange the nodes in P' such that the nodes that form an indivisible unit in v.seq are contiguous and the indivisible units are in the same order as they are in v.seq.

(2) First, we sort the components of the indivisible units in v.seq so that they are in the same order as in v.seq. The sorting process involves rearranging two kinds of units. The first kind of units are the indivisible units in u.seq for each child u of v. The second kind of units are the contiguous sequences of nodes in P' which are from T − T'. For this sorting step, we temporarily treat each such maximal contiguous sequence of nodes as a unit. For each unit E of the second kind, we take E.hi = max_{w ∈ E} himem(w, P) and E.lo = lomem(w_n, P), where w_n is the last node in E. The sorting process is as follows.

    While there exist two adjacent units E' and E in P' such that E' is before E and E'.hi − E'.lo < E.hi − E.lo,
    (a) Swap E' and E. By Lemma 5.3, this does not increase the memory usage.
    (b) If two units of the second kind become adjacent to each other as a result of the swapping, combine the two units into one and recompute its new hi and lo.

When the above sorting process finishes, all units of the first kind, which are components of the indivisible units in v.seq, are in the order of decreasing hi-lo difference. Since, for each child u of v, the indivisible units in u.seq have been in the correct order before the sorting process, their relative order is not changed. The order of the nodes from T − T' is preserved because the sorting process never swaps two units of the second kind. Also, v and its ancestors do not appear in P', and nodes in units of the first kind are not ancestors or descendants of any nodes in units of the second kind. Therefore, the sorting process does not violate parent-child dependencies.

(3) Now that the components of the indivisible units in v.seq are in the correct order, we make the indivisible units contiguous using the following combining process.

    For each indivisible unit E in v.seq,
    (a) In the traversal P, if there are nodes from T − T' in between the nodes from E, move them either to the beginning or the end of E as specified by Lemma 5.2.
    (b) Make the contiguous sequence of nodes from E an indivisible unit.

Upon completion, each indivisible unit in v.seq is contiguous in P and the order in P of the indivisible units is the same as it is in v.seq. According to Lemma 5.2, moving 'foreign' nodes out of an indivisible unit does not increase the memory usage. Also, the order of the nodes from T − T' is preserved. Hence, the combining process does not violate parent-child dependencies.

We use induction to show that the above procedure correctly transforms any given traversal P into an optimal traversal found by the algorithm. The induction hypothesis H(u) for each node u is that:

—the nodes in each indivisible unit in u.seq appear contiguously in P and are in the same order as they are in u.seq, and
—the order in P of the indivisible units in u.seq is the same as it is in u.seq.

Initially, H(u) is true for every leaf node u because there is only one traversal order for a leaf node. As the induction step, assume H(u) is true for each child u of a node v. The procedure rearranges the nodes in P' such that the nodes that form an indivisible unit in v.seq are contiguous in P, the sets of nodes corresponding to the indivisible units are in the same order in P as they are in v.seq, and the order among the nodes that are not in the subtree rooted at v is preserved. Thus, when the procedure finishes processing a node v, H(v) becomes true. By induction, H(T.root) is true and a traversal found by the algorithm is obtained. Since any traversal P can be transformed into a traversal found by the algorithm without increasing memory usage in any transformation step, no traversal can use less memory and the algorithm is correct. □
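To summarize what the theorem certifies, the full procedure can be sketched in Python (our reconstruction from Sections 2-3; this simple version omits the data structures needed to achieve the Θ(n log₂ n) bound, so it may be quadratic in the worst case):

```python
def min_mem_traversal(sizes, children, root):
    """sizes: node -> data-object size; children: node -> list of child nodes.
    Returns (evaluation order, peak memory) for the subtree rooted at `root`."""

    def merge(s1, s2):
        # Merge sibling traversals by decreasing hi - lo difference (Lemma 5.3),
        # offsetting each element by the lo of the last element from the other list.
        out, i, j, lo1, lo2 = [], 0, 0, 0, 0
        while i < len(s1) or j < len(s2):
            if j == len(s2) or (i < len(s1) and
                                s1[i][1] - s1[i][2] >= s2[j][1] - s2[j][2]):
                nodes, hi, lo = s1[i]
                i += 1
                out.append((nodes, hi + lo2, lo + lo2))
                lo1 = lo
            else:
                nodes, hi, lo = s2[j]
                j += 1
                out.append((nodes, hi + lo1, lo + lo1))
                lo2 = lo
        return out

    def seq(v):
        s = []
        for c in children.get(v, ()):
            s = merge(s, seq(c))
        base = s[-1][2] if s else 0          # memory live after the children
        hi = base + sizes[v]                 # v allocated, children still live
        lo = hi - sum(sizes[c] for c in children.get(v, ()))  # children freed
        s.append(([v], hi, lo))
        # Combine trailing elements until the list is again in
        # decreasing hi / increasing lo order (forming indivisible units).
        while len(s) > 1 and (s[-2][1] <= s[-1][1] or s[-2][2] >= s[-1][2]):
            n2, h2, l2 = s.pop()
            n1, h1, l1 = s.pop()
            s.append((n1 + n2, max(h1, h2), l2))
        return s

    final = seq(root)
    order = [n for nodes, _, _ in final for n in nodes]
    return order, max(hi for _, hi, _ in final)
```

On a nine-node tree shaped like the paper's running example (I with children F and H; F with children B and E; B with child A; E with child D; D with child C; H with child G; sizes A=20, B=3, C=30, D=9, E=16, F=15, G=25, H=5, I=16), this sketch returns the traversal C, D, G, H, A, B, E, F, I with a peak usage of 39 units.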

6. CONCLUSION

In this paper, we have considered the memory usage optimization problem in the evaluation of expression trees involving large objects of different sizes. This problem arose in the context of optimizing electronic structure calculations. The developed solution would apply in any context involving the evaluation of an expression tree in which intermediate results are so large that it is impossible to keep all of them in memory at the same time. In such situations, it is necessary to dynamically allocate and deallocate space for the intermediate results and to find an evaluation order that uses the least memory. We have developed an efficient algorithm that finds an optimal evaluation order in Θ(n log₂ n) time for an expression tree containing n nodes and proved its correctness.

Acknowledgments

We would like to thank Tamal Dey for help with the complexity proof and for proofreading

the paper. We are grateful to the National Science Foundation for support of this work

through the Information Technology Research (ITR) program.

REFERENCES

APPEL, A. W. AND SUPOWIT, K. J. 1987. Generalizations of the Sethi-Ullman algorithm for register allocation.

Softw. Pract. Exper. 17, 6 (June), 417–421.

AULBUR, W. 1996. Parallel implementation of quasiparticle calculations of semiconductors and insulators. Ph.D.

thesis, The Ohio State University.

HYBERTSEN, M. S. AND LOUIE, S. G. 1986. Electronic correlation in semiconductors and insulators: Band

gaps and quasiparticle energies. Phys. Rev. B 34, 5390.

LAM, C.-C. 1999. Performance optimization of a class of loops implementing multi-dimensional integrals. Ph.D.

thesis, The Ohio State University. Also available as Technical Report No. OSU-CISRC-8/99-TR22, Dept. of

Computer and Information Science, The Ohio State University, August 1999.

LAM, C.-C., SADAYAPPAN, P., COCIORVA, D., ALOUANI, M., AND WILKINS, J. 1999. Performance optimiza-

tion of a class of loops involving sums of products of sparse arrays. In Ninth SIAM Conference on Parallel

Processing for Scientiﬁc Computing. Society for Industrial and Applied Mathematics, San Antonio, TX.

LAM, C.-C., SADAYAPPAN, P., AND WENGER, R. 1997a. On optimizing a class of multi-dimensional loops

with reductions for parallel execution. Parall. Process. Lett. 7, 2, 157–168.

LAM, C.-C., SADAYAPPAN, P., AND WENGER, R. 1997b. Optimization of a class of multi-dimensional integrals

on parallel machines. In Eighth SIAM Conference on Parallel Processing for Scientific Computing. Society

for Industrial and Applied Mathematics, Minneapolis, MN.

NAKATA, I. 1967. On compiling algorithms for arithmetic expressions. Commun. ACM 10, 492–494.

RAUBER, T. 1990a. Ein Compiler für Vektorrechner mit Optimaler Auswertung von vektoriellen Ausdrucksbäumen. Ph.D. thesis, University of Saarbrücken.

RAUBER, T. 1990b. Optimal Evaluation of Vector Expression Trees. In Proceedings of the 5th Jerusalem

Conference on Information Technology. IEEE Computer Society Press, Jerusalem, Israel, 467–473.

ROJAS, H. N., GODBY, R. W., AND NEEDS, R. J. 1995. Space-time method for Ab-initio calculations of

self-energies and dielectric response functions of solids. Phys. Rev. Lett. 74, 1827.

SETHI, R. 1975. Complete register allocation problems. SIAM J. Comput. 4, 3 (Sept.), 226–248.

SETHI, R. AND ULLMAN, J. D. 1970. The generation of optimal code for arithmetic expressions. J. ACM 17, 4 (Oct.), 715–728.

Received April 2002; revised TBD TBD; accepted TBD TBD

ACM Transactions on Programming Languages and Systems, Vol. TBD, No. TBD, TBD TBD.

RL. or I). or Pt) and halogen atoms (X = Cl. G is the orbital projected Green function for electrons and holes.R L k. The computation can be expressed in a canonical form as the following multi-dimensional summation of the product of several arrays. The problem of minimizing the number of arithmetic operations by applying the algebraic laws of commutativity. RL1] r. 1995].R L 1 (iτ ) = √ Ω dr e−i(k+G)r Ω R L θR L . and distributivity has been previously addressed in [Lam et al. RL2] × Y [r1.R L (−iτ ) where k. thereby computing some intermediate results that can be stored and reused instead of being recomputed many times. for example. Vol. RL3] × Y [r1. the multi-dimensional integral shown ACM Transactions on Programming Languages and Systems. r] × exp[G.G (iτ ) DR L . MX materials are linear chain compounds with alternating transition-metal atoms (M = Ni. associativity. r1] × exp[G1. as a multi-dimensional summation of the primary inputs to the computation. RL. The objective is to minimize the maximum memory usage during the evaluation of the entire expression tree. TBD TBD. and D is an intermediate array computed using a fast Fourier transform (FFT). χG. Y [r. and then proceed to discuss in detail the problem of memory-optimal evaluation of expression trees. Ψ is the localized basis function.R L (iτ ) and θR L . TBD. resulting in different alternative computational structures. in optimizing a class of loop calculations implementing multi-dimensional integrals of the products of several large input arrays to compute the electronic properties of semiconductors and metals [Hybertsen and Louie 1986.G DR L . Consider. The following multi-dimensional integral computes the susceptibility in momentum space for the determination of the self-energy of MX compounds in real space.G DR L . no intermediate quantities such as D are explicitly computed.RL3 ×G[RL1. .r1. for example. node data object has been evaluated. Using simple examples. 
it is not computationally attractive since the number of arithmetic operations is enormous (multiplying the extents of the various summation indices gives O(1022 ) operations).RL1. RL] × Y [r.R L ∗k. Pd. The number of operations can be considerably reduced by applying algebraic properties to factor out terms that are independent of some of the summation indices. One such application is the calculation of electronic properties of MX materials with the inclusion of many-body effects [Aulbur 1996]. This problem arises.RL2. The interpretation of the variables and their ranges are given in Table I.2 · Chi-Chung Lam et al.R L (r) = ΨR L (r − R ) Ψ∗ R L (r − R ) In the above equations. we brieﬂy touch upon some of the optimization issues that arise. Rojas et al. 1997b]. r1] When expressed in this form. where Ψ is written as a two-dimensional array Y . t] ×exp[k. TBD. There are many different ways of applying algebraic laws of distributivity and associativity. t] × G[RL2. Br. r] × exp[k. RL3.G (k. Although it requires less memory than the previous form.R L (r) GR L . No. 1997a. iτ ) = i R L .

in Figure 1(a). j] × B[j. k.e. TBD. For instance. It expresses the sequence of steps in computing the multi-dimensional integral as a sequence of formulae. k. Vol. One equivalent form that only requires 2 × Nj × Nk × Nl + 2 × Nj × Nk + Ni × Nj operations is given in Figure 1(b). No. k. each evaluating an intermediate result. j] f2 × B[j. TBD. k] = f [j. the internal nodes are intermediate results. If implemented directly as expressed (i. Each formula computes some intermediate result and the last formula gives the ﬁnal result. and the root node is the ﬁnal integral. k. However. the input arrays and the intermediate arrays are often so large that they cannot all ﬁt into available memory. l] (c) An expression tree representation of (b) Fig. k. assuming associative reordering of the operations and use of the distributive law of multiplication over addition is satisfactory for the ﬂoating-point computations. However. l] l 2 = f1 [j] × f3 [j.. k] f4 [j.j. Figure 1(c) shows the expression tree corresponding to the example formula sequence. l] f3 [j. . This problem has been proved to be NP-complete and a pruning search algorithm was proposed. k] W [k] = f5 [k] = A[i. l] (a) A multi-dimensional integral f1 [j] f2 [j.l) · 3 A[i. 1. l] × C[k. TBD TBD. A sequence of formulae can also be represented as an expression tree in which the leaf nodes are input arrays. The minimization of memory usage is thus desirable. l] × C[k. An example multi-dimensional integral and two representations of a computation.Memory-Optimal Evaluation of Expression Trees W [k] = (i. the next step is to implement it as some loop structure. the above computation can be rewritten in various ways. in practice. By fusing loops it is possible to ACM Transactions on Programming Languages and Systems. k] j 4 (b) A formula sequence for computing (a) f5 j f4 × f1 d d i f3 k A[i. l] d d C[k. Once the operation-minimal form is determined. A straightforward way is to generate a sequence of perfectly nested loops. 
j] i = B[j. the computation would require 2 × Ni × Nj × Nk × Nl arithmetic operations to compute. as a single set of perfectly-nested loops). l] = f [j.

In this paper. G. E. The size of each data object is shown alongside the corresponding node label. TBD. we formally deﬁne the memory usage optimization problem and make some observations about it. A simpler problem related to the memory usage minimization problem is the register allocation problem for binary expression trees in which the sizes of all nodes are unity. As an example of the memory usage optimization problem. space for it must be allocated. It has a maximum memory usage of 45 units. It has been addressed in [Nakata 1967. Lam 1999]. . G. B. it is not necessary to keep all arrays allocated for the duration of the entire computation. When allocating and deallocating arrays dynamically. where n is the number of nodes in the expression tree. G. D. This space can be deallocated only after the evaluation of its parent is complete. TBD TBD. 1999. The rest of this paper is organized as follows. Section 3 presents an efﬁcient algorithm for ﬁnding the optimal evaluation order for an expression ACM Transactions on Programming Languages and Systems. the evaluation order affects the maximum memory requirement. An example expression tree d d reduce the dimensionality of some of the intermediate arrays [Lam et al.g. we focus on the problem of ﬁnding an evaluation order of the nodes in a given expression tree that minimizes dynamic memory usage. E. I : 16 F : 15 H:5 B : 3 E : 16 G : 25 A : 20 D : 9 C : 30 Fig. the problem becomes NP-complete [Sethi 1975]. B. But if the expression tree is replaced by a directed acyclic graph (in which all nodes are still of unit size). e. for computing the electronic properties of MX materials. However. One of them is the post-order traversal A. C.4 · Chi-Chung Lam et al.. I of the expression tree. In Section 2. and H are in memory. Sethi and Ullman 1970] and can be solved in Ω(n) time. which uses 39 units of memory. which could result in non-optimal solutions (as we show through an example in Section 2). 2. I . Furthermore. Vol. 
they restricted their attention to solutions that evaluate subtrees contiguously. TBD. A. Before evaluating a data object. Space allocated for an intermediate array can be deallocated as soon as the operation using the array has been performed. H. The algorithm in [Sethi and Ullman 1970] for expression trees of unit-sized nodes does not extend directly to expression trees having nodes of different sizes. This occurs during the evaluation of H. F. The generalized register allocation problem for expression trees was addressed in the context of vector machines in [Rauber 1990a. consider the expression tree shown in Fig. H. F. 1990b]. Other evaluation orders may use more memory or less memory. 2. No. D. is not trivial. when F . Appel and Supowit [Appel and Supowit 1987] generalized the register allocation problem to higher degree expression trees of arbitrarily-sized nodes. We believe that the solution we develop here may have applicability to other areas such as database query optimization and data mining. Finding the optimal order C. A solution to this problem would allow the automatic generation of efﬁcient code for evaluating expression trees. There are many allowable evaluation orders of the nodes.

and (2) maxvi ∈P {himem(vi . Space for each data object is dynamically allocated/deallocated in its entirety. . P ) = lomem(A. Each data object depends on all its children (if any).e. For instance. P ) = lomem(vi−1 . G. P ) = A. TBD. vj . we will use these two terms interchangeably. E G F. The correctness of the algorithm is proved in Section 5.size for each node v ∈ T . internal node objects must be allocated before their evaluation begins. the space allocated to all its children may be released.size = 20. v2 .. F. 2. P ) = 0 if i = 0 Here. · 5 himem 0 + 20 = 20 20 + 3 = 23 3 + 30 = 33 33 + 9 = 42 12 + 16 = 28 19 + 15 = 34 15 + 25 = 40 40 + 5 = 45 20 + 16 = 36 45 deallocate A C D B. we need to allocate space for B. TBD. We assume that the evaluation of each node in the expression tree is atomic. In Section 4. 3. then i > j. Leaf node objects are created or read in as needed. E. P ) is the memory usage during the evaluation of vi in the traversal P . PROBLEM STATEMENT We address the problem of optimizing the memory usage in the evaluation of a given expression tree whose nodes correspond to large data objects of various sizes. 2. himem(vi . So. TBD TBD. C. consider the post-order traversal P = A. where himem(vi . Section 6 provides conclusions. . and lomem(vi . P ) is the memory usage upon completion of the same evaluation. an ordering P = v1 . and thus can be evaluated only after all its children have been evaluated. if vi is the parent of vj . Vol. .size himem(vi . P ) − {child vj of vi } vj . These deﬁnitions reﬂect that we need to allocate space for vi before its evaluation. such that (1) for all vi . i. where n is the number of nodes in T . P )} is minimized. himem(A. vn of the nodes in T . P ) + vi . H. we prove that the algorithm ﬁnds a solution in Θ(n log2 n) time for an n-node expression tree. . During and after the evaluation of A. 
H lomem 20 − 0 = 20 23 − 20 = 3 33 − 0 = 33 42 − 30 = 12 28 − 9 = 19 34 − 3 − 16 = 15 40 − 0 = 40 45 − 25 = 20 36 − 15 − 5 = 16 Memory usage of a post-order traversal of the expression tree in Fig. I of the expression tree shown in Fig. ﬁnd a computation of T that uses the least memory. We deﬁne the problem formally as follows: Given a tree T and a size v. and that after evaluation of vi . Since an evaluation order is also a traversal of the nodes. The goal is to ﬁnd an evaluation order of the nodes that uses the least amount of memory.size if i > 0 lomem(vi . For evaluating B. No. 2 tree. thus ACM Transactions on Programming Languages and Systems. B. and each object must remain in memory until the evaluation of its parent is completed. D. A is in memory. .Memory-Optimal Evaluation of Expression Trees Node A B C D E F G H I max Fig.

ACM Transactions on Programming Languages and Systems. G. A. H. I . B. is C. TBD. H. A. we present an efﬁcient algorithm that ﬁnds traversals which are not only locally optimal but also globally optimal. B. D. D. is not optimal in memory usage. none of the traversals that visit all nodes in one subtree before visiting another subtree is optimal. I that uses 44 units of memory (see Fig. 4(b)). Its overall memory usage is 44 units. D. Thus. which is G. H. F. A. B. TBD. No. E. P ) + B.size = 23. Vol. The optimal traversal. F. If we follow the traditional wisdom of visiting the subtree with the higher memory usage ﬁrst. TBD TBD. F and G. H. In this example. E. However. E. 3. if we merge together C. I (see Fig. respectively) and then append I. A. F. G. E. C. A. 4(a)). D. Therefore. I . as shown in Fig.6 · Chi-Chung Lam et al. C. H. B. E. 4. C. F. D. But. Node G H C D E A B F I max himem 25 30 35 44 30 41 44 39 36 44 lomem 25 5 35 14 21 41 24 20 16 Node C D G H A B E F I max himem 30 39 34 39 34 37 33 39 36 39 lomem 30 9 34 14 34 17 24 20 16 (a) A better traversal Fig. P ) − A. giving lomem(B. A. B. one from each child of a node X. E. P ) = lomem(A. A. I . H to form C. The memory usage for the rest of the nodes is determined similarly and shown in Fig. D. A. we obtain the best of these four traversals. any algorithm that only considers traversals that visit subtrees contiguously may not produce an optimal solution. P ) = himem(B. H. D. and is not optimal. F. 2 himem(B. B.size = 3. H. F. One might attempt to take two optimal subtree traversals. For example. as in the Sethi-Ullman algorithm [Sethi and Ullman 1970]. C. G. D. this resulting traversal may not be optimal for X. D. 4(a). for the subtree rooted at F . E. E. B. the other optimal traversal C. F. which uses only 39 units of memory. There are four such traversals: A. the traversals C. merge them together optimally. C. F and C. E. E. D. F both use the least memory space of 39 units. B. G. D. 
locally optimal traversals may not be globally optimal. G. C. H (which are optimal for the subtrees rooted at F and H. H. Notice that it ‘jumps’ back and forth between the subtrees. B. however. D. . A. F. A. The post-order traversal of the given expression tree. F for the subtree rooted at F can be merged with G. I and G. A can be deallocated. B. After B is obtained. I (with I appended). E. I . B. and then append X to form a traversal for X. A. (b) The optimal traversal Memory usage of two different traversals of the expression tree in Fig. The memory usage optimization problem has an interesting property: an expression tree or a subtree may have more than one optimal traversal. the best we can get is a suboptimal traversal G. Continuing with the above example. E. B. In the next section. which is an optimal traversal of the entire expression tree.

the optimal traversal for the subtree rooted at A. Moreover. Since A has no children. Here. Therefore. not all locally optimal traversals for a subtree can be used to form an optimal traversal for a larger tree. but as an ordered list of indivisible units or elements. hi is the highest himem among the nodes in nodelist. No. hi is the maximum memory usage during the evaluation of the nodes in nodelist. which implies in order of decreasing hi-lo difference. we illustrate how it works with an example. To form B.seq. 20. where nodelist is an ordered list of nodes. 3) . we take A. and lo is the memory usage after those nodes are evaluated. 3) . as we show later. We visit the nodes in a bottom-up order.seq. and lo is the lomem of the last node in nodelist. the indivisible units must appear in the same order in some optimal traversal of the entire expression tree. respectively. but also globally optimal in the sense that it can be merged together with globally optimal traversals for other subtrees to form an optimal traversal for a larger tree that is also globally optimal. TBD. 3) (the hi of which is adjusted by the lo of the preceding element). Using the terminology from Section 2. Each element contains an ordered list of nodes with the property that there necessarily exists some globally optimal traversal of the entire tree wherein this sequence appears undivided. Before formally describing the algorithm.e. An element initially contains a single node. 2. TBD TBD. 20). The algorithm stores a traversal not as an ordered list of nodes. Vol. 20) . But as the algorithm goes up the tree merging traversals together and appending new nodes to them. A and B form an indivisible ACM Transactions on Programming Languages and Systems. is (A. . The ﬁrst element in a traversal contains the nodes from the beginning of the traversal up to and including the node that has the lowest lomem after the node that has the highest himem. 
inserting any node in between the nodes of an element does not lower the total memory usage. Consider the expression tree shown in Fig. it computes an optimal traversal for the subtree rooted at that node. The highest himem and the lowest lomem become the hi and the lo of the ﬁrst element. For each node in the expression tree. AN EFFICIENT ALGORITHM We now present an efﬁcient divide-and-conquer algorithm that. we prove that arranging the indivisible units in this order minimizes memory usage. 3 + 20. elements may be appended together to form new elements containing a larger number of nodes. Whenever two adjacent elements are not in decreasing hi and increasing lo order.seq becomes (AB. i. TBD. and so on. In Section 5. (For clarity. hi.Memory-Optimal Evaluation of Expression Trees · 7 3. This means that indivisible units can be treated as a whole and we only need to consider the relative order of indivisible units from different subtrees. ﬁnds an evaluation order of the tree that minimizes the memory usage.seq. we combine them into one element by concatenating the nodelists and taking the highest hi and the second lo. The same rules are recursively applied to the remainder of the traversal to form the second element. We can also visualize the elements in a traversal as follows. 23. append a new element (B. Thus. meaning that 20 units of memory are needed during the evaluation of A and immediately afterwards. (B.. The optimal subtree traversal that it computes has a special property: it is not only locally optimal for the subtree. lo) triple. Each element (or indivisible unit) in a traversal is a (nodelist. The algorithm always maintains the elements of a traversal in decreasing hi and increasing lo order. and get (A. 23. we write nodelists in a sequence as strings). the order of indivisible units in a traversal stays invariant. 20. As we have seen in Section 2. denoted by A. given an expression tree whose nodes are large data objects. B.

The maximum memory usage during the evaluation of this indivisible unit is 23 and the result of the evaluation occupies 3 units of memory, implying that B must follow A in some optimal traversal of the entire expression tree. Similarly, we get E.seq, a sequence of only two indivisible units. Note that these two adjacent elements cannot be combined because they are already in decreasing hi and increasing lo order.

For node F, which has two children B and E, we merge B.seq and E.seq in the order of decreasing hi-lo difference, with the hi and lo values of each element adjusted by the lo value of the last merged element from the other subtree. Then, we append to F.seq the new element for F, the hi of which is again adjusted by the lo of the preceding element. The new element is combined with the last two elements in F.seq to keep the elements in decreasing hi and increasing lo order. The optimal traversals for the other nodes are computed in the same way and are shown in Fig. 5. At the end, the algorithm returns the optimal traversal C, D, G, H, A, B, E, F, I for the entire expression tree (see Fig. 4(b)).

[Fig. 5. Optimal traversals for the subtrees in the expression tree in Fig. 4(a), listing for each node v the subtree traversals merged in decreasing hi-lo order and the optimal traversal v.seq.]

The input to the algorithm (the MinMemTraversal procedure) is an expression tree T, in which each node v has a field v.size denoting the size of its data object. Fig. 6 shows the algorithm. The procedure performs a bottom-up traversal of the tree and, for each node v, computes an optimal traversal v.seq for the subtree rooted at v. The optimal traversal v.seq is obtained by optimally merging together the optimal traversals u.seq from each child u of v, and then appending v itself. At the end, the procedure returns a concatenation of all the nodelists in T.root.seq as the optimal traversal for the given expression tree. The memory usage of the optimal traversal is T.root.seq[1].hi.

The MergeSeq procedure optimally merges two given traversals S1 and S2 and returns the merged result S. Here, S1 and S2 are subtree traversals of two children nodes of the same parent. The optimal merge is performed in a fashion similar to merge-sort. Elements from S1 and S2 are scanned sequentially and appended into S in the order of decreasing hi-lo difference. Since S1 and S2 are formed independently, the hi and lo values in the elements from S1 and S2 must be adjusted before they can be appended to S. The amount of adjustment for an element from S1 (S2) equals the lo value of the last merged element from S2 (S1), which is kept in variable base1 (base2).
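The adjustment rule can be illustrated with a small calculation (the element names and numbers here are ours, not Fig. 5's): once one child's last merged element leaves a result of size 9 in memory, a sibling element with hi 23 and lo 3 must carry that 9 as a base.

```python
# The sibling's element is evaluated while the other child's result
# (size 9) is still held, so both hi and lo are shifted by that base.
base = 9                 # lo of the last merged element from the other subtree
ab_hi, ab_lo = 23, 3     # the element's own hi and lo
print(("AB", ab_hi + base, ab_lo + base))  # ('AB', 32, 12)
```

The hi-lo difference (here 20) is unchanged by the adjustment, which is why the merge order can be decided from the unadjusted differences.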

Table I. Variables in an example physics computation.

  Variable   Meaning                                       Range
  RL         Orbital                                       10^3
  r          Discretized points in real space              10^5
  τ          Time step                                     10^2
  k          K-point in an irreducible Brillouin zone      10
  G          Reciprocal lattice vector                     10^3

  MinMemTraversal (T):
    foreach node v in some bottom-up traversal of T
      v.seq = ⟨⟩
      foreach child u of v
        v.seq = MergeSeq(v.seq, u.seq)
      if |v.seq| > 0 then                    // |x| is the length of x
        base = v.seq[|v.seq|].lo
      else
        base = 0
      AppendSeq(v.seq, ⟨v⟩, v.size + base, v.size)
    nodelist = ⟨⟩
    for i = 1 to |T.root.seq|
      nodelist = nodelist + T.root.seq[i].nodelist   // + is concatenation
    return nodelist                          // memory usage is T.root.seq[1].hi

  MergeSeq (S1, S2):
    S = ⟨⟩
    i = j = 1
    base1 = base2 = 0
    while i ≤ |S1| or j ≤ |S2|
      if j > |S2| or (i ≤ |S1| and S1[i].hi − S1[i].lo > S2[j].hi − S2[j].lo) then
        AppendSeq(S, S1[i].nodelist, S1[i].hi + base1, S1[i].lo + base1)
        base2 = S1[i].lo
        i++
      else
        AppendSeq(S, S2[j].nodelist, S2[j].hi + base2, S2[j].lo + base2)
        base1 = S2[j].lo
        j++
    end while
    return S

  AppendSeq (S, nodelist, hi, lo):
    E = (nodelist, hi, lo)                   // new element to append to S
    i = |S|
    while i ≥ 1 and (E.hi ≥ S[i].hi or E.lo ≤ S[i].lo)
      E = (S[i].nodelist + E.nodelist, max(S[i].hi, E.hi), E.lo)   // combine S[i] with E
      remove S[i] from S
      i––
    end while
    S = S + E                                // |S| is now i + 1

Fig. 6. Procedure for finding a memory-optimal traversal of an expression tree.
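Fig. 6 translates almost line-for-line into executable code. The sketch below is ours (the paper specifies only the `size` field; the `Node` class, list-based elements, and the example tree are assumptions for illustration):

```python
class Node:
    def __init__(self, name, size, *children):
        self.name, self.size = name, size
        self.children = list(children)
        self.seq = []  # elements [nodelist, hi, lo]

def append_seq(S, nodelist, hi, lo):
    # AppendSeq: combine with trailing elements whose hi is not higher
    # than hi or whose lo is not lower than lo.
    while S and (hi >= S[-1][1] or lo <= S[-1][2]):
        prev = S.pop()
        nodelist, hi = prev[0] + nodelist, max(prev[1], hi)
    S.append([nodelist, hi, lo])

def merge_seq(S1, S2):
    # MergeSeq: scan by decreasing hi-lo difference, adjusting by the
    # lo of the last element taken from the other sequence.
    S, i, j, base1, base2 = [], 0, 0, 0, 0
    while i < len(S1) or j < len(S2):
        if j >= len(S2) or (i < len(S1) and
                S1[i][1] - S1[i][2] > S2[j][1] - S2[j][2]):
            nl, hi, lo = S1[i]; i += 1
            append_seq(S, nl, hi + base1, lo + base1)
            base2 = lo
        else:
            nl, hi, lo = S2[j]; j += 1
            append_seq(S, nl, hi + base2, lo + base2)
            base1 = lo
    return S

def min_mem_traversal(root):
    """MinMemTraversal: returns (evaluation order, peak memory usage)."""
    for u in root.children:
        min_mem_traversal(u)          # fills u.seq
        root.seq = merge_seq(root.seq, u.seq)
    base = root.seq[-1][2] if root.seq else 0
    append_seq(root.seq, [root], root.size + base, root.size)
    order = [n for elem in root.seq for n in elem[0]]
    return order, root.seq[0][1]

# A small tree (sizes ours): R(1) has children X(10) and Y(1), and Y's
# only child is Z(8).  Evaluating Z and Y first frees Z early, so the
# peak is 12 units instead of the 19 needed if X is evaluated first.
r = Node("R", 1, Node("X", 10), Node("Y", 1, Node("Z", 8)))
order, mem = min_mem_traversal(r)
print([n.name for n in order], mem)  # ['Z', 'Y', 'X', 'R'] 12
```

The subtree with the larger hi-lo difference (Y's, which peaks at 9 but leaves only 1 unit behind) is scheduled before the flat subtree X, exactly as the decreasing-difference rule prescribes.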

The AppendSeq procedure appends the new element specified by the triple (nodelist, hi, lo) to the given traversal S. Before the new element E is appended to S, it is combined with elements at the end of S whose hi is not higher than E.hi or whose lo is not lower than E.lo. The combined element has the concatenated nodelist and the highest hi but the original E.lo.

This algorithm has the property that the traversal it finds for a subtree T is not only optimal for T but must also appear as a subsequence in some optimal traversal for any larger tree that contains T as a subtree. For example, E.seq is a subsequence in F.seq, which is in turn a subsequence in I.seq (see Fig. 5).

4. COMPLEXITY OF THE ALGORITHM

The time complexity of our algorithm is Θ(n log² n) for an n-node expression tree. The cost of constructing the final evaluation order consists of the cost for building sequences of indivisible units and the cost for combining indivisible units into larger indivisible units. Combining indivisible units into larger ones reduces the number of elements in the sequences and, therefore, the time required for merging and combining sequences. For finding an upper bound on the cost of the algorithm, the worst-case cost for building sequences can therefore be analyzed separately from the worst-case cost for combining indivisible units.

We represent a sequence of indivisible units as a red-black tree with the indivisible units at the leaves. The tree is sorted by decreasing hi-lo difference. In addition, the leaves are linked in sorted order in a doubly-linked list, and a count of the number of indivisible units in the sequence is maintained. Inserting a node into a sequence represented as a red-black tree costs O(log n) time.

The sequence for a leaf node of the expression tree can be constructed in constant time. For a unary interior node, we simply append the node to the sequence of its subtree, which costs O(log n) time. For an m-ary interior node, we merge the sequences of the subtrees by inserting the nodes from the smaller sequences into the largest sequence, with each insertion costing O(log n) time. Since we always insert the nodes of the smaller sequences into the largest one, every time a given node of the expression tree gets inserted into a sequence, the size of the sequence containing this node at least doubles. Each node, therefore, can be inserted into a sequence at most O(log n) times. The cost for building the traversal for the entire expression tree is, therefore, O(n log² n).

Two individual indivisible units can be combined in constant time. When combining two adjacent indivisible units within a sequence, one of them must be deleted from the sequence and the red-black tree must be re-balanced, which costs O(log n) time. Since there can be at most n − 1 of these combine operations, the total cost is O(n log n). The cost of the whole algorithm is, therefore, dominated by the cost for building sequences, which is O(n log² n).

In the best case, indivisible units are always combined such that each sequence contains a single element; in this case, the algorithm only takes linear time. In the worst case, no such combining occurs. For example, a degenerate expression tree consists of small nodes and large nodes such that every small node has as its only child a large node. A pair of such nodes will form an indivisible unit with the hi being the size of the large node and the lo being the size of the small node. Such a tree can be constructed such that these indivisible units will not further combine into larger indivisible units. In this case, the algorithm will result in a sequence of n/2 indivisible units containing two nodes each. If such a degenerate expression tree is also unbalanced, the algorithm requires Ω(n log² n) time for computing the optimal traversal.
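The size-doubling argument behind the O(n log² n) bound can be checked empirically. The sketch below (our construction; sets stand in for the red-black trees) merges singleton sequences pairwise, always inserting the smaller into the larger, and counts insertions per element:

```python
import random

def count_small_to_large(n):
    """Merge n singleton sequences, always inserting the elements of
    the smaller sequence into the larger one; return the maximum
    number of times any single element was inserted."""
    seqs = [{i} for i in range(n)]
    inserted = [0] * n
    while len(seqs) > 1:
        a, b = random.sample(range(len(seqs)), 2)
        small, large = sorted((seqs[a], seqs[b]), key=len)
        for x in small:              # insert smaller into larger
            large.add(x)
            inserted[x] += 1
        seqs = [s for k, s in enumerate(seqs) if k not in (a, b)]
        seqs.append(large)
    return max(inserted)

random.seed(0)
worst = count_small_to_large(1024)
print(worst)  # never exceeds log2(1024) = 10
```

Each time an element is inserted, the sequence containing it at least doubles, so with 1024 elements no element can be inserted more than 10 times, regardless of the (here random) merge order.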

5. CORRECTNESS OF THE ALGORITHM

We now show the correctness of the algorithm. The proof proceeds as follows. Lemma 1 characterizes the indivisible units in a subtree traversal by providing some invariants. This lemma allows us to treat each traversal as a sequence of indivisible units (each containing one or more nodes) instead of a list of the individual nodes. The second lemma asserts the 'indivisibility' of an indivisible unit by showing that unrelated nodes inserted in between the nodes of an indivisible unit can always be moved to the beginning or the end of the indivisible unit without increasing memory usage. Thus, once an indivisible unit is formed, we do not need to consider breaking it up later, and each indivisible unit can be considered as a whole. Lemma 3 deals with the optimal ordering of indivisible units from different subtrees. Finally, using the three lemmas, we prove the correctness of the algorithm by arguing that any traversal of an expression tree can be transformed in a series of steps into the optimal traversal found by the algorithm without increasing memory usage.

The first lemma establishes some important invariants about the indivisible units in an ordered list v.seq that represents a traversal. The hi value of an indivisible unit is the highest himem of the nodes in the indivisible unit; the lo value is the lomem of the last node in the indivisible unit. Given a sequence of nodes, an indivisible unit extends to the last node with the lowest lomem following the last node with the highest himem.

LEMMA 5.1. Let v be any node in an expression tree, S = v.seq, and P be the traversal represented by S of the subtree rooted at v, i.e., P = S[1].nodelist + · · · + S[|S|].nodelist. For all 1 ≤ i ≤ |S|, let S[i].nodelist = v1, v2, . . . , vn and let vm be the last node in S[i].nodelist that has the maximum himem value. The algorithm maintains the following invariants:

(1) S[i].hi = himem(vm, P);
(2) S[i].lo = lomem(vn, P);
(3) for all 1 ≤ k ≤ n, himem(vk, P) ≤ himem(vm, P), and for all m ≤ k ≤ n, lomem(vk, P) ≥ lomem(vn, P);
(4) for all 1 ≤ j < i, (a) S[j].hi > S[i].hi and (b) S[j].lo < S[i].lo, i.e., the indivisible units in a sequence are in decreasing hi and increasing lo order.

Proof The above invariants are true by construction.

Lemma 2 shows that any nodes that lie in between the nodes of an indivisible unit, but are unrelated to them, can be moved out of the unit.

LEMMA 5.2. Let v be a node in an expression tree T, S = v.seq, and P be a traversal of T in which the nodes from S[i].nodelist appear in the same order as they are in S[i].nodelist, but not contiguously. Then, the nodes that are in between the nodes in S[i].nodelist can always be moved to the beginning or the end of S[i].nodelist without increasing memory usage, provided that none of the nodes that are in between the nodes in S[i].nodelist are ancestors or descendants of any nodes in S[i].nodelist.
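Invariant 4 is easy to check mechanically. A minimal sketch (the checker and the sample numbers are ours, for illustration only):

```python
def check_element_order(S):
    """Invariant 4 of Lemma 1: the elements of a traversal appear in
    strictly decreasing hi and strictly increasing lo order."""
    return all(S[j][0] > S[j + 1][0] and S[j][1] < S[j + 1][1]
               for j in range(len(S) - 1))

# (hi, lo) pairs only; a valid traversal ...
assert check_element_order([(39, 9), (32, 12), (13, 16)])
# ... and the pre-combination pair (A, 20, 20), (B, 23, 3) from
# Section 3, which violates the order and is therefore combined.
assert not check_element_order([(20, 20), (23, 3)])
```

A sequence that fails this check is exactly one on which AppendSeq's combine loop would still fire.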

Proof Let S[i].nodelist = v1, . . . , vn and let vm be the last node in S[i].nodelist that has the maximum himem value. Let v′1, . . . , v′b be the 'foreign' nodes, i.e., the nodes that are in between the nodes in S[i].nodelist in P, where v′1, . . . , v′a are those that precede vm, and let v′0 be the node before v′1 in S. Let Q be the traversal obtained from P by removing the nodes in S[i].nodelist; the memory usage at the foreign nodes in both traversals can be expressed relative to Q. We construct another traversal P′ of T from P by moving v′1, . . . , v′a (not necessarily contiguously) before vm and v′a+1, . . . , v′b (not necessarily contiguously) after vm in P; i.e., to form P′, we replace v1, . . . , vn, v′1, . . . , v′b in P with v′1, . . . , v′a, v1, . . . , vn, v′a+1, . . . , v′b. Traversals P and P′ differ in memory usage only at the nodes {v1, . . . , vn, v′1, . . . , v′b}. P′ does not use more memory than P because:

(1) The memory usage for P′ at vm is the same as the memory usage for P at vm, since himem(vm, P′) = himem(vm, P): exactly the foreign nodes v′1, . . . , v′a are evaluated before vm in both traversals.

(2) For all 1 ≤ k ≤ n, the memory usage for P′ at vk is no higher than the memory usage for P′ at vm, since himem(vk, S) ≤ himem(vm, S) (by invariant 3 in Lemma 1) and, for all m ≤ k ≤ n, lomem(vk, S) ≥ lomem(vn, S).

(3) For all 1 ≤ j ≤ a, the memory usage for P′ at v′j is no higher than the memory usage for P at v′j, since moving v′j before the indivisible unit removes from memory, at v′j, the objects of S[i].nodelist that were alive there in P.

(4) For all a < j ≤ b, the memory usage for P′ at v′j is no higher than the memory usage for P at v′j, since in P′ only the last result of the indivisible unit, of size lomem(vn, P), remains alive from S[i].nodelist at v′j, while in P some node vk with m ≤ k ≤ n was the last evaluated before v′j and lomem(vk, P) ≥ lomem(vn, P) (by invariant 3 in Lemma 1).

Since the memory usage of any node in v1, . . . , vn after moving the foreign nodes cannot exceed that of vm, and the memory usage of the foreign nodes does not increase as a result of moving them, the overall maximum memory usage cannot increase.

The next lemma deals with the ordering of indivisible units. It shows that arranging indivisible units from different subtrees in the order of decreasing hi-lo difference minimizes memory usage, since two indivisible units that are not in that order can be interchanged in the merged traversal without increasing memory usage.

LEMMA 5.3. Let v and v′ be two nodes in an expression tree that are siblings of each other, S = v.seq, and S′ = v′.seq. Then, among all possible merges of S and S′, the merge that arranges the elements from S and S′ in the order of decreasing hi-lo difference uses the least memory.

Proof Let M be a merge of S and S′ that is not in the order of decreasing hi-lo difference. Then there exists an adjacent pair of elements, one from each of S and S′, that are not in that order. Without loss of generality, we assume the first element is S′[j] from S′ and the second one is S[i] from S.

Since the two elements are out of order, the hi-lo difference of S′[j] must be less than that of S[i]. To simplify the notation, let Hr = S[r].hi, Lr = S[r].lo, H′r = S′[r].hi, and L′r = S′[r].lo; thus H′j − L′j < Hi − Li. Consider the merge M′ obtained from M by interchanging S′[j] and S[i]. The memory usage of M and M′ differs only at S′[j] and S[i] and is compared in Fig. 7. The memory usage of M at the two elements is max(H′j + Li−1, Hi + L′j), while the memory usage of M′ at the same two elements is max(Hi + L′j−1, H′j + Li). Since the two elements are out of order, H′j − L′j < Hi − Li, which implies Hi + L′j > H′j + Li. Invariant 4 in Lemma 1 gives us L′j > L′j−1, which implies Hi + L′j > Hi + L′j−1. Thus, max(H′j + Li−1, Hi + L′j) ≥ Hi + L′j > max(H′j + Li, Hi + L′j−1). Therefore, M′ cannot use more memory than M. By switching all adjacent pairs in M that are out of order until no such pair exists, we get an optimal order without increasing memory usage.

  element:   . . . , S′[j],      S[i],     . . .   |   . . . , S[i],       S′[j],    . . .
  himem:             H′j + Li−1, Hi + L′j          |           Hi + L′j−1, H′j + Li
  lomem:             L′j + Li−1, Li + L′j          |           Li + L′j−1, L′j + Li
             (a) Sequence M                        |   (b) Sequence M′

Fig. 7. Memory usage comparison of two traversals in Lemma 5.3.

THEOREM 5.4. Given an expression tree, the algorithm presented in Section 3 computes a traversal that uses the least memory.

Proof We prove the correctness of the algorithm by describing a procedure that transforms any given traversal to the traversal found by the algorithm without increase in memory usage in any transformation step. Given a traversal P for an expression tree T, we visit the nodes in T in a bottom-up manner and, for each non-leaf node v in T, we perform the following steps:

(1) Let T′ be the subtree rooted at v and P′ be the minimal substring of P that contains all the nodes from T′ − {v}. In the following steps, we will rearrange the nodes in P′ such that the nodes that form an indivisible unit in v.seq are contiguous and the indivisible units are in the same order as they are in v.seq.

(2) First, we sort the components of the indivisible units in v.seq so that they are in the same order as in v.seq. The sorting process involves rearranging two kinds of units. The first kind of units are the indivisible units in u.seq for each child u of v. The second kind of units are the maximal contiguous sequences of nodes in P′ which are from T − T′; for this sorting step, we temporarily treat each such maximal contiguous sequence of nodes as a unit. For each unit E of the second kind, we take E.hi = max over w in E of himem(w, P) and E.lo = lomem(wn, P), where wn is the last node in E. While there exist two adjacent units E and E′ in P′ such that E is before E′ and E.hi − E.lo < E′.hi − E′.lo:
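The inequality chain in the proof can be spot-checked numerically. The sketch below (ours, not from the paper) generates random adjacent pairs satisfying Lemma 1's invariants and the out-of-order condition, and confirms that swapping them strictly lowers the pairwise peak, so the overall memory usage cannot increase:

```python
import random

random.seed(1)
for _ in range(10_000):
    # S[i-1], S[i] from one sibling; S'[j-1], S'[j] from the other.
    # Lemma 1, invariant 4: lo increases within each sequence.
    L_i1 = random.randint(0, 50)            # lo of S[i-1]
    L_i = L_i1 + random.randint(1, 50)      # lo of S[i]
    H_i = L_i + random.randint(1, 100)      # hi of S[i]
    Lp_j1 = random.randint(0, 50)           # lo of S'[j-1]
    Lp_j = Lp_j1 + random.randint(1, 50)    # lo of S'[j]
    # Out-of-order condition: H'j - L'j < H_i - L_i.
    Hp_j = Lp_j + random.randint(0, H_i - L_i - 1)
    m = max(Hp_j + L_i1, H_i + Lp_j)          # peak of M  at S'[j], S[i]
    m_swapped = max(H_i + Lp_j1, Hp_j + L_i)  # peak of M' at S[i], S'[j]
    assert m_swapped < m    # swapping the out-of-order pair always helps
print("ok")
```

This mirrors the proof exactly: Hi + L′j exceeds both Hi + L′j−1 (since L′j > L′j−1) and H′j + Li (since H′j − L′j < Hi − Li), so it dominates the swapped maximum.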

(a) Swap E and E′. By Lemma 3, this does not increase the memory usage.
(b) If two units of the second kind become adjacent to each other as a result of the swapping, combine the two units into one and recompute its new hi and lo.

When the above sorting process finishes, the sets of nodes corresponding to the indivisible units are in the same order in P′ as they are in v.seq. The order of the nodes from T − T′ is preserved because the sorting process never swaps two units of the second kind. Since v and its ancestors do not appear in P′, and nodes in units of the first kind are not ancestors or descendants of any nodes in units of the second kind, the sorting process does not violate parent-child dependencies. Also, since the indivisible units in u.seq, which are components of the indivisible units in v.seq, have been in the correct order before the sorting process, their relative order is not changed.

(3) Now that the components of the indivisible units in v.seq are in the correct order, we make the indivisible units contiguous using the following combining process. For each indivisible unit E in v.seq, if there are nodes from T − T′ in between the nodes from E, move them either to the beginning or the end of E as specified by Lemma 2, and make the contiguous sequence of nodes from E an indivisible unit. According to Lemma 2, moving 'foreign' nodes out of an indivisible unit does not increase the memory usage. Again, the order of the nodes from T − T′ is preserved and the combining process does not violate parent-child dependencies.

We use induction to show that the above procedure correctly transforms any given traversal P into an optimal traversal found by the algorithm. The induction hypothesis H(u) for each node u is that:

—the nodes in each indivisible unit in u.seq appear contiguously in P and are in the same order as they are in u.seq, and
—the order in P of the indivisible units in u.seq is the same as they are in u.seq.

Initially, H(u) is true for every leaf node u because there is only one traversal order for a leaf node. As the induction step, assume H(u) is true for each child u of a node v. By Lemma 3, all units of the first kind, which are components of the indivisible units in v.seq, are arranged in the order of decreasing hi-lo difference, so when the procedure finishes processing node v, each indivisible unit in v.seq is contiguous in P, the indivisible units appear in P in the same order as they are in v.seq, and H(v) becomes true. By induction, upon completion, H(T.root) is true and a traversal found by the algorithm is obtained. Since any traversal P can be transformed into a traversal found by the algorithm without increasing memory usage in any transformation step, no traversal can use less memory and the algorithm is correct.

6. CONCLUSION

In this paper, we have considered the memory usage optimization problem in the evaluation of expression trees involving large objects of different sizes. This problem arose in

the context of optimizing electronic structure calculations, in which intermediate results are so large that it is impossible to keep all of them in memory at the same time. In such situations, it is necessary to dynamically allocate and deallocate space for the intermediate results and to find an evaluation order that uses the least memory. We have developed an efficient algorithm that finds an optimal evaluation order in Θ(n log² n) time for an expression tree containing n nodes and proved its correctness. The developed solution would apply in any context involving the evaluation of an expression tree.

Acknowledgments

We would like to thank Tamal Dey for help with the complexity proof and for proofreading the paper. We are grateful to the National Science Foundation for support of this work through the Information Technology Research (ITR) program.

Received April 2002; revised TBD; accepted TBD
