Memory-Optimal Evaluation of Expression Trees Involving Large Objects

CHI-CHUNG LAM
The Ohio State University
THOMAS RAUBER
Martin-Luther University Halle-Wittenberg
GERALD BAUMGARTNER, DANIEL COCIORVA, and P. SADAYAPPAN
The Ohio State University
The need to evaluate expression trees involving large objects arises in scientific computing appli-
cations such as electronic structure calculations. Often, the tree node objects are so large that
only a subset of them can fit into memory at a time. This paper addresses the problem of finding
an evaluation order of the nodes in a given expression tree that uses the least amount of memory.
We present an algorithm that finds an optimal evaluation order in Θ(n log² n) time for an n-node
expression tree and prove its correctness.
Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—Code generation; Opti-
mization; I.1.3 [Algebraic Manipulation]: Languages and Systems—Evaluation strategies; I.2.1 [Artificial
Intelligence]: Automatic Programming—Program synthesis; J.2 [Computer Applications]: Physical Sciences
and Engineering—Chemistry; Physics
General Terms: Algorithms, Languages
Additional Key Words and Phrases: Expression tree, evaluation order, memory minimization
1. INTRODUCTION
This paper addresses the problem of finding an evaluation order of the nodes in a given
expression tree that minimizes memory usage. The expression tree must be evaluated in
some bottom-up order, i.e., the evaluation of a node cannot precede the evaluation of any of
its children. The nodes of the expression tree are large data objects whose sizes are given.
If the total size of the data objects is so large that they cannot all fit into memory at the
same time, space for the data objects has to be allocated and deallocated dynamically. Due
to the parent-child dependence relation, a data object cannot be deallocated until its parent
This work was supported in part by the National Science Foundation under grants DMR-9520319 and CHE-
0121676.
Authors’ address: C. Lam, G. Baumgartner, D. Cociorva, and P. Sadayappan, Department of Computer and Infor-
mation Science, The Ohio State University, Columbus, OH 43210; email: {clam; gb; cociorva; saday}@cis.ohio-
state.edu.
Author’s address: T. Rauber, Martin-Luther-University Halle-Wittenberg, Fachbereich Mathematik und Infor-
matik, Von-Seckendorff-Platz 1, D-06120 Halle/Saale, Germany; email: rauber@informatik.uni-halle.de.
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use
provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server
notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the
ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific
permission and/or a fee.
© 2002 ACM 0164-0925/02/0100-TBD $00.02
ACM Transactions on Programming Languages and Systems, Vol. TBD, No. TBD, TBD TBD, Pages 1–15.
node data object has been evaluated. The objective is to minimize the maximum memory
usage during the evaluation of the entire expression tree.
This problem arises, for example, in optimizing a class of loop calculations implement-
ing multi-dimensional integrals of the products of several large input arrays to compute
the electronic properties of semiconductors and metals [Hybertsen and Louie 1986; Rojas
et al. 1995]. One such application is the calculation of electronic properties of MX materi-
als with the inclusion of many-body effects [Aulbur 1996]. MX materials are linear chain
compounds with alternating transition-metal atoms (M = Ni, Pd, or Pt) and halogen atoms
(X = Cl, Br, or I). The following multi-dimensional integral computes the susceptibility in
momentum space for the determination of the self-energy of MX compounds in real space.
    χ_{G,G′}(k, iτ) = i Σ_{RL,R′L′} D^{k,G}_{RL,R′L′}(iτ) D^{∗k,G′}_{RL,R′L′}(−iτ)

where

    D^{k,G}_{RL,R′L′}(iτ) = (1/√Ω) ∫_Ω dr e^{−i(k+G)r} Σ_{R″L″} θ_{RL,R″L″}(r) G_{R″L″,R′L′}(iτ)

and

    θ_{RL,R′L′}(r) = Ψ_{RL}(r − R) Ψ^∗_{R′L′}(r − R′)
In the above equations, Ψ is the localized basis function, G is the orbital projected
Green function for electrons and holes, and D is an intermediate array computed using
a fast Fourier transform (FFT). The interpretation of the variables and their ranges are
given in Table I. The computation can be expressed in a canonical form as the following
multi-dimensional summation of the product of several arrays, where Ψ is written as a
two-dimensional array Y.

    Σ_{r,r1,RL,RL1,RL2,RL3} Y[r,RL] × Y[r,RL2] × Y[r1,RL3] × Y[r1,RL1]
                            × G[RL1,RL,t] × G[RL2,RL3,t]
                            × exp[k,r] × exp[G,r] × exp[k,r1] × exp[G1,r1]
When expressed in this form, as a multi-dimensional summation of the primary inputs to
the computation, no intermediate quantities such as D are explicitly computed. Although
it requires less memory than the previous form, it is not computationally attractive since
the number of arithmetic operations is enormous (multiplying the extents of the various
summation indices gives O(10²²) operations). The number of operations can be consid-
erably reduced by applying algebraic properties to factor out terms that are independent
of some of the summation indices, thereby computing some intermediate results that can
be stored and reused instead of being recomputed many times. There are many different
ways of applying algebraic laws of distributivity and associativity, resulting in different
alternative computational structures. Using simple examples, we briefly touch upon some
of the optimization issues that arise, and then proceed to discuss in detail the problem of
memory-optimal evaluation of expression trees.
The problem of minimizing the number of arithmetic operations by applying the alge-
braic laws of commutativity, associativity, and distributivity has been previously addressed
in [Lam et al. 1997a; 1997b]. Consider, for example, the multi-dimensional integral shown
    W[k] = Σ_{i,j,l} A[i,j] × B[j,k,l] × C[k,l]

    (a) A multi-dimensional integral

    f1[j]        = Σ_i A[i,j]
    f2[j,k,l]    = B[j,k,l] × C[k,l]
    f3[j,k]      = Σ_l f2[j,k,l]
    f4[j,k]      = f1[j] × f3[j,k]
    W[k] = f5[k] = Σ_j f4[j,k]

    (b) A formula sequence for computing (a)

                           f5 = Σ_j f4
                                |
                           f4 = f1 × f3
                          /            \
               f1 = Σ_i A[i,j]      f3 = Σ_l f2
                      |                  |
                   A[i,j]        f2 = B[j,k,l] × C[k,l]
                                     /           \
                                B[j,k,l]        C[k,l]

    (c) An expression tree representation of (b)

Fig. 1. An example multi-dimensional integral and two representations of a computation.
in Figure 1(a). If implemented directly as expressed (i.e., as a single set of perfectly-nested
loops), the computation would require 2 × N_i × N_j × N_k × N_l arithmetic operations
to compute. However, assuming that associative reordering of the operations and use of the
distributive law of multiplication over addition are acceptable for the floating-point com-
putations, the above computation can be rewritten in various ways. One equivalent form
that only requires 2 × N_j × N_k × N_l + 2 × N_j × N_k + N_i × N_j operations is given in
Figure 1(b). It expresses the sequence of steps in computing the multi-dimensional inte-
gral as a sequence of formulae. Each formula computes some intermediate result and the
last formula gives the final result. A sequence of formulae can also be represented as an
expression tree in which the leaf nodes are input arrays, the internal nodes are intermediate
results, and the root node is the final integral. For instance, Figure 1(c) shows the expres-
sion tree corresponding to the example formula sequence. The problem of finding an
operation-minimal form has been proved to be NP-complete, and a pruning search algorithm
for it was proposed in [Lam et al. 1997a; 1997b].
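As a concrete, purely illustrative check of the two forms in Figure 1, the NumPy sketch below (ours, not from the paper) evaluates the direct summation of Figure 1(a) and the factored sequence of Figure 1(b) on small, arbitrarily chosen extents and confirms that both produce the same W[k].

    # Sketch: direct vs. factored evaluation of the integral in Figure 1.
    # The extents Ni, Nj, Nk, Nl and the random inputs are illustrative only.
    import numpy as np

    Ni, Nj, Nk, Nl = 4, 5, 6, 7
    rng = np.random.default_rng(0)
    A = rng.random((Ni, Nj))
    B = rng.random((Nj, Nk, Nl))
    C = rng.random((Nk, Nl))

    # Direct form: a single summation over i, j, l (Figure 1(a)).
    W_direct = np.einsum('ij,jkl,kl->k', A, B, C)

    # Factored form: the formula sequence of Figure 1(b).
    f1 = A.sum(axis=0)                 # f1[j]     = sum_i A[i,j]
    f2 = B * C[np.newaxis, :, :]       # f2[j,k,l] = B[j,k,l] * C[k,l]
    f3 = f2.sum(axis=2)                # f3[j,k]   = sum_l f2[j,k,l]
    f4 = f1[:, np.newaxis] * f3        # f4[j,k]   = f1[j] * f3[j,k]
    W = f4.sum(axis=0)                 # W[k]      = sum_j f4[j,k]

    assert np.allclose(W, W_direct)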
Once the operation-minimal form is determined, the next step is to implement it as
some loop structure. A straightforward way is to generate a sequence of perfectly nested
loops, each evaluating an intermediate result. However, in practice, the input arrays and
the intermediate arrays are often so large that they cannot all fit into available memory.
The minimization of memory usage is thus desirable. By fusing loops it is possible to
                 I : 16
                /      \
          F : 15        H : 5
         /      \          |
     B : 3     E : 16    G : 25
       |          |
     A : 20     D : 9
                  |
                C : 30

Fig. 2. An example expression tree
reduce the dimensionality of some of the intermediate arrays [Lam et al. 1999; Lam 1999].
Furthermore, it is not necessary to keep all arrays allocated for the duration of the entire
computation. Space allocated for an intermediate array can be deallocated as soon as the
operation using the array has been performed. When allocating and deallocating arrays
dynamically, the evaluation order affects the maximum memory requirement. In this paper,
we focus on the problem of finding an evaluation order of the nodes in a given expression
tree that minimizes dynamic memory usage. A solution to this problem would allow the
automatic generation of efficient code for evaluating expression trees, e.g., for computing
the electronic properties of MX materials. We believe that the solution we develop here
may have applicability to other areas such as database query optimization and data mining.
As an example of the memory usage optimization problem, consider the expression
tree shown in Fig. 2. The size of each data object is shown alongside the correspond-
ing node label. Before evaluating a data object, space for it must be allocated. This
space can be deallocated only after the evaluation of its parent is complete. There are
many allowable evaluation orders of the nodes. One of them is the post-order traversal
⟨A, B, C, D, E, F, G, H, I⟩ of the expression tree. It has a maximum memory usage of
45 units. This occurs during the evaluation of H, when F, G, and H are in memory.
Other evaluation orders may use more memory or less memory. Finding the optimal order
⟨C, D, G, H, A, B, E, F, I⟩, which uses 39 units of memory, is not trivial.
A simpler problem related to the memory usage minimization problem is the register
allocation problem for binary expression trees in which the sizes of all nodes are unity. It
has been addressed in [Nakata 1967; Sethi and Ullman 1970] and can be solved in Ω(n)
time, where n is the number of nodes in the expression tree. But if the expression tree is
replaced by a directed acyclic graph (in which all nodes are still of unit size), the problem
becomes NP-complete [Sethi 1975]. The algorithm in [Sethi and Ullman 1970] for ex-
pression trees of unit-sized nodes does not extend directly to expression trees having nodes
of different sizes. Appel and Supowit [Appel and Supowit 1987] generalized the register
allocation problem to higher degree expression trees of arbitrarily-sized nodes. However,
they restricted their attention to solutions that evaluate subtrees contiguously, which could
result in non-optimal solutions (as we show through an example in Section 2). The gen-
eralized register allocation problem for expression trees was addressed in the context of
vector machines in [Rauber 1990a; 1990b].
The rest of this paper is organized as follows. In Section 2, we formally define the
memory usage optimization problem and make some observations about it. Section 3
presents an efficient algorithm for finding the optimal evaluation order for an expression
Node  himem          deallocate  lomem
A     0 + 20 = 20    -           20 − 0 = 20
B     20 + 3 = 23    A           23 − 20 = 3
C     3 + 30 = 33    -           33 − 0 = 33
D     33 + 9 = 42    C           42 − 30 = 12
E     12 + 16 = 28   D           28 − 9 = 19
F     19 + 15 = 34   B, E        34 − 3 − 16 = 15
G     15 + 25 = 40   -           40 − 0 = 40
H     40 + 5 = 45    G           45 − 25 = 20
I     20 + 16 = 36   F, H        36 − 15 − 5 = 16
max   45
Fig. 3. Memory usage of a post-order traversal of the expression tree in Fig. 2
tree. In Section 4, we prove that the algorithm finds a solution in Θ(n log² n) time for an
n-node expression tree. The correctness of the algorithm is proved in Section 5. Section 6
provides conclusions.
2. PROBLEM STATEMENT
We address the problem of optimizing the memory usage in the evaluation of a given ex-
pression tree whose nodes correspond to large data objects of various sizes. Each data
object depends on all its children (if any), and thus can be evaluated only after all its chil-
dren have been evaluated. We assume that the evaluation of each node in the expression
tree is atomic. Space for each data object is dynamically allocated/deallocated in its en-
tirety. Leaf node objects are created or read in as needed; internal node objects must be
allocated before their evaluation begins; and each object must remain in memory until the
evaluation of its parent is completed. The goal is to find an evaluation order of the nodes
that uses the least amount of memory. Since an evaluation order is also a traversal of the
nodes, we will use these two terms interchangeably.
We define the problem formally as follows:
    Given a tree T and a size v.size for each node v ∈ T, find a computation of T
    that uses the least memory, i.e., an ordering P = ⟨v_1, v_2, ..., v_n⟩ of the nodes
    in T, where n is the number of nodes in T, such that

    (1) for all v_i, v_j, if v_i is the parent of v_j, then i > j; and
    (2) max_{v_i ∈ P} {himem(v_i, P)} is minimized, where

        himem(v_i, P) = lomem(v_{i−1}, P) + v_i.size

        lomem(v_i, P) = himem(v_i, P) − Σ_{child v_j of v_i} v_j.size    if i > 0
        lomem(v_i, P) = 0                                                if i = 0
Here, himem(v_i, P) is the memory usage during the evaluation of v_i in the traversal P,
and lomem(v_i, P) is the memory usage upon completion of the same evaluation. These
definitions reflect that we need to allocate space for v_i before its evaluation, and that after
evaluation of v_i, the space allocated to all its children may be released. For instance,
consider the post-order traversal P = ⟨A, B, C, D, E, F, G, H, I⟩ of the expression tree
shown in Fig. 2. During and after the evaluation of A, A is in memory. So, himem(A, P) =
lomem(A, P) = A.size = 20. For evaluating B, we need to allocate space for B, thus
(a) A better traversal          (b) The optimal traversal

Node  himem  lomem              Node  himem  lomem
G     25     25                 C     30     30
H     30     5                  D     39     9
C     35     35                 G     34     34
D     44     14                 H     39     14
E     30     21                 A     34     34
A     41     41                 B     37     17
B     44     24                 E     33     24
F     39     20                 F     39     20
I     36     16                 I     36     16
max   44                        max   39
Fig. 4. Memory usage of two different traversals of the expression tree in Fig. 2
himem(B, P) = lomem(A, P) + B.size = 23. After B is obtained, A can be deallocated,
giving lomem(B, P) = himem(B, P) − A.size = 3. The memory usage for the rest of the
nodes is determined similarly and shown in Fig. 3.
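The definitions of himem and lomem translate directly into a few lines of code. The sketch below (ours, not from the paper; the function and variable names are hypothetical) recomputes the figures quoted in this section for the tree of Fig. 2.

    # Sketch: peak memory of a given traversal of the Fig. 2 tree,
    # following the definitions of himem and lomem above.
    size = {'A': 20, 'B': 3, 'C': 30, 'D': 9, 'E': 16,
            'F': 15, 'G': 25, 'H': 5, 'I': 16}
    parent = {'A': 'B', 'B': 'F', 'C': 'D', 'D': 'E', 'E': 'F',
              'F': 'I', 'G': 'H', 'H': 'I', 'I': None}

    def peak_memory(traversal):
        """Return max_i himem(v_i, P) for the evaluation order `traversal`."""
        children = {}
        for c, p in parent.items():
            children.setdefault(p, []).append(c)
        lomem, peak = 0, 0
        for v in traversal:
            himem = lomem + size[v]                  # allocate v before evaluating it
            peak = max(peak, himem)
            freed = sum(size[c] for c in children.get(v, []))
            lomem = himem - freed                    # children of v may now be released
        return peak

    print(peak_memory(list("ABCDEFGHI")))    # post-order traversal: 45 (Fig. 3)
    print(peak_memory(list("GHCDEABFI")))    # best contiguous-subtree order: 44 (Fig. 4(a))
    print(peak_memory(list("CDGHABEFI")))    # optimal order: 39 (Fig. 4(b))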
The post-order traversal of the given expression tree, however, is not optimal in memory
usage. In this example, none of the traversals that visit all nodes in one subtree before visit-
ing another subtree is optimal. There are four such traversals: ⟨A, B, C, D, E, F, G, H, I⟩,
⟨C, D, E, A, B, F, G, H, I⟩, ⟨G, H, A, B, C, D, E, F, I⟩, and ⟨G, H, C, D, E, A, B, F, I⟩.
If we follow the traditional wisdom of visiting the subtree with the higher memory usage
first, as in the Sethi-Ullman algorithm [Sethi and Ullman 1970], we obtain the best of these
four traversals, which is ⟨G, H, C, D, E, A, B, F, I⟩. Its overall memory usage is 44 units,
as shown in Fig. 4(a), and is not optimal. The optimal traversal, which uses only 39 units
of memory, is ⟨C, D, G, H, A, B, E, F, I⟩ (see Fig. 4(b)). Notice that it ‘jumps’ back and
forth between the subtrees. Therefore, any algorithm that only considers traversals that
visit subtrees contiguously may not produce an optimal solution.
The memory usage optimization problem has an interesting property: an expression tree
or a subtree may have more than one optimal traversal. For example, for the subtree rooted
at F, the traversals ⟨C, D, E, A, B, F⟩ and ⟨C, D, A, B, E, F⟩ both use the least memory
space of 39 units. One might attempt to take two optimal subtree traversals, one from each
child of a node X, merge them together optimally, and then append X to form a traversal
for X. But, this resulting traversal may not be optimal for X. Continuing with the above
example, if we merge together ⟨C, D, E, A, B, F⟩ and ⟨G, H⟩ (which are optimal for the
subtrees rooted at F and H, respectively) and then append I, the best we can get is a sub-
optimal traversal ⟨G, H, C, D, E, A, B, F, I⟩ that uses 44 units of memory (see Fig. 4(a)).
However, the other optimal traversal ⟨C, D, A, B, E, F⟩ for the subtree rooted at F can be
merged with ⟨G, H⟩ to form ⟨C, D, G, H, A, B, E, F, I⟩ (with I appended), which is an
optimal traversal of the entire expression tree. Thus, locally optimal traversals may not be
globally optimal. In the next section, we present an efficient algorithm that finds traversals
which are not only locally optimal but also globally optimal.
3. AN EFFICIENT ALGORITHM
We now present an efficient divide-and-conquer algorithm that, given an expression tree
whose nodes are large data objects, finds an evaluation order of the tree that minimizes
the memory usage. For each node in the expression tree, it computes an optimal traversal
for the subtree rooted at that node. The optimal subtree traversal that it computes has a
special property: it is not only locally optimal for the subtree, but also globally optimal in
the sense that it can be merged together with globally optimal traversals for other subtrees
to form an optimal traversal for a larger tree that is also globally optimal. As we have seen
in Section 2, not all locally optimal traversals for a subtree can be used to form an optimal
traversal for a larger tree.
The algorithm stores a traversal not as an ordered list of nodes, but as an ordered list
of indivisible units or elements. Each element contains an ordered list of nodes with the
property that there necessarily exists some globally optimal traversal of the entire tree
wherein this sequence appears undivided. Therefore, as we show later, inserting any node
in between the nodes of an element does not lower the total memory usage. An element
initially contains a single node. But as the algorithm goes up the tree merging traversals
together and appending new nodes to them, elements may be appended together to form
new elements containing a larger number of nodes. Moreover, the order of indivisible
units in a traversal stays invariant, i.e., the indivisible units must appear in the same order
in some optimal traversal of the entire expression tree. This means that indivisible units
can be treated as a whole and we only need to consider the relative order of indivisible
units from different subtrees.
Each element (or indivisible unit) in a traversal is a (nodelist, hi, lo) triple, where nodelist
is an ordered list of nodes, hi is the maximum memory usage during the evaluation of the
nodes in nodelist, and lo is the memory usage after those nodes are evaluated. Using the
terminology from Section 2, hi is the highest himem among the nodes in nodelist, and lo
is the lomem of the last node in nodelist. The algorithm always maintains the elements of
a traversal in decreasing hi and increasing lo order, which implies an ordering by decreasing
hi−lo difference. In Section 5, we prove that arranging the indivisible units in this order
minimizes memory usage.
We can also visualize the elements in a traversal as follows. The first element in a
traversal contains the nodes from the beginning of the traversal up to and including the
node that has the lowest lomem after the node that has the highest himem. The highest
himem and the lowest lomem become the hi and the lo of the first element, respectively.
The same rules are recursively applied to the remainder of the traversal to form the second
element, and so on.
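This decomposition rule is easy to state operationally. The sketch below (ours; the function name and tuple encoding are hypothetical) splits a traversal, given its per-node himem/lomem values, into (nodelist, hi, lo) elements exactly as just described, and reproduces the two elements (CD, 39, 9) and (ABEF, 34, 15) that the algorithm computes for the subtree rooted at F (see Fig. 5 below).

    # Sketch: decompose a traversal into indivisible units, given (name, himem, lomem)
    # per node in evaluation order.
    def split_into_elements(nodes):
        elements, rest = [], list(nodes)
        while rest:
            # position of the last node with the highest himem in the remainder
            hi_idx = max(range(len(rest)), key=lambda i: (rest[i][1], i))
            # position of the last node, at or after hi_idx, with the lowest lomem
            lo_idx = min(range(hi_idx, len(rest)), key=lambda i: (rest[i][2], -i))
            chunk, rest = rest[:lo_idx + 1], rest[lo_idx + 1:]
            elements.append(("".join(name for name, _, _ in chunk),   # nodelist
                             max(h for _, h, _ in chunk),             # hi of the element
                             chunk[-1][2]))                           # lo of the element
        return elements

    # himem/lomem of the traversal <C,D,A,B,E,F> of the subtree rooted at F,
    # computed from the definitions of Section 2 (e.g. with the earlier sketch).
    trav = [('C', 30, 30), ('D', 39, 9), ('A', 29, 29),
            ('B', 32, 12), ('E', 28, 19), ('F', 34, 15)]
    print(split_into_elements(trav))    # [('CD', 39, 9), ('ABEF', 34, 15)]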
Before formally describing the algorithm, we illustrate how it works with an example.
Consider the expression tree shown in Fig. 2. We visit the nodes in a bottom-up order.
Since A has no children, the optimal traversal for the subtree rooted at A, denoted by
A.seq, is ⟨(A, 20, 20)⟩, meaning that 20 units of memory are needed during the evaluation
of A and immediately afterwards. To form B.seq, we take A.seq, append a new element
(B, 3 + 20, 3) (the hi of which is adjusted by the lo of the preceding element), and get
⟨(A, 20, 20), (B, 23, 3)⟩. Whenever two adjacent elements are not in decreasing hi and
increasing lo order, we combine them into one element by concatenating the nodelists
and taking the highest hi and the lo of the second element. Thus, B.seq becomes ⟨(AB, 23, 3)⟩. (For
clarity, we write nodelists in a sequence as strings). Here, A and B form an indivisible
Node v   Subtree traversals merged in              Optimal traversal v.seq
         decreasing hi−lo order
A        (A, 20, 20)                               (A, 20, 20)
B        (A, 20, 20), (B, 23, 3)                   (AB, 23, 3)
C        (C, 30, 30)                               (C, 30, 30)
D        (C, 30, 30), (D, 39, 9)                   (CD, 39, 9)
E        (CD, 39, 9), (E, 25, 16)                  (CD, 39, 9), (E, 25, 16)
F        (CD, 39, 9), (AB, 32, 12),                (CD, 39, 9), (ABEF, 34, 15)
         (E, 28, 19), (F, 34, 15)
G        (G, 25, 25)                               (G, 25, 25)
H        (G, 25, 25), (H, 30, 5)                   (GH, 30, 5)
I        (CD, 39, 9), (GH, 39, 14),                (CDGHABEFI, 39, 16)
         (ABEF, 39, 20), (I, 36, 16)
Fig. 5. Optimal traversals for the subtrees in the expression tree in Fig. 2
unit, implying that B must follow A in some optimal traversal of the entire expression
tree. The maximum memory usage during the evaluation of this indivisible unit is 23
and the result of the evaluation occupies 3 units of memory. Similarly, we get E.seq =
⟨(CD, 39, 9), (E, 25, 16)⟩. Note that these two adjacent elements cannot be combined
because they are already in decreasing hi and increasing lo order. For node F, which
has two children B and E, we merge B.seq and E.seq by the order of decreasing hi-lo
difference. The merged elements are in the order (CD, 39, 9), (AB, 23 + 9, 3 + 9), and
finally (E, 25 + 3, 16 + 3) with the hi and lo values adjusted by the lo value of the last
merged element from the other subtree. They are the three elements in F.seq after the
merge as no elements are combined so far. Then, we append to F.seq the new element
(F, 15 + 19, 15) for the root of the subtree. The new element is combined with the last two
elements in F.seq to ensure that the elements are in the order of decreasing hi−lo difference.
Hence, F.seq becomes ⟨(CD, 39, 9), (ABEF, 34, 15)⟩, a sequence of only two indivisible units.
The optimal traversals for the other nodes are computed in the same way and are shown in
Fig. 5. At the end, the algorithm returns the optimal traversal ⟨C, D, G, H, A, B, E, F, I⟩
for the entire expression tree (see Fig. 4(b)).
Fig. 6 shows the algorithm. The input to the algorithm (the MinMemTraversal proce-
dure) is an expression tree T, in which each node v has a field v.size denoting the size of its
data object. The procedure performs a bottom-up traversal of the tree and, for each node v,
computes an optimal traversal v.seq for the subtree rooted at v. The optimal traversal v.seq
is obtained by optimally merging together the optimal traversals u.seq from each child u
of v, and then appending v. At the end, the procedure returns a concatenation of all the
nodelists in T.root.seq as the optimal traversal for the given expression tree. The memory
usage of the optimal traversal is T.root.seq[1].hi.
The MergeSeq procedure optimally merges two given traversals S1 and S2 and returns
the merged result S. S1 and S2 are subtree traversals of two children nodes of the same
parent. The optimal merge is performed in a fashion similar to merge-sort. Elements
from S1 and S2 are scanned sequentially and appended into S in the order of decreasing
hi-lo difference. This order guarantees that the indivisible units are arranged to minimize
memory usage. Since S1 and S2 are formed independently, the hi-lo values in the elements
from S1 and S2 must be adjusted before they can be appended to S. The amount of
adjustment for an element from S1 (S2) equals the lo value of the last merged element
Variable  Description                                 Range
RL        Orbital                                     10³
r         Discretized points in real space            10⁵
τ         Time step                                   10²
k         K-point in an irreducible Brillouin zone    10
G         Reciprocal lattice vector                   10³

Table I. Variables in an example physics computation.
MinMemTraversal(T):
    foreach node v in some bottom-up traversal of T
        v.seq = ⟨⟩
        foreach child u of v
            v.seq = MergeSeq(v.seq, u.seq)
        if |v.seq| > 0 then                // |x| is the length of x
            base = v.seq[|v.seq|].lo
        else
            base = 0
        AppendSeq(v.seq, ⟨v⟩, v.size + base, v.size)
    nodelist = ⟨⟩
    for i = 1 to |T.root.seq|
        nodelist = nodelist + T.root.seq[i].nodelist   // + is concatenation
    return nodelist                        // memory usage is T.root.seq[1].hi

MergeSeq(S1, S2):
    S = ⟨⟩
    i = j = 1
    base1 = base2 = 0
    while i ≤ |S1| or j ≤ |S2|
        if j > |S2| or (i ≤ |S1| and S1[i].hi − S1[i].lo > S2[j].hi − S2[j].lo) then
            AppendSeq(S, S1[i].nodelist, S1[i].hi + base1, S1[i].lo + base1)
            base2 = S1[i].lo
            i++
        else
            AppendSeq(S, S2[j].nodelist, S2[j].hi + base2, S2[j].lo + base2)
            base1 = S2[j].lo
            j++
    end while
    return S

AppendSeq(S, nodelist, hi, lo):
    E = (nodelist, hi, lo)                 // new element to append to S
    i = |S|
    while i ≥ 1 and (E.hi ≥ S[i].hi or E.lo ≤ S[i].lo)
        // combine S[i] with E
        E = (S[i].nodelist + E.nodelist, max(S[i].hi, E.hi), E.lo)
        remove S[i] from S
        i--
    end while
    S = S + E                              // |S| is now i + 1

Fig. 6. Procedure for finding a memory-optimal traversal of an expression tree
from S2 (S1), which is kept in variable base1 (base2).
The AppendSeq procedure appends the new element specified by the triple (nodelist,
hi, lo) to the given traversal S. Before the new element E is appended to S, it is combined
with elements at the end of S whose hi is not higher than E.hi or whose lo is not lower
than E.lo. The combined element has the concatenated nodelist and the highest hi but the
original E.lo.
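To make the procedures of Fig. 6 concrete, the following is a small Python sketch of MinMemTraversal, MergeSeq, and AppendSeq (our own transcription, not code from the paper; the node encoding and function names are ours). Sequences are plain Python lists of (nodelist, hi, lo) tuples, so this sketch does not achieve the Θ(n log² n) bound analyzed in Section 4, which relies on the red-black-tree representation, but it produces the same traversals.

    def append_seq(seq, nodelist, hi, lo):
        """AppendSeq: append (nodelist, hi, lo), combining out-of-order tail elements."""
        while seq and (hi >= seq[-1][1] or lo <= seq[-1][2]):
            prev_nodes, prev_hi, _ = seq.pop()       # combine S[i] with the new element E
            nodelist, hi = prev_nodes + nodelist, max(prev_hi, hi)
        seq.append((nodelist, hi, lo))

    def merge_seq(s1, s2):
        """MergeSeq: merge two child sequences in order of decreasing hi - lo difference."""
        out, i, j, base1, base2 = [], 0, 0, 0, 0
        while i < len(s1) or j < len(s2):
            take_s1 = j >= len(s2) or (i < len(s1) and
                                       s1[i][1] - s1[i][2] > s2[j][1] - s2[j][2])
            if take_s1:
                append_seq(out, s1[i][0], s1[i][1] + base1, s1[i][2] + base1)
                base2 = s1[i][2]                     # lo of last merged element from s1
                i += 1
            else:
                append_seq(out, s2[j][0], s2[j][1] + base2, s2[j][2] + base2)
                base1 = s2[j][2]                     # lo of last merged element from s2
                j += 1
        return out

    def optimal_seq(v):
        """Compute v.seq, the optimal traversal (as elements) of the subtree rooted at v."""
        seq = []
        for child in v['children']:
            seq = merge_seq(seq, optimal_seq(child))
        base = seq[-1][2] if seq else 0
        append_seq(seq, [v['name']], v['size'] + base, v['size'])
        return seq

    def min_mem_traversal(tree):
        """MinMemTraversal: return (evaluation order, peak memory) for the whole tree."""
        seq = optimal_seq(tree)
        order = [name for nodelist, _, _ in seq for name in nodelist]
        return order, seq[0][1]                      # peak memory is the hi of the first element

    def node(name, size, *children):
        return {'name': name, 'size': size, 'children': list(children)}

    # The expression tree of Fig. 2.
    tree = node('I', 16,
                node('F', 15,
                     node('B', 3, node('A', 20)),
                     node('E', 16, node('D', 9, node('C', 30)))),
                node('H', 5, node('G', 25)))

    print(min_mem_traversal(tree))
    # (['C', 'D', 'G', 'H', 'A', 'B', 'E', 'F', 'I'], 39), matching Fig. 4(b) and Fig. 5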
This algorithm has the property that the traversal it finds for a subtree T′ is not only
optimal for T′ but must also appear as a subsequence in some optimal traversal for any
larger tree that contains T′ as a subtree. For example, E.seq is a subsequence in F.seq,
which is in turn a subsequence in I.seq (see Fig. 5).
4. COMPLEXITY OF THE ALGORITHM
The time complexity of our algorithm is Θ(n log² n) for an n-node expression tree. We
represent a sequence of indivisible units as a red-black tree with the indivisible units at the
leaves. The tree is sorted by decreasing hi-lo difference. In addition, the leaves are linked
in sorted order in a doubly-linked list, and a count of the number of indivisible units in the
sequence is maintained.
The cost of constructing the final evaluation order consists of the cost for building se-
quences of indivisible units and the cost for combining indivisible units into larger indi-
visible units. For finding an upper bound on the cost of the algorithm, the worst-case cost
for building sequences can be analyzed separately from the worst-case cost for combining
indivisible units.
The sequence for a leaf node of the expression tree can be constructed in constant time.
For a unary interior node, we simply append the node to the sequence of its subtree, which
costs O(log n) time. For an m-ary interior node, we merge the sequences of the subtrees
by inserting the nodes from the smaller sequences into the largest sequence. Inserting a
node into a sequence represented as a red-black tree costs O(log n) time. Since we always
insert the nodes of the smaller sequences into the largest one, every time a given node of the
expression tree gets inserted into a sequence the size of the sequence containing this node
at least doubles. Each node, therefore, can be inserted into a sequence at most O(log n)
times, with each insertion costing O(log n) time. The cost for building the traversal for the
entire expression tree is, therefore, O(n log² n).
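The small-into-large merging step can be pictured with an ordinary sorted list. The sketch below (ours) only illustrates how the elements of the smaller sequence are placed into the larger one by hi − lo key; it omits the hi/lo adjustments and element combining that MergeSeq performs. A Python list gives O(n) insertion, so the O(log n) per-insert bound quoted above assumes a balanced search tree such as the red-black tree described earlier.

    import bisect

    def merge_smaller_into_larger(s1, s2):
        """s1, s2: lists of (nodelist, hi, lo), each sorted by decreasing hi - lo."""
        small, large = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
        merged = list(large)
        keys = [lo - hi for _, hi, lo in merged]     # ascending keys = descending hi - lo
        for element in small:
            _, hi, lo = element
            pos = bisect.bisect_right(keys, lo - hi)
            merged.insert(pos, element)
            keys.insert(pos, lo - hi)
        return merged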
Two individual indivisible units can be combined in constant time. When combining
two adjacent indivisible units within a sequence, one of them must be deleted from the
sequence and the red-black tree must be re-balanced, which costs O(log n) time. Since
there can be at most n − 1 of these combine operations, the total cost is O(nlog n). The
cost of the whole algorithm is, therefore, dominated by the cost for building sequences,
which is O(n log² n).
Combining indivisible units into larger ones reduces the number of elements in the se-
quences and, therefore, the time required for merging and combining sequences. In the
best case, indivisible units are always combined such that each sequence contains a single
element. In this case, the algorithm only takes linear time.
In the worst case, a degenerate expression tree consists of small nodes and large nodes
such that every small node has as its only child a large node. A pair of such nodes will
form an indivisible unit with the hi being the size of the large node and the lo being the
size of the small node. Such a tree can be constructed such that these indivisible units
will not further combine into larger indivisible units. In this case, the algorithm will result
in a sequence of n/2 indivisible units containing two nodes each. If such a degenerate
expression tree is also unbalanced, the algorithm requires Ω(n log² n) time for computing
the optimal traversal.
5. CORRECTNESS OF THE ALGORITHM
We now show the correctness of the algorithm. The proof proceeds as follows. Lemma 1
characterizes the indivisible units in a subtree traversal by providing some invariants.
Lemma 2 shows that, once formed, each indivisible unit can be considered as a whole.
Lemma 3 deals with the optimal ordering of indivisible units from different subtrees. Fi-
nally, using the three lemmas we prove the correctness of the algorithm by arguing that
any traversal of an expression tree can be transformed in a series of steps into the optimal
traversal found by the algorithm without increasing memory usage.
The first lemma establishes some important invariants about the indivisible units in an
ordered list v.seq that represents a traversal. The hi value of an indivisible unit is the
highest himem of the nodes in the indivisible unit. The lo value of an indivisible unit is the
lomem of the last node in the indivisible unit. Given a sequence of nodes, an indivisible
unit extends to the last node with the lowest lomem following the last node with the highest
himem. In addition, the indivisible units in a sequence are in decreasing hi and increasing
lo order.
LEMMA 5.1. Let v be any node in an expression tree, S = v.seq, and P be the traversal
represented by S of the subtree rooted at v, i.e., P = S[1].nodelist + ··· + S[|S|].nodelist.
The algorithm maintains the following invariants:

    For all 1 ≤ i ≤ |S|, let S[i].nodelist = ⟨v_1, v_2, ..., v_n⟩ and v_m be the last
    node in S[i].nodelist that has the maximum himem value, i.e., for all k < m,
    himem(v_k, P) ≤ himem(v_m, P) and for all k > m, himem(v_k, P) < himem(v_m, P).
    Then, we have,

    (1) S[i].hi = himem(v_m, P),
    (2) S[i].lo = lomem(v_n, P),
    (3) for all m ≤ k ≤ n, lomem(v_k, P) ≥ lomem(v_n, P),
    (4) for all 1 ≤ j < i,
        (a) for all 1 ≤ k ≤ n, S[j].hi > himem(v_k, P),
        (b) for all 1 ≤ k ≤ n, S[j].lo < lomem(v_k, P),
        (c) S[j].hi > S[i].hi, and
        (d) S[j].lo < S[i].lo.

Proof
The above invariants are true by construction. □
The second lemma asserts the ‘indivisibility’ of an indivisible unit by showing that un-
related nodes inserted in between the nodes of an indivisible unit can always be moved to
the beginning or the end of the indivisible unit without increasing memory usage. Thus,
once an indivisible unit is formed, we do not need to consider breaking it up later. This
lemma allows us to treat each traversal as a sequence of indivisible units (each containing
one or more nodes) instead of a list of the individual nodes.
LEMMA 5.2. Let v be a node in an expression tree T, S = v.seq, and P be a traver-
sal of T in which the nodes from S[i].nodelist appear in the same order as they are in
S[i].nodelist, but not contiguously. Then, any nodes that are in between the nodes in
S[i].nodelist can always be moved to the beginning or the end of S[i].nodelist without in-
creasing memory usage, provided that none of the nodes that are in between the nodes in
S[i].nodelist are ancestors or descendants of any nodes in S[i].nodelist.
Proof
Let S[i].nodelist = ⟨v_1, ..., v_n⟩, v_0 be the node before v_1 in S, and v_m be the node
in S[i].nodelist such that for all k < m, himem(v_k, P) ≤ himem(v_m, P) and for all
k > m, himem(v_m, S) > himem(v_k, S). Let v′_1, ..., v′_b be the ‘foreign’ nodes, i.e.,
the nodes that are in between the nodes in S[i].nodelist in P, with v′_1, ..., v′_a (not neces-
sarily contiguously) before v_m and v′_{a+1}, ..., v′_b (not necessarily contiguously) after v_m
in P. Let Q be the traversal obtained from P by removing the nodes in S[i].nodelist.
We construct another traversal P′ of T from P by moving v′_1, ..., v′_a to the beginning
of S[i].nodelist and v′_{a+1}, ..., v′_b to the end of S[i].nodelist. In other words, we replace
⟨v_1, ..., v′_1, ..., v′_a, ..., v_m, ..., v′_{a+1}, ..., v′_b, ..., v_n⟩ in P with ⟨v′_1, ..., v′_a, v_1, ..., v_m,
..., v_n, v′_{a+1}, ..., v′_b⟩ to form P′.
Traversals P and P′ differ in memory usage only at the nodes {v_1, ..., v_n, v′_1, ..., v′_b}.
P′ does not use more memory than P because:

(1) The memory usage for P′ at v_m is the same as the memory usage for P at v_m since
    himem(v_m, P′) = himem(v_m, S) + lomem(v′_a, Q) = himem(v_m, P).
(2) For all 1 ≤ k ≤ n, the memory usage for P′ at v_k is no higher than the memory usage
    for P at v_k, since himem(v_k, S) ≤ himem(v_m, S) implies that himem(v_k, P′) =
    himem(v_k, S) + lomem(v′_a, Q) ≤ himem(v_m, S) + lomem(v′_a, Q) = himem(v_m, P).
(3) For all 1 ≤ j ≤ a, the memory usage for P′ at v′_j is no higher than the memory
    usage for P at v′_j, since for all 1 ≤ k ≤ m, lomem(v_0, S) < lomem(v_k, S) (by in-
    variant 4(b) in Lemma 1) implies himem(v′_j, P′) = himem(v′_j, Q) + lomem(v_0, S) ≤
    himem(v′_j, P).
(4) For all a < j ≤ b, the memory usage for P′ at v′_j is no higher than the memory usage
    for P at v′_j, since for all m ≤ k ≤ n, lomem(v_k, S) ≥ lomem(v_n, S) (by invariant 3 in
    Lemma 1) implies himem(v′_j, P′) = himem(v′_j, Q) + lomem(v_n, S) ≤ himem(v′_j, P).

Since the memory usage of any node in v_1, ..., v_n after moving the foreign nodes cannot
exceed that of v_m, which remains unchanged, and the memory usage of the foreign nodes
does not increase as a result of moving them, the overall maximum memory usage cannot
increase. □
The next lemma deals with the ordering of indivisible units. It shows that arranging in-
divisible units from different subtrees in the order of decreasing hi-lo difference minimizes
memory usage, since two indivisible units that are not in that order can be interchanged in
the merged traversal without increasing memory usage.
LEMMA 5.3. Let v and v′ be two nodes in an expression tree that are siblings of each
other, S = v.seq, and S′ = v′.seq. Then, among all possible merges of S and S′, the
merge that arranges the elements from S and S′ in the order of decreasing hi−lo difference
uses the least memory.
Proof
Let M be a merge of S and S′ that is not in the order of decreasing hi−lo difference.
Then there exists an adjacent pair of elements, one from each of S and S′, that are not in
that order. Without loss of generality, we assume the first element is S′[j] from S′ and
(a) Sequence M
    element   himem             lomem
    ...
    S′[j]     H′_j + L_{i−1}    L′_j + L_{i−1}
    S[i]      H_i + L′_j        L_i + L′_j
    ...

(b) Sequence M′
    element   himem             lomem
    ...
    S[i]      H_i + L′_{j−1}    L_i + L′_{j−1}
    S′[j]     H′_j + L_i        L′_j + L_i
    ...

Fig. 7. Memory usage comparison of two traversals in Lemma 5.3
the second one is S[i] from S. Consider the merge M′ obtained from M by interchanging
S′[j] and S[i]. To simplify the notation, let H_r = S[r].hi, L_r = S[r].lo, H′_r = S′[r].hi,
and L′_r = S′[r].lo. The memory usage of M and M′ differs only at S′[j] and S[i] and is
compared in Fig. 7.
The memory usage of M at the two elements is max(H′_j + L_{i−1}, H_i + L′_j) while the
memory usage of M′ at the same two elements is max(H′_j + L_i, H_i + L′_{j−1}). Since the
two elements are out of order, the hi−lo difference of S′[j] must be less than that of S[i],
i.e., H′_j − L′_j < H_i − L_i. This implies H_i + L′_j > H′_j + L_i. Invariant 4 in Lemma 1 gives
us L′_j > L′_{j−1}, which implies H_i + L′_j > H_i + L′_{j−1}. Thus, max(H′_j + L_{i−1}, H_i + L′_j) ≥
H_i + L′_j > max(H′_j + L_i, H_i + L′_{j−1}). Therefore, M′ cannot use more memory than M.
By switching all adjacent pairs in M that are out of order until no such pair exists, we get
an optimal order without increasing memory usage. □
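The inequality chain above can be spot-checked numerically. The short sketch below (ours) draws random element values satisfying the lemma's preconditions and verifies that the swap never increases the memory usage at the two positions.

    # Sketch: numeric check of the exchange argument in Lemma 5.3.
    import random

    random.seed(1)
    for _ in range(10000):
        Hp, Lp = sorted(random.sample(range(100), 2), reverse=True)   # S'[j]: H'_j > L'_j
        Hi, Li = sorted(random.sample(range(100), 2), reverse=True)   # S[i]:  H_i  > L_i
        L_prev = random.randint(0, 100)          # L_{i-1}, lo preceding the pair in M
        Lp_prev = random.randint(0, Lp)          # L'_{j-1} <= L'_j (invariant 4 gives <)
        if Hp - Lp < Hi - Li:                    # the adjacent pair is out of order
            usage_M  = max(Hp + L_prev, Hi + Lp)
            usage_M2 = max(Hi + Lp_prev, Hp + Li)
            assert usage_M2 <= usage_M           # swapping cannot increase memory usage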
THEOREM 5.4. Given an expression tree, the algorithm presented in Section 3 com-
putes a traversal that uses the least memory.
Proof
We prove the correctness of the algorithm by describing a procedure that transforms any
given traversal to the traversal found by the algorithm without increase in memory usage
in any transformation step. Given a traversal P for an expression tree T, we visit the nodes
in T in a bottom-up manner and, for each non-leaf node v in T, we perform the following
steps:
(1) Let T′ be the subtree rooted at v and P′ be the minimal substring of P that contains
    all the nodes from T′ − {v}. In the following steps, we will rearrange the nodes in
    P′ such that the nodes that form an indivisible unit in v.seq are contiguous and the
    indivisible units are in the same order as they are in v.seq.
(2) First, we sort the components of the indivisible units in v.seq so that they are in the
    same order as in v.seq. The sorting process involves rearranging two kinds of units.
    The first kind of units are the indivisible units in u.seq for each child u of v. The
    second kind of units are the contiguous sequences of nodes in P′ which are from
    T − T′. For this sorting step, we temporarily treat each such maximal contiguous
    sequence of nodes as a unit. For each unit E of the second kind, we take E.hi =
    max_{w ∈ E} himem(w, P) and E.lo = lomem(w_n, P), where w_n is the last node in E.
    The sorting process is as follows.
        While there exist two adjacent units E′ and E in P′ such that E′ is before
        E and E′.hi − E′.lo < E.hi − E.lo,
(a) Swap E′ and E. By Lemma 3, this does not increase the memory usage.
(b) If two units of the second kind become adjacent to each other as a
result of the swapping, combine the two units into one and recompute
its new hi and lo.
When the above sorting process finishes, all units of the first kind, which are compo-
nents of the indivisible units in v.seq, are in the order of decreasing hi-lo difference.
Since, for each child u of v, indivisible units in u.seq have been in the correct order
before the sorting process, their relative order is not changed. The order of the nodes
from T − T′ is preserved because the sorting process never swaps two units of the
second kind. Also, v and its ancestors do not appear in P′, and nodes in units of the
first kind are not ancestors or descendants of any nodes in units of the second kind.
Therefore, the sorting process does not violate parent-child dependencies.
(3) Now that the components of the indivisible units in v.seq are in the correct order, we
make the indivisible units contiguous using the following combining process.
For each indivisible unit E in v.seq,
(a) In the traversal P, if there are nodes from T − T′ in between the
nodes from E, move them either to the beginning or the end of E as
specified by Lemma 2.
(b) Make the contiguous sequence of nodes from E an indivisible unit.
Upon completion, each indivisible unit in v.seq is contiguous in P and the order in
P of the indivisible units is the same as they are in v.seq. According to Lemma 2,
moving ‘foreign’ nodes out of an indivisible unit does not increase the memory usage.
Also, the order of the nodes from T − T′ is preserved. Hence, the combining process
does not violate parent-child dependencies.
We use induction to show that the above procedure correctly transforms any given traver-
sal P into an optimal traversal found by the algorithm. The induction hypothesis H(u) for
each node u is that:
—the nodes in each indivisible unit in u.seq appear contiguously in P and are in the same
order as they are in u.seq, and
—the order in P of the indivisible units in u.seq is the same as they are in u.seq.
Initially, H(u) is true for every leaf node u because there is only one traversal order for
a leaf node. As the induction step, assume H(u) is true for each child u of a node v.
The procedure rearranges the nodes in P′ such that the nodes that form an indivisible unit
in v.seq are contiguous in P, the sets of nodes corresponding to the indivisible units are
in the same order in P as they are in v.seq, and the order among the nodes that are not
in the subtree rooted at v is preserved. Thus, when the procedure finishes processing a
node v, H(v) becomes true. By induction, H(T.root) is true and a traversal found by the
algorithm is obtained. Since any traversal P can be transformed into a traversal found by
the algorithm without increasing memory usage in any transformation step, no traversal
can use less memory and the algorithm is correct. □
6. CONCLUSION
In this paper, we have considered the memory usage optimization problem in the evalua-
tion of expression trees involving large objects of different sizes. This problem arose in
the context of optimizing electronic structure calculations. The developed solution would
apply in any context involving the evaluation of an expression tree, in which intermediate
results are so large that it is impossible to keep all of them in memory at the same time.
In such situations, it is necessary to dynamically allocate and deallocate space for the in-
termediate results and to find an evaluation order that uses the least memory. We have
developed an efficient algorithm that finds an optimal evaluation order in Θ(n log² n) time
for an expression tree containing n nodes and proved its correctness.
Acknowledgments
We would like to thank Tamal Dey for help with the complexity proof and for proofreading
the paper. We are grateful to the National Science Foundation for support of this work
through the Information Technology Research (ITR) program.
REFERENCES
APPEL, A. W. AND SUPOWIT, K. J. 1987. Generalizations of the Sethi-Ullman algorithm for register allocation.
Softw. Pract. Exper. 17, 6 (June), 417–421.
AULBUR, W. 1996. Parallel implementation of quasiparticle calculations of semiconductors and insulators. Ph.D.
thesis, The Ohio State University.
HYBERTSEN, M. S. AND LOUIE, S. G. 1986. Electronic correlation in semiconductors and insulators: Band
gaps and quasiparticle energies. Phys. Rev. B 34, 5390.
LAM, C.-C. 1999. Performance optimization of a class of loops implementing multi-dimensional integrals. Ph.D.
thesis, The Ohio State University. Also available as Technical Report No. OSU-CISRC-8/99-TR22, Dept. of
Computer and Information Science, The Ohio State University, August 1999.
LAM, C.-C., SADAYAPPAN, P., COCIORVA, D., ALOUANI, M., AND WILKINS, J. 1999. Performance optimiza-
tion of a class of loops involving sums of products of sparse arrays. In Ninth SIAM Conference on Parallel
Processing for Scientific Computing. Society for Industrial and Applied Mathematics, San Antonio, TX.
LAM, C.-C., SADAYAPPAN, P., AND WENGER, R. 1997a. On optimizing a class of multi-dimensional loops
with reductions for parallel execution. Parall. Process. Lett. 7, 2, 157–168.
LAM, C.-C., SADAYAPPAN, P., AND WENGER, R. 1997b. Optimization of a class of multi-dimensional integrals
on parallel machines. In Eighth SIAM Conference on Parallel Processing for Scientific Computing. Society
for Industrial and Applied Mathematics, Minneapolis, MN.
NAKATA, I. 1967. On compiling algorithms for arithmetic expressions. Commun. ACM 10, 492–494.
RAUBER, T. 1990a. Ein Compiler für Vektorrechner mit Optimaler Auswertung von vektoriellen Ausdrucksbäumen.
Ph.D. thesis, University Saarbrücken.
RAUBER, T. 1990b. Optimal Evaluation of Vector Expression Trees. In Proceedings of the 5th Jerusalem
Conference on Information Technology. IEEE Computer Society Press, Jerusalem, Israel, 467–473.
ROJAS, H. N., GODBY, R. W., AND NEEDS, R. J. 1995. Space-time method for Ab-initio calculations of
self-energies and dielectric response functions of solids. Phys. Rev. Lett. 74, 1827.
SETHI, R. 1975. “Complete register allocation problems”. SIAM J. Comput. 4, 3 (Sept.), 226–248.
SETHI, R. AND ULLMAN, J. D. 1970. The generation of optimal code for arithmetic expressions. J. ACM 17, 1
(Oct.), 715–728.
Received April 2002; revised TBD TBD; accepted TBD TBD
ACM Transactions on Programming Languages and Systems, Vol. TBD, No. TBD, TBD TBD.

RL. or I). or Pt) and halogen atoms (X = Cl. G is the orbital projected Green function for electrons and holes.R L k. The computation can be expressed in a canonical form as the following multi-dimensional summation of the product of several arrays. The problem of minimizing the number of arithmetic operations by applying the algebraic laws of commutativity. RL1] r. 1995].R L 1 (iτ ) = √ Ω dr e−i(k+G)r Ω R L θR L . and distributivity has been previously addressed in [Lam et al. RL2] × Y [r1.R L (−iτ ) where k. thereby computing some intermediate results that can be stored and reused instead of being recomputed many times. for example. Vol. RL3] × Y [r1. the multi-dimensional integral shown ACM Transactions on Programming Languages and Systems. r] × exp[G.G (iτ ) DR L . MX materials are linear chain compounds with alternating transition-metal atoms (M = Ni. associativity. r1] × exp[G1. as a multi-dimensional summation of the primary inputs to the computation. RL. The objective is to minimize the maximum memory usage during the evaluation of the entire expression tree. TBD TBD. and D is an intermediate array computed using a fast Fourier transform (FFT). χG. Y [r. and then proceed to discuss in detail the problem of memory-optimal evaluation of expression trees. Ψ is the localized basis function.R L (iτ ) and θR L . TBD. resulting in different alternative computational structures. in optimizing a class of loop calculations implementing multi-dimensional integrals of the products of several large input arrays to compute the electronic properties of semiconductors and metals [Hybertsen and Louie 1986.G DR L . Consider. The following multi-dimensional integral computes the susceptibility in momentum space for the determination of the self-energy of MX compounds in real space.G DR L . no intermediate quantities such as D are explicitly computed.RL3 ×G[RL1. .r1. for example. node data object has been evaluated. Using simple examples. it is not computationally attractive since the number of arithmetic operations is enormous (multiplying the extents of the various summation indices gives O(1022 ) operations).RL1. RL] × Y [r.R L ∗k. Pd. The number of operations can be considerably reduced by applying algebraic properties to factor out terms that are independent of some of the summation indices. One such application is the calculation of electronic properties of MX materials with the inclusion of many-body effects [Aulbur 1996]. This problem arises.RL2. The interpretation of the variables and their ranges are given in Table I.2 · Chi-Chung Lam et al.R L (r) = ΨR L (r − R ) Ψ∗ R L (r − R ) In the above equations. we briefly touch upon some of the optimization issues that arise. Rojas et al. 1997b]. r1] When expressed in this form. where Ψ is written as a two-dimensional array Y . t] ×exp[k. TBD. There are many different ways of applying algebraic laws of distributivity and associativity. t] × G[RL2. Br. r] × exp[k. RL3.G (k. Although it requires less memory than the previous form.R L (r) GR L . No. 1997a. iτ ) = i R L .

in Figure 1(a). j] × B[j. k.e. TBD. For instance. It expresses the sequence of steps in computing the multi-dimensional integral as a sequence of formulae. k. Vol. One equivalent form that only requires 2 × Nj × Nk × Nl + 2 × Nj × Nk + Ni × Nj operations is given in Figure 1(b). No. k. each evaluating an intermediate result. j] f2 × B[j. TBD. k] = f [j. the internal nodes are intermediate results. If implemented directly as expressed (i. Each formula computes some intermediate result and the last formula gives the final result. and the root node is the final integral. k. However. the input arrays and the intermediate arrays are often so large that they cannot all fit into available memory. l] (c) An expression tree representation of (b) Fig. k. assuming associative reordering of the operations and use of the distributive law of multiplication over addition is satisfactory for the floating-point computations. However. l] l 2 = f1 [j] × f3 [j.. k] f4 [j.j. Figure 1(c) shows the expression tree corresponding to the example formula sequence. l] f3 [j. . This problem has been proved to be NP-complete and a pruning search algorithm was proposed. k] W [k] = f5 [k] = A[i. l] (a) A multi-dimensional integral f1 [j] f2 [j.l) · 3 A[i. 1. l] × C[k. TBD TBD. A sequence of formulae can also be represented as an expression tree in which the leaf nodes are input arrays. The minimization of memory usage is thus desirable. l] × C[k. An example multi-dimensional integral and two representations of a computation.Memory-Optimal Evaluation of Expression Trees W [k] = (i. the next step is to implement it as some loop structure. the above computation can be rewritten in various ways. in practice. By fusing loops it is possible to ACM Transactions on Programming Languages and Systems. k] j 4 (b) A formula sequence for computing (a) f5 j f4 × f1  d   d i f3 k A[i. l]  d   d C[k. Once the operation-minimal form is determined. A straightforward way is to generate a sequence of perfectly nested loops. j] i = B[j. the computation would require 2 × Ni × Nj × Nk × Nl arithmetic operations to compute. as a single set of perfectly-nested loops). l] = f [j.

In this paper. G. E. The size of each data object is shown alongside the corresponding node label. TBD. we formally define the memory usage optimization problem and make some observations about it. A simpler problem related to the memory usage minimization problem is the register allocation problem for binary expression trees in which the sizes of all nodes are unity. As an example of the memory usage optimization problem. space for it must be allocated. It has a maximum memory usage of 45 units. It has been addressed in [Nakata 1967. Lam 1999]. . G. B. it is not necessary to keep all arrays allocated for the duration of the entire computation. When allocating and deallocating arrays dynamically. where n is the number of nodes in the expression tree. G. D. This space can be deallocated only after the evaluation of its parent is complete. TBD TBD. 1999. The rest of this paper is organized as follows. Section 3 presents an efficient algorithm for finding the optimal evaluation order for an expression ACM Transactions on Programming Languages and Systems. the evaluation order affects the maximum memory requirement. An example expression tree   d   d reduce the dimensionality of some of the intermediate arrays [Lam et al.g. we focus on the problem of finding an evaluation order of the nodes in a given expression tree that minimizes dynamic memory usage. E. I : 16 F : 15      H:5 B : 3 E : 16 G : 25 A : 20 D : 9 C : 30 Fig. the problem becomes NP-complete [Sethi 1975]. B. But if the expression tree is replaced by a directed acyclic graph (in which all nodes are still of unit size). e. for computing the electronic properties of MX materials. However. One of them is the post-order traversal A. C.4 · Chi-Chung Lam et al.. I of the expression tree. In Section 2. and H are in memory. Sethi and Ullman 1970] and can be solved in Ω(n) time. which uses 39 units of memory. which could result in non-optimal solutions (as we show through an example in Section 2). 2. I . Furthermore. Vol. they restricted their attention to solutions that evaluate subtrees contiguously. TBD. A. Before evaluating a data object. Space allocated for an intermediate array can be deallocated as soon as the operation using the array has been performed. H. The algorithm in [Sethi and Ullman 1970] for expression trees of unit-sized nodes does not extend directly to expression trees having nodes of different sizes. This occurs during the evaluation of H. F. The generalized register allocation problem for expression trees was addressed in the context of vector machines in [Rauber 1990a. consider the expression tree shown in Fig. H. F. 1990b]. Other evaluation orders may use more memory or less memory. 2. No. D. is not trivial. when F . Appel and Supowit [Appel and Supowit 1987] generalized the register allocation problem to higher degree expression trees of arbitrarily-sized nodes. We believe that the solution we develop here may have applicability to other areas such as database query optimization and data mining. Finding the optimal order C. A solution to this problem would allow the automatic generation of efficient code for evaluating expression trees. There are many allowable evaluation orders of the nodes.

and (2) maxvi ∈P {himem(vi . Space for each data object is dynamically allocated/deallocated in its entirety. . P ) = lomem(A. Each data object depends on all its children (if any).e. For instance. P ) = lomem(vi−1 . G. P ) = A. TBD. vj . we will use these two terms interchangeably. E G F. The correctness of the algorithm is proved in Section 5.size for each node v ∈ T . internal node objects must be allocated before their evaluation begins. the space allocated to all its children may be released.size = 20. v2 .. F. 2. P ) = 0 if i = 0 Here. · 5 himem 0 + 20 = 20 20 + 3 = 23 3 + 30 = 33 33 + 9 = 42 12 + 16 = 28 19 + 15 = 34 15 + 25 = 40 40 + 5 = 45 20 + 16 = 36 45 deallocate A C D B. we need to allocate space for B. TBD. We assume that the evaluation of each node in the expression tree is atomic. In Section 4. 3. then i > j. Leaf node objects are created or read in as needed. E. P ) is the memory usage during the evaluation of vi in the traversal P . PROBLEM STATEMENT We address the problem of optimizing the memory usage in the evaluation of a given expression tree whose nodes correspond to large data objects of various sizes. 2. himem(vi . So. TBD TBD. C. consider the post-order traversal P = A. where himem(vi . Section 6 provides conclusions. . and lomem(vi . P ) is the memory usage upon completion of the same evaluation. an ordering P = v1 . and thus can be evaluated only after all its children have been evaluated. if vi is the parent of vj . Vol. .size himem(vi . P ) − {child vj of vi } vj . These definitions reflect that we need to allocate space for vi before its evaluation. such that (1) for all vi . i. where n is the number of nodes in T . P )} is minimized. himem(A. vn of the nodes in T . P ) + vi . H. we prove that the algorithm finds a solution in Θ(n log2 n) time for an n-node expression tree. . During and after the evaluation of A. H lomem 20 − 0 = 20 23 − 20 = 3 33 − 0 = 33 42 − 30 = 12 28 − 9 = 19 34 − 3 − 16 = 15 40 − 0 = 40 45 − 25 = 20 36 − 15 − 5 = 16 Memory usage of a post-order traversal of the expression tree in Fig. I of the expression tree shown in Fig. find a computation of T that uses the least memory. We define the problem formally as follows: Given a tree T and a size v. and that after evaluation of vi . Since an evaluation order is also a traversal of the nodes. The goal is to find an evaluation order of the nodes that uses the least amount of memory.size if i > 0 lomem(vi . For evaluating B. No. 2 tree. thus ACM Transactions on Programming Languages and Systems. B. and each object must remain in memory until the evaluation of its parent is completed. D. A is in memory. .Memory-Optimal Evaluation of Expression Trees Node A B C D E F G H I max Fig.

The post-order traversal of the given expression tree, however, is not optimal in memory usage. In this example, none of the traversals that visit all nodes in one subtree before visiting another subtree is optimal. There are four such traversals: A, B, C, D, E, F, G, H, I; C, D, E, A, B, F, G, H, I; G, H, A, B, C, D, E, F, I; and G, H, C, D, E, A, B, F, I. If we follow the traditional wisdom of visiting the subtree with the higher memory usage first, as in the Sethi-Ullman algorithm [Sethi and Ullman 1970], the best we can get is a suboptimal traversal: the best of these four traversals, G, H, C, D, E, A, B, F, I, has an overall memory usage of 44 units (see Fig. 4(a)) and is not optimal. The optimal traversal, which uses only 39 units of memory, is C, D, G, H, A, B, E, F, I, as shown in Fig. 4(b). Notice that it 'jumps' back and forth between the subtrees. Thus, any algorithm that only considers traversals that visit subtrees contiguously may not produce an optimal solution.
Fig. 4. Memory usage of two different traversals of the expression tree in Fig. 2.
(a) A better traversal
    Node    G    H    C    D    E    A    B    F    I    max
    himem   25   30   35   44   30   41   44   39   36   44
    lomem   25   5    35   14   21   41   24   20   16
(b) The optimal traversal
    Node    C    D    G    H    A    B    E    F    I    max
    himem   30   39   34   39   34   37   33   39   36   39
    lomem   30   9    34   14   34   17   24   20   16
One might attempt to take two optimal subtree traversals, one from each child of a node X, merge them together optimally, and then append X to form a traversal for X. However, this resulting traversal may not be optimal for X, i.e., locally optimal traversals may not be globally optimal. The memory usage optimization problem has an interesting property: an expression tree or a subtree may have more than one optimal traversal. Continuing with the above example, for the subtree rooted at F, the traversals C, D, A, B, E, F and C, D, E, A, B, F both use the least memory space of 39 units. The optimal traversal C, D, A, B, E, F for the subtree rooted at F can be merged with G, H (these two traversals are optimal for the subtrees rooted at F and H, respectively) and I can then be appended to form C, D, G, H, A, B, E, F, I, which is an optimal traversal of the entire expression tree. But if we merge together the other optimal traversal C, D, E, A, B, F and G, H, the best we can get is the suboptimal traversal G, H, C, D, E, A, B, F, I of Fig. 4(a). In the next section, we present an efficient algorithm that finds traversals which are not only locally optimal but also globally optimal.
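Because the example tree has only nine nodes, the claims above can be checked by exhaustive search. The sketch below (not from the paper; the tree encoding and helper names are illustrative) enumerates every valid evaluation order of the Fig. 2 tree and confirms that the minimum memory usage is 39 units, that the traversal of Fig. 4(b) attains it, and that the post-order traversal needs 45 units.

    # Illustrative sketch: brute-force check of the optimum for the Fig. 2 tree.
    from itertools import permutations

    PARENT = {'A': 'B', 'B': 'F', 'C': 'D', 'D': 'E', 'E': 'F',
              'F': 'I', 'G': 'H', 'H': 'I', 'I': None}
    SIZE = {'A': 20, 'B': 3, 'C': 30, 'D': 9, 'E': 16,
            'F': 15, 'G': 25, 'H': 5, 'I': 16}
    CHILDREN = {v: [c for c, p in PARENT.items() if p == v] for v in SIZE}

    def max_memory(order):
        lo, peak = 0, 0
        for v in order:
            hi = lo + SIZE[v]
            peak = max(peak, hi)
            lo = hi - sum(SIZE[c] for c in CHILDREN[v])
        return peak

    def valid(order):
        # every node must be evaluated before its parent
        pos = {v: i for i, v in enumerate(order)}
        return all(p is None or pos[v] < pos[p] for v, p in PARENT.items())

    print(min(max_memory(o) for o in permutations(SIZE) if valid(o)))   # 39
    print(max_memory(('C', 'D', 'G', 'H', 'A', 'B', 'E', 'F', 'I')))    # 39, Fig. 4(b)
    print(max_memory(('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I')))    # 45, post-order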

3. AN EFFICIENT ALGORITHM
We now present an efficient divide-and-conquer algorithm that, given an expression tree whose nodes are large data objects, finds an evaluation order of the tree that minimizes the memory usage. The algorithm visits the nodes in a bottom-up order and, for each node in the expression tree, computes an optimal traversal for the subtree rooted at that node. The optimal subtree traversal that it computes has a special property: it is not only locally optimal for the subtree, but also globally optimal in the sense that it can be merged together with globally optimal traversals for other subtrees to form an optimal traversal for a larger tree that is also globally optimal. As we have seen in Section 2, not all locally optimal traversals for a subtree can be used to form an optimal traversal for a larger tree.
The algorithm stores a traversal not as an ordered list of nodes, but as an ordered list of indivisible units or elements. Each element (or indivisible unit) in a traversal is a (nodelist, hi, lo) triple, where nodelist is an ordered list of nodes, hi is the maximum memory usage during the evaluation of the nodes in nodelist, and lo is the memory usage after those nodes are evaluated. Using the terminology from Section 2, hi is the highest himem among the nodes in nodelist and lo is the lomem of the last node in nodelist. Each element has the property that there necessarily exists some globally optimal traversal of the entire tree wherein its sequence of nodes appears undivided; inserting any node in between the nodes of an element does not lower the total memory usage. This means that indivisible units can be treated as a whole, and we only need to consider the relative order of indivisible units from different subtrees. An element initially contains a single node, but as the algorithm goes up the tree merging traversals together and appending new nodes to them, elements may be appended together to form new elements containing a larger number of nodes. Once formed, the order of indivisible units in a traversal stays invariant: the indivisible units must appear in the same order in some optimal traversal of the entire expression tree.
We can also visualize the elements in a traversal as follows. The first element in a traversal contains the nodes from the beginning of the traversal up to and including the node that has the lowest lomem after the node that has the highest himem; the highest himem and the lowest lomem become the hi and the lo of the first element. The same rules are recursively applied to the remainder of the traversal to form the second element, and so on.
The algorithm always maintains the elements of a traversal in decreasing hi and increasing lo order, which implies an order of decreasing hi-lo difference. Whenever two adjacent elements are not in decreasing hi and increasing lo order, we combine them into one element by concatenating the nodelists and taking the highest hi and the second lo. In Section 5, we prove that arranging the indivisible units in this order minimizes memory usage.
Before formally describing the algorithm, we illustrate how it works with an example. Consider the expression tree shown in Fig. 2. (For clarity, we write nodelists in a sequence as strings.) Since A has no children, the optimal traversal for the subtree rooted at A, denoted by A.seq, is (A, 20, 20), meaning that 20 units of memory are needed during the evaluation of A and immediately afterwards. To form B.seq, we take A.seq, append a new element (B, 3 + 20, 3) = (B, 23, 3) (the hi of which is adjusted by the lo of the preceding element), and get (A, 20, 20), (B, 23, 3). These two adjacent elements are not in decreasing hi and increasing lo order, so we combine them: B.seq becomes (AB, 23, 3). Thus, A and B form an indivisible unit, implying that B must follow A in some optimal traversal of the entire expression tree.
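The append-and-combine rule just described is small enough to show directly. The sketch below (not from the paper; the list-of-tuples representation and the name append_element are illustrative) reproduces how (A, 20, 20) followed by (B, 23, 3) collapses into the single unit (AB, 23, 3).

    # Illustrative sketch: the (nodelist, hi, lo) element representation and the
    # combine-on-append rule. Elements are kept in strictly decreasing hi and
    # strictly increasing lo order; an appended element absorbs any trailing
    # elements that would violate this order.
    def append_element(seq, nodelist, hi, lo):
        while seq and (hi >= seq[-1][1] or lo <= seq[-1][2]):
            prev_nodes, prev_hi, _ = seq.pop()
            nodelist = prev_nodes + nodelist      # concatenated nodelist
            hi = max(prev_hi, hi)                 # highest hi, lo of the new element
        seq.append((nodelist, hi, lo))

    A_seq = []
    append_element(A_seq, 'A', 20, 20)            # leaf A: (A, 20, 20)
    B_seq = list(A_seq)
    append_element(B_seq, 'B', 3 + 20, 3)         # hi adjusted by the preceding lo
    print(B_seq)                                  # [('AB', 23, 3)] -- A and B form one unit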

Similarly, C.seq = (C, 30, 30), D.seq = (CD, 39, 9), and E.seq, obtained by appending the new element (E, 9 + 16, 16) = (E, 25, 16) to D.seq, is (CD, 39, 9), (E, 25, 16). Note that these two adjacent elements cannot be combined because they are already in decreasing hi and increasing lo order. For node F, which has two children B and E, we merge B.seq and E.seq in the order of decreasing hi-lo difference. The merged elements are in the order (CD, 39, 9), (AB, 32, 12), (E, 28, 19), with the hi and lo values of each element adjusted by the lo value of the last merged element from the other subtree. Then we append to F.seq the new element (F, 15 + 19, 15) = (F, 34, 15). The new element is combined with the last two elements in F.seq to ensure that the elements are in decreasing hi and increasing lo order. Hence, F.seq = (CD, 39, 9), (ABEF, 34, 15), a sequence of only two indivisible units. The optimal traversals for the other nodes are computed in the same way and are shown in Fig. 5. At the end, the algorithm returns the optimal traversal C, D, G, H, A, B, E, F, I for the entire expression tree (see Fig. 4(b)).
Fig. 5. Optimal traversals for the subtrees in the expression tree in Fig. 2.
    Node v   Subtree traversals merged in decreasing hi-lo order   Optimal traversal v.seq
    A        --                                                    (A, 20, 20)
    B        (A, 20, 20)                                           (AB, 23, 3)
    C        --                                                    (C, 30, 30)
    D        (C, 30, 30)                                           (CD, 39, 9)
    E        (CD, 39, 9)                                           (CD, 39, 9), (E, 25, 16)
    F        (CD, 39, 9), (AB, 32, 12), (E, 28, 19)                (CD, 39, 9), (ABEF, 34, 15)
    G        --                                                    (G, 25, 25)
    H        (G, 25, 25)                                           (GH, 30, 5)
    I        (CD, 39, 9), (GH, 39, 14), (ABEF, 39, 20)             (CDGHABEFI, 39, 16)
Fig. 6 shows the algorithm. The input to the algorithm (the MinMemTraversal procedure) is an expression tree T, in which each node v has a field v.size denoting the size of its data object. The procedure performs a bottom-up traversal of the tree and, for each node v, computes an optimal traversal v.seq for the subtree rooted at v. The optimal traversal v.seq is obtained by optimally merging together the optimal traversals u.seq from each child u of v and then appending v as a new element whose hi is adjusted by the lo of the preceding element. At the end, the procedure returns the concatenation of all the nodelists in T.root.seq as the optimal traversal for the given expression tree; the memory usage of the optimal traversal is T.root.seq[1].hi.
Table I. Variables in an example physics computation.
    Variable   Description                                  Range
    RL         Orbital                                      10^3
    r          Discretized points in real space             10^5
    τ          Time step                                    10^2
    k          K-point in an irreducible Brillouin zone     10
    G          Reciprocal lattice vector                    10^3
    MinMemTraversal(T):
        foreach node v in some bottom-up traversal of T
            v.seq = empty sequence
            foreach child u of v
                v.seq = MergeSeq(v.seq, u.seq)
            if |v.seq| > 0 then                          // |x| is the length of x
                base = v.seq[|v.seq|].lo
            else
                base = 0
            AppendSeq(v.seq, v, v.size + base, v.size)
        nodelist = empty list
        for i = 1 to |T.root.seq|
            nodelist = nodelist + T.root.seq[i].nodelist // + is concatenation
        return nodelist                                  // memory usage is T.root.seq[1].hi

    MergeSeq(S1, S2):
        S = empty sequence
        i = j = 1
        base1 = base2 = 0
        while i <= |S1| or j <= |S2|
            if j > |S2| or (i <= |S1| and S1[i].hi - S1[i].lo > S2[j].hi - S2[j].lo) then
                AppendSeq(S, S1[i].nodelist, S1[i].hi + base1, S1[i].lo + base1)
                base2 = S1[i].lo
                i++
            else
                AppendSeq(S, S2[j].nodelist, S2[j].hi + base2, S2[j].lo + base2)
                base1 = S2[j].lo
                j++
        end while
        return S

    AppendSeq(S, nodelist, hi, lo):
        E = (nodelist, hi, lo)                           // new element to append to S
        i = |S|
        while i >= 1 and (E.hi >= S[i].hi or E.lo <= S[i].lo)
            E = (S[i].nodelist + E.nodelist, max(S[i].hi, E.hi), E.lo)   // combine S[i] with E
            remove S[i] from S
            i--
        end while
        S = S + E                                        // |S| is now i + 1
Fig. 6. Procedure for finding a memory-optimal traversal of an expression tree.
The MergeSeq procedure optimally merges two given traversals S1 and S2 and returns the merged result S. The optimal merge is performed in a fashion similar to merge-sort: elements from S1 and S2 are scanned sequentially and appended into S in the order of decreasing hi-lo difference. This order guarantees that the indivisible units are arranged to minimize memory usage. Since S1 and S2 are formed independently, the hi and lo values in the elements from S1 and S2 must be adjusted before they can be appended to S. The amount of adjustment for an element from S1 (S2) equals the lo value of the last merged element from S2 (S1), which is kept in variable base1 (base2).
The AppendSeq procedure appends the new element specified by the triple (nodelist, hi, lo) to the given traversal S. Before the new element E is appended to S, it is combined with any elements at the end of S whose hi is not higher than E.hi or whose lo is not lower than E.lo. The combined element has the concatenated nodelist and the highest hi, but the original E.lo.
The algorithm has the property that the traversal it finds for a subtree is not only optimal for that subtree but must also appear as a subsequence in some optimal traversal for any larger tree that contains it as a subtree. For example, E.seq is a subsequence in F.seq (see Fig. 5), which is in turn a subsequence in I.seq.
4. COMPLEXITY OF THE ALGORITHM
The time complexity of our algorithm is Θ(n log² n) for an n-node expression tree. The cost of constructing the final evaluation order consists of the cost for building sequences of indivisible units and the cost for combining indivisible units into larger indivisible units. Combining indivisible units into larger ones reduces the number of elements in the sequences and, therefore, the time required for merging and combining sequences. For finding an upper bound on the cost of the algorithm, the worst-case cost for building sequences can therefore be analyzed separately from the worst-case cost for combining indivisible units.
We represent a sequence of indivisible units as a red-black tree with the indivisible units at the leaves. The tree is sorted by decreasing hi-lo difference. In addition, the leaves are linked in sorted order in a doubly-linked list, and a count of the number of indivisible units in the sequence is maintained. Inserting a node into a sequence represented as a red-black tree costs O(log n) time.
The sequence for a leaf node of the expression tree can be constructed in constant time. For a unary interior node, we simply append the node to the sequence of its subtree, which costs O(log n) time. For an m-ary interior node, we merge the sequences of the subtrees by inserting the nodes from the smaller sequences into the largest sequence, with each insertion costing O(log n) time. Since we always insert the nodes of the smaller sequences into the largest one, every time a given node of the expression tree gets inserted into a sequence, the size of the sequence containing this node at least doubles. Each node, therefore, can be inserted into a sequence at most O(log n) times. The cost for building the traversal for the entire expression tree is, therefore, O(n log² n).
Two individual indivisible units can be combined in constant time. When combining two adjacent indivisible units within a sequence, one of them must be deleted from the sequence and the red-black tree must be re-balanced, which costs O(log n) time. Since there can be at most n − 1 of these combine operations, the total cost of combining is O(n log n). The cost of the whole algorithm is, therefore, dominated by the cost for building sequences, which is O(n log² n).
In the best case, indivisible units are always combined such that each sequence contains a single element; in this case, the algorithm only takes linear time. In the worst case, a degenerate expression tree consists of small nodes and large nodes such that every small node has as its only child a large node. A pair of such nodes forms an indivisible unit with the hi being the size of the large node and the lo being the size of the small node, and such a tree can be constructed so that these indivisible units do not further combine into larger indivisible units. In this case, the algorithm results in a sequence of n/2 indivisible units containing two nodes each. If such a degenerate expression tree is also unbalanced, the algorithm requires Ω(n log² n) time for computing the optimal traversal.
CORRECTNESS OF THE ALGORITHM We now show the correctness of the algorithm. Let v be any node in an expression tree.e. (2) S[i].e. we do not need to consider breaking it up later. P ). for all k < m. L EMMA 5. . i.lo = lomem(vn . (1) S[i]. Thus. TBD TBD. The first lemma establishes some important invariants about the indivisible units in an ordered list v.lo < lomem(vk . TBD.nodelist appear in the same order as they are in S[i]. TBD. (3) for all m ≤ k ≤ n.nodelist = v1 . v2 ..seq that represents a traversal. Lemma 1 characterizes the indivisible units in a subtree traversal by providing some invariants. (c) S[j]. the indivisible units in a sequence are in decreasing hi and increasing lo order. the algorithm requires Ω(n log2 n) time for computing the optimal traversal. Then. S[j]. any nodes that are in between the nodes in ACM Transactions on Programming Languages and Systems. vn and vm be the last node in S[i]. P = S[1]. P ). Given a sequence of nodes. Proof The above invariants are true by construction.seq.1. P ). This lemma allows us to treat each traversal as a sequence of indivisible units (each containing one or more nodes) instead of a list of the individual nodes. and P be the traversal represented by S of the subtree rooted at v. S[j]. P ) < himem(vm . but not contiguously. The second lemma asserts the ‘indivisibility’ of an indivisible unit by showing that unrelated nodes inserted in between the nodes of an indivisible unit can always be moved to the beginning or the end of the indivisible unit without increasing memory usage. The hi value of an indivisible unit is the highest himem of the nodes in the indivisible unit. let S[i].hi. P ). S = v. If such a degenerate expression tree is also unbalanced. an indivisible unit extends to the last node with the lowest lomem following the last node with the highest himem. (b) for all 1 ≤ k ≤ n. Lemma 3 deals with the optimal ordering of indivisible units from different subtrees. Finally. each indivisible unit can be considered as a whole. (4) for all 1 ≤ j < i. No. himem(vk . Let v be a node in an expression tree T . P ). P ) ≤ himem(vm .nodelist.nodelist + · · · + S[|S|]. lomem(vk .hi > himem(vk . . using the three lemmas we prove the correctness of the algorithm by arguing that any traversal of an expression tree can be transformed in a series of steps into the optimal traversal found by the algorithm without increasing memory usage. In addition. we have.lo. (a) for all 1 ≤ k ≤ n. 5. and (d) S[j].hi = himem(vm . S = v. and P be a traversal of T in which the nodes from S[i]. Vol. P ) and for all k > m.lo < S[i]. P ) ≥ lomem(vn .2. Lemma 2 shows that.Memory-Optimal Evaluation of Expression Trees · 11 in a sequence of n/2 indivisible units containing two nodes each. The proof proceeds as follows.hi > S[i].. The algorithm maintains the following invariants: For all 1 ≤ i ≤ |S|.nodelist. once an indivisible unit is formed. P ). Then. L EMMA 5.seq. i. . himem(vk . . once formed.nodelist that has the maximum himem value. . The lo value of an indivisible unit is the lomem of the last node in the indivisible unit.

S) ≤ himem(vj . . . S) ≤ himem(vj . . . . S) + lomem(va .3. . va+1 . It shows that arranging indivisible units from different subtrees in the order of decreasing hi-lo difference minimizes memory usage. . v1 . . the memory usage for P at vk is no higher than the memory usage for P at vk . one from each of S and S . vb . . . Q) + lomem(v0 . va (not necessarily contiguously) before vm and va+1 . vb to the end of S[i]. . vb . the overall maximum memory usage cannot increase. . since for all m ≤ k ≤ n. Then. The next lemma deals with the ordering of indivisible units. lomem(vk . S = v. . . Let Q be the traversal obtained from P by removing the nodes in S[i]. . (3) For all 1 ≤ j ≤ a. . since two indivisible units that are not in that order can be interchanged in the merged traversal without increasing memory usage. . . the memory usage for P at vj is no higher than the memory usage for P at vj . Q)+lomem(vn . which remains unchanged.nodelist. . vn in P with v1 . . Q) ≤ himem(vm . . Q) = himem(vm . Traversals P and P differ in memory usage only at the nodes {v1 . Vol. Then there exists an adjacent pair of elements. TBD. . vn .nodelist such that for all k < m. . . S) + lomem(va . that are not in that order.e. In other words. S) < lomem(vk . TBD TBD. provided that none of the nodes that are in between the nodes in S[i]. .nodelist and va+1 . vn . P ) = himem(vk . P ) ≤ himem(vm . vm be the node in S[i]. since himem(vk . among all possible merges of S and S . . . Q) = himem(vm . the merge that arranges the elements from S and S in the order of decreasing hi-lo difference uses the least memory. S[i]. We construct another traversal P of T from P by moving v1 . Proof Let M be a merge of S and S that is not in the order of decreasing hi-lo difference. vb be the ‘foreign’ nodes.nodelist are ancestors or descendants of any nodes in S[i]. . . lomem(v0 . va . P ). P ) = himem(vm . himem(vm . . . . Let v1 . P ) = himem(vj . with v1 . . vm . . since for all 1 ≤ k ≤ m. vn after moving the foreign nodes cannot exceed that of vm . and the memory usage of the foreign nodes does increase as a result of moving them. . S) ≤ himem(vm . . P does not use more memory than P because: (1) The memory usage for P at vm is the same as the memory usage for P at vm since himem(vm . . .nodelist. . L EMMA 5.nodelist without increasing memory usage. P ). . . (2) For all 1 ≤ k ≤ n. P ). the memory usage for P at vj is no higher than the memory usage for P at vj . S) (by invariant 4(b) in Lemma 1) implies himem(vj . va to the beginning of S[i]. . v1 . Let v and v be two nodes in an expression tree that are siblings of each other. S) > himem(vk . . . . va+1 . S) (by invariant 3 in Lemma 1) implies himem(vj . . . i. . . P ). Without loss of generality. . P ) = himem(vj .nodelist = v1 .nodelist. . . to form P . v1 . vm . . . . P ) and for all k > m. . and S = v . . No. . .seq. v0 be the node before v1 in S. va .seq.. S) ≥ lomem(vn . . . . vb (not necessarily contiguously) after vm in P . .12 · Chi-Chung Lam et al. Proof Let S[i]. . S).nodelist in P . . S) implies that himem(vk . . . the nodes that are in between the nodes in S[i]. S) + lomem(va . . we replace v1 . . TBD. himem(vk . (4) For all a < j ≤ b. vn . . . Since the memory usage of any node in v1 . . . . . . . vb }.nodelist can always be moved to the beginning or the end of S[i]. we assume the first element is S [j] from S and ACM Transactions on Programming Languages and Systems.

lo. .Memory-Optimal Evaluation of Expression Trees element . The first kind of units are the indivisible units in u. P ) and E. (b) Sequence M Memory usage comparison of two traversals in Lemma 5. To simplify the notation. In the following steps.hi. Invariant 4 in Lemma 1 gives us Lj > Lj−1 .seq.e. . . Hi + Lj ) while the memory usage of M at the same two elements is max(Hj + Li . Given an expression tree. . 7. . . By switching all adjacent pairs in M that are out of order until no such pair exists. TBD TBD. . Lr = S[r]. . S [j] S[i] . .lo = lomem(wn . Hj + Li−1 Hi + L j . which implies Hi + Lj > Hi + Lj−1 .lo < E. . The memory usage of M and M differs only at S [j] and S[i] and is compared in Fig. For each unit E of the second kind. . This implies Hi + Lj > Hj + Li . . element . for each non-leaf node v in T . . . Hi + Lj−1 Hj + L i . and Lr = S [r]. . (2) First. we get an optimal order without increasing memory usage.seq. . . Hi + Lj−1 ).lo. himem . M cannot use more memory than M . The memory usage of M at the two elements is max(Hj + Li−1 .. . Lj + Li−1 Li + Lj . .3 the second one is S[i] from S. Hi + Lj ) ≥ Hi + Lj > max(Hj + Li . ACM Transactions on Programming Languages and Systems. i. . we sort the components of the indivisible units in v. lomem . For this sorting step. let Hr = S[r]. TBD. we will rearrange the nodes in P such that the nodes that form an indivisible unit in v. the algorithm presented in Section 3 computes a traversal that uses the least memory.seq so that they are in the same order as in v. Vol. . The sorting process involves rearranging two kinds of units. . T HEOREM 5. No.hi. The second kind of units are the contiguous sequences of nodes in P which are from T − T . Li + Lj−1 Lj + Li . we perform the following steps: (1) Let T be the subtree rooted at v and P be the minimal substring of P that contains all the nodes from T − {v}. TBD. Hr = S [r]. . we take E. Therefore. we visit the nodes in T in a bottom-up manner and. Thus. lomem .hi − E . Given a traversal P for an expression tree T . S[i] S [j] . Hi + Lj−1 ). While there exist two adjacent units E and E in P such that E is before E and E . himem . Proof We prove the correctness of the algorithm by describing a procedure that transforms any given traversal to the traversal found by the algorithm without increase in memory usage in any transformation step. the hi-lo difference of S [j] must be less than that of S[i]. P ) where wn is the last node in E. we temporarily treat each such maximal contiguous sequence of nodes as a unit.seq are contiguous and the indivisible units are in the same order as they are in v.4. Hj − Lj < Hi − Li .seq for each child u of v.hi = maxw∈E himem(w. . The sorting process is as follows. Since the two elements are out of order.hi − E.lo. Consider the merge M obtained from M by interchanging S [j] and S[i]. 7. max(Hj + Li−1 . · 13 (a) Sequence M Fig. .

By induction.seq. Vol. v and its ancestors do not appear in P . Upon completion.seq. Hence. no traversal can use less memory and the algorithm is correct. if there are nodes from T − T in between the nodes from E. all units of the first kind. When the above sorting process finishes. (3) Now that the components of the indivisible units in v. No. Also. (a) Swap E and E.seq. each indivisible unit in v.14 · Chi-Chung Lam et al. . this does not increase the memory usage. We use induction to show that the above procedure correctly transforms any given traversal P into an optimal traversal found by the algorithm. Since any traversal P can be transformed into a traversal found by the algorithm without increasing memory usage in any transformation step. their relative order is not changed. According to Lemma 2. TBD. (a) In the traversal P . the sorting process does not violate parent-child dependencies. H(v) becomes true.seq is contiguous in P and the order in P of the indivisible units is the same as they are in v. The procedure rearranges the nodes in P such that the nodes that form an indivisible unit in v.root) is true and a traversal found by the algorithm is obtained.seq. H(u) is true for every leaf node u because there is only one traversal order for a leaf node. (b) Make the contiguous sequence of nodes from E an indivisible unit. CONCLUSION In this paper.seq are in the correct order. For each indivisible unit E in v. for each child u of v. indivisible units in u. when the procedure finishes processing a node v. This problem arose in ACM Transactions on Programming Languages and Systems. 6. and nodes in units of the first kind are not ancestors or descendants of any nodes in units of the second kind. and the order among the nodes that are not in the subtree rooted at v is preserved. moving ‘foreign’ nodes out of an indivisible unit does not increase the memory usage. Therefore. Since. By Lemma 3.seq appear contiguously in P and are in the same order as they are in u.seq. move them either to the beginning or the end of E as specified by Lemma 2. combine the two units into one and recompute its new hi and lo. the sets of nodes corresponding to the indivisible units are in the same order in P as they are in v. TBD TBD.seq is the same as they are in u. As the induction step. H(T. and —the order in P of the indivisible units in u. (b) If two units of the second kind become adjacent to each other as a result of the swapping. Also. we have considered the memory usage optimization problem in the evaluation of expression trees involving large objects of different sizes. assume H(u) is true for each child u of a node v.seq have been in the correct order before the sorting process. the combining process does not violate parent-child dependencies. are in the order of decreasing hi-lo difference. which are components of the indivisible units in v. we make the indivisible units contiguous using the following combining process.seq are contiguous in P . Initially.seq. the order of the nodes from T − T is preserved. The order of the nodes from T − T is preserved because the sorting process never swaps two units of the second kind. Thus. TBD. The induction hypothesis H(u) for each node u is that: —the nodes in each indivisible unit in u.

226–248. 1995.. 417–421. AND L OUIE . 17. C. 157–168. T. SIAM J. C. TX. TBD. P. of Computer and Information Science. 1996. 1999.. Exper. IEEE Computer Society Press. In such situations. L AM . Parall. S ADAYAPPAN . AND N EEDS . revised TBD TBD. TBD TBD.).-C.D. R. S. AND W ENGER .-C. Phys. 3 (Sept. Process. In Ninth SIAM Conference on Parallel Processing for Scientific Computing. J. S. 715–728. H.. The generation of optimal code for arithmetic expressions. S ETHI .. A. R. thesis. Ph. In Eighth SIAM Conference on Parallel Processing for Scientific Computing. W.D.. N. B 34. C OCIORVA . Israel. 4. University Saarbr¨ cken. accepted TBD TBD ACM Transactions on Programming Languages and Systems. 492–494. 1987. J. On optimizing a class of multi-dimensional loops with reductions for parallel execution. Generalizations of the Sethi-Ullman algorithm for register allocation. 1990a. Society for Industrial and Applied Mathematics. A LOUANI . R. ACM 17. T. H YBERTSEN . The developed solution would apply in any context involving the evaluation of an expression tree. R AUBER . M. J. R. G. The Ohio State University. Received April 2002. . Ph. G ODBY. it is necessary to dynamically allocate and deallocate space for the intermediate results and to find an evaluation order that uses the least memory. 467–473. 7. J. ACM 10. C. Parallel implementation of quasiparticle calculations of semiconductors and insulators. C. S ETHI .). 6 (June). in which intermediate results are so large that it is impossible to keep all of them in memory at the same time. D. AND W ILKINS . Optimal Evaluation of Vector Expression Trees. TBD. 1997a. REFERENCES A PPEL . Ein Compiler f¨ r Vektorrechner mit Optimaler Auswertung von vektoriellen Ausu drucksb¨ umen. ROJAS . J. D. L AM . AND S UPOWIT. We have developed an efficient algorithm that finds an optimal evaluation order in Θ(n log2 n) time for an expression tree containing n nodes and proved its correctness. 1990b. Commun. 1970.-C. K. Rev. Society for Industrial and Applied Mathematics. I. a u R AUBER . OSU-CISRC-8/99-TR22. Acknowledgments We would like to thank Tamal Dey for help with the complexity proof and for proofreading the paper. Jerusalem. The Ohio State University. Pract. Lett. AND U LLMAN . Minneapolis. thesis. Optimization of a class of multi-dimensional integrals on parallel machines. On compiling algorithms for arithmetic expressions. Vol. 1975. P. 1 (Oct..-C. San Antonio. Rev. AND W ENGER . Dept. Ph. NAKATA . Electronic correlation in semiconductors and insulators: Band gaps and quasiparticle energies. thesis. No. 1967.D. Phys. R. R. 1986... In Proceedings of the 5th Jerusalem Conference on Information Technology. 2. S ADAYAPPAN . 5390. Lett. “Complete register allocation problems”.. 1827. S ADAYAPPAN . L AM . Performance optimization of a class of loops involving sums of products of sparse arrays. Softw. Also available as Technical Report No. 1999. W.Memory-Optimal Evaluation of Expression Trees · 15 the context of optimizing electronic structure calculations.. Comput. August 1999. 1997b. AULBUR .. We are grateful to the National Science Foundation for support of this work through the Information Technology Research (ITR) program. L AM . M. Performance optimization of a class of loops implementing multi-dimensional integrals. P. 74. Space-time method for Ab-initio calculations of self-energies and dielectric response functions of solids. W. MN. The Ohio State University.
