
Reprinted from the

Proceedings of the
GCC Developers’ Summit

June 17th–19th, 2008


Ottawa, Ontario
Canada
Conference Organizers
Andrew J. Hutton, Steamballoon, Inc., Linux Symposium,
Thin Lines Mountaineering
C. Craig Ross, Linux Symposium

Review Committee
Andrew J. Hutton, Steamballoon, Inc., Linux Symposium,
Thin Lines Mountaineering
Ben Elliston, IBM
Janis Johnson, IBM
Mark Mitchell, CodeSourcery
Toshi Morita
Diego Novillo, Google
Gerald Pfeifer, Novell
Ian Lance Taylor, Google
C. Craig Ross, Linux Symposium

Proceedings Formatting Team


John W. Lockhart, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights
to all as a condition of submission.
A Superoptimizer Analysis of Multiway Branch Code Generation
Roger Anthony Sayle
OpenEye Scientific Software
roger@eyesopen.com

Abstract

The high level of abstraction of the multi-way branch, exemplified by C/C++’s switch statement, Pascal/Ada’s case, and FORTRAN’s computed GOTO and SELECT, allows an optimizing compiler a large degree of freedom in its implementation. Typically compilers generate code selected from a few common patterns, such as jump tables and/or trees of conditional branches. However, for most switch statements there exists a vast number of possible implementations.

To assess the relative merits of these alternatives, and evaluate the selection strategies used by different compilers, a switch-statement “superoptimizer” has been developed. This exhaustively enumerates code sequences implementing a specified multiway branch. Each sequence is evaluated against a simple cost model for code-size, average and worst-case performance (optionally using profile information). Backend timings for several architectures are parameterized by microbenchmarking. The results and insights from applying this superoptimizer to a corpus of “real-world” examples will be discussed.

1 Introduction

Although numerous optimizing compilers were implemented prior to its publication, the first paper to describe the issues of code generation for multiway branches was published by Arthur Sale in 1981 [Sale81]. This now-classic paper compared several implementation strategies for the University of Tasmania’s Pascal compiler for the Burroughs B6700/7700 series. Ultimately this evaluation recommended the use of either a table jump or a balanced search tree of comparisons. It is interesting, and perhaps slightly disappointing, that over a quarter of a century later the switch code generation in the GNU GCC compiler is little changed from Sale’s original exposition.

Over the last 25 years, microprocessor architecture has advanced considerably, the cycle penalties of memory latency and branch misprediction have changed, and several novel implementations have been proposed.

2 Definitions

A switch statement (or multiway branch) is a construct found in most high-level languages for selecting from one of several possible blocks of code or branch destinations depending on the value of an index expression. It can be considered a generalization of the if statement that conditionally selects between two possible blocks of code depending upon the value of a Boolean expression. In this work, we restrict the index expression to be of an integral type.

Mathematically, a switch statement behaves like a function or mapping, F : Z → M, that for each element of the set Z, the set of all K-bit values where K is a positive integer, selects a single outcome from the set M of destination labels. The function F is surjective, such that |M| ≤ |Z| = 2^K.

The semantics of Pascal’s and Algol’s multiway branch, the case statement, permit that any index value not handled by the case statement may result in undefined behavior. We do not consider such partial functions, and instead restrict F to be a total function, even though this may restrict some of the potential optimizations allowed in those languages. Instead the function F is made total by mapping any value z ∈ Z that is not in the original domain to a unique default label, m_d ∈ M.

By this definition any pure const function of a single integer argument, such as factorial or Fibonacci, may be represented or canonicalized using a single switch statement. This ability can be generalized using a currying-like approach such that any pure const function can be represented by nested switch statements, or a single switch statement with a sufficiently large K.
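As a minimal illustration of the last point, the sketch below writes factorial over a 3-bit index (K = 3) as a single switch and compares it against a loop. The function names `factorial_switch` and `factorial_loop`, and the choice of K = 3, are assumptions made for this example only; they do not come from the paper.

```c
/* Factorial restricted to a 3-bit index (K = 3), canonicalized as one
   switch statement. Masking with 7 makes the function total over Z. */
static unsigned factorial_switch(unsigned x)
{
    switch (x & 7u) {
    case 0:
    case 1:  return 1u;
    case 2:  return 2u;
    case 3:  return 6u;
    case 4:  return 24u;
    case 5:  return 120u;
    case 6:  return 720u;
    default: return 5040u;   /* x == 7, the only remaining 3-bit value */
    }
}

/* Reference implementation used to check the switch form. */
static unsigned factorial_loop(unsigned x)
{
    unsigned result = 1u;
    for (unsigned i = 2u; i <= (x & 7u); i++)
        result *= i;
    return result;
}
```

Here the default label plays the role of m_d, although in this tiny example it is reached by exactly one index value.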

When |M| = 1, the switch statement F is said to be unconditional. When |M| = 2, then F is said to be binary.

In practice, we are mostly interested in switch statements where the size of the codomain M is small, with only a few labels, |M| ≪ |Z|. In such cases, a significant fraction of the elements of Z map to a default label m_d ∈ M. In such cases, it makes sense to talk of the set of case values V ⊆ Z that are defined as the values that don’t map to m_d, i.e. v ∈ V iff F(v) ≠ m_d. We can thus represent F as a set of ordered pairs F_V ⊆ V × (M − m_d) such that for all (v_i, m_i) ∈ F_V, every v_i is associated with a single label m_i, such that F(v_i) = m_i.

Additionally, when interpreted as either signed or unsigned integers, sequentially consecutive values in Z often map to the same destination label. Thus the switch statement F can also be efficiently represented by the set of ordered triples F_R ⊆ V × V × (M − m_d) such that every (l_i, h_i, m_i) ∈ F_R represents a contiguous case range, where l_i ≤ h_i and for all v_i such that l_i ≤ v_i ≤ h_i, we have F(v_i) = m_i and the case ranges are maximal, i.e. F(l_i − 1) ≠ m_i and F(h_i + 1) ≠ m_i.

We shall refer to the number of case values, |F_V|, and the number of case ranges, |F_R|.

3 Switch Lowering

This section lists several possible compilation strategies for implementing a multiway branch. The first few are well known and commonly employed by existing compilers, whilst the later strategies are perhaps more novel. For each implementation strategy, the high-level C (possibly using GCC extensions) and the equivalent Intel x86 assembler are given. We use the conventions that the index variable is x, the default label is called L0, and the case target labels are L1 ... Ln. In this work, we assume the target labels are opaque, and have no useful mathematical relationships (though some early compilers could rearrange code to simplify the task of switch implementation).

3.1 Unconditional Branch

The simplest of switch statement implementations and one of the fundamental building blocks is the unconditional jump. This is used for switch statements that contain only a default outcome, or during switch lowering after all other cases have been handled.

    goto L0;

    jmp L0

Unconditional jump instructions typically have low overhead on modern processors, where branch prediction is unnecessary and the instruction prefetch machinery can avoid stalling the execution pipeline. Additionally, basic block reordering can frequently eliminate the jump altogether, by placing the destination basic block at the site of the jump (if it has only a single incoming edge) or by duplicating the basic block otherwise (if it has more than one incoming edge) [Mueller02].

3.2 Sequential Tests

The simplest universal implementation strategy for switch statements is to check for equality against each case value sequentially.

    if (x == 1)
        goto L1;
    if (x == 0)
        goto L2;

    cmpl $1, %eax
    je L1
    testl %eax, %eax
    je L2

On many architectures, the performance of integer comparisons may be dependent on the value being compared. As shown in the assembly example above, the i386 provides special instructions for comparison against zero. On MIPS, comparisons against zero and one are cheaper than comparisons against signed 16-bit constants, which are cheaper than general 32-bit constants.

3.3 Jump Tables

A common method for implementing a switch statement is via an indirect jump (also known as a table jump). This uses the switch variable to index an array of destination addresses, and then branch to that location.

In the general case, to keep the size of the jump table reasonable, the minimum case value needs to be subtracted, and the result compared against the table range (as an array bounds check) before indexing the jump table. Fortunately, the minimum case value is often one or two allowing the subtraction to be omitted for a minor increase in the size of the jump table [Sayle01].
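The trade-off in the last sentence can be sketched in portable C, using a table of function pointers in place of label addresses. This is only an illustration under stated assumptions: the handler functions and the name `dispatch` are hypothetical, and the case values 1, 2 and 3 are chosen for the example. Because the minimum case value is one, the table gains a single padding slot for index zero that points at the default handler, and the subtraction disappears.

```c
/* Hypothetical handlers standing in for the code at labels L0..L3. */
typedef int (*handler_fn)(void);

static int on_default(void) { return 0; }
static int on_one(void)     { return 1; }
static int on_two(void)     { return 2; }
static int on_three(void)   { return 3; }

/* Table jump for case values 1, 2 and 3 without subtracting the minimum:
   slot 0 is padding that routes to the default handler. */
static int dispatch(unsigned int x)
{
    static const handler_fn table[4] = {
        on_default, on_one, on_two, on_three
    };
    if (x > 3u)              /* unsigned bounds check on the raw index */
        return on_default();
    return table[x]();
}
```

The cost of omitting the subtraction here is one extra table entry; with a larger minimum case value the padding grows accordingly.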
    unsigned int t = x - 10;
    if (t > 5)
        goto L0;
    static const void *T1[6] =
        { &&L1, &&L2, &&L3,
          &&L4, &&L5, &&L6 };
    goto *T1[t];

    subl $10, %eax
    cmpl $5, %eax
    ja L0
    jmp *T1(,%eax,4)
    T1: .long L1, L2, L3
        .long L4, L5, L6

An obvious free parameter and difference between compilers is the density threshold used to decide whether to use a table jump or not. When the case values are sequentially consecutive values that each jump to distinct labels the decision to use a table jump is obvious. However, as the density (approximately the number of case values divided by the difference between their upper and lower bounds) decreases, the memory to performance trade-off eventually reaches a critical limit where the additional cost in bytes provides insufficient benefit in clock cycles.

A less obvious difference is the lack of uniformity in the way that switch table density is calculated. One obvious approach is to use the number of case values, |F_V|, as the numerator. However, as mentioned in the next section, ranges of consecutive case values that branch to the same label can be efficiently implemented as range tests, whose cost is independent of the number of case values they cover. Hence, density may correlate better with the number of case ranges than with the number of case values. The approach currently used by GCC is to count non-singleton case ranges as twice as expensive as singleton case ranges. This provides a better measure of density for comparison against other implementation strategies. In practice, the real non-singleton to singleton cost ratio is less than two, even approaching one for switches consisting of consecutive adjoining ranges.

3.4 Range Tests

When two or more consecutive integer values branch to the same target label, it is possible to test for a range of values rather than each value independently. One obvious possibility is via a pair of conditional branches such as (x>=lo) && (x<=hi) where lo and hi are the lower and upper bounds of the range respectively. An often more efficient way of doing this is to subtract the lower bound, and then perform an unsigned comparison against the integer constant hi-lo which achieves the same thing with a single comparison. Obviously, if lo is zero the subtraction can be omitted. Warren [Warren02] also mentions the efficient use of shifts to perform range tests for suitable high and low bounds.

    if (x < 8)
        goto L1;

    unsigned int t = x >> 3;
    if (t == 0)
        goto L1;

3.5 Balanced Binary Trees

A more efficient use of compare and branch comparisons than the sequential testing described above is to perform a binary search on the target case values. By comparing against a (median) pivot value, the case values can be partitioned into those values less than or greater than the pivot. This reduces the time to perform a multiway branch from O(n) with the sequential method to O(log n), where n is the number of case values.

Each comparison in such a binary search refines the upper or lower bounds of the remaining partitions. Whilst many compilers perform only signed or only unsigned comparisons depending upon the type of the switch expression, it is potentially beneficial to use both forms in a single search. Another more commonly implemented improvement is to use the established upper and lower bounds to avoid comparisons. For example, if a binary search has already confirmed that the index value is greater than three, but less than five, there is no point in explicitly comparing against the value four. A naïve implementation of binary tree lowering using these bounds may inadvertently emit conditional branches to unconditional branches and similar inefficient idioms. These may be avoided by either using the usual CFG jump optimizations as a clean-up step, or by improvements in the binary tree lowering code.

Depending upon the architecture, it may be beneficial (or not) to perform three-way branches at each partition step, reusing the condition codes (or register) from a single comparison against an integer constant for both
an equality test and an ordering test. Currently, GCC always performs three-way branches, whilst LLVM always performs two-way (binary) branching. Even with three-way branching there is the decision of whether to perform the equality/inequality comparison before or after the ordering comparison. Unless the pivot value occurs frequently, it is theoretically better to perform the partitioning first and incur the cost of the equality comparison on only one of the following paths.

On many processors there is a significant asymmetry in the cycle counts for taken vs. not-taken branches (occasionally even when the outcome is correctly predicted). This difference means that a perfectly balanced binary tree will not have the best worst-case behavior. Instead appropriately skewed binary trees can lead to better performance. This is one reason why the sequential testing strategy described above is sometimes competitive for switches with four or more values.

    switch (x) {
    case 100: goto L1;
    case 200: goto L2;
    case 300: goto L3;
    case 400: goto L4;
    }

    if (x == 200)
        goto L2;
    if (x > 200) {
        if (x == 300)
            goto L3;
        if (x == 400)
            goto L4;
    }
    else {
        if (x == 100)
            goto L1;
    }

    cmpl $200, %eax
    je L2
    jbe LT1
    cmpl $300, %eax
    je L3
    cmpl $400, %eax
    je L4
    jmp L0
    LT1: cmpl $100, %eax
    je L1

An often overlooked aspect of performing binary comparisons both in sequential comparisons and in binary search trees is that it can sometimes be advantageous to compare against integer values (and ranges) not in the case set, V, i.e. those that map to the default label m_d. For example, when ranges on both adjoining sides of a ‘gap’ branch to the same destination, testing for the gap allows the adjoining ranges to be merged. Likewise for binary switch tables, it can occasionally be easier to identify the (complementary) default values than the (positive) case values.

3.6 Bit Tests

A relatively recent innovation in switch statement implementation is the use of bit tests [Sayle03a]. This technique allows several case values that share the same label to be tested simultaneously by representing each case by its own bit.

This technique requires that the range of case values not be larger than the number of bits in a machine word. Like jump tables, it is often necessary to perform a subtraction followed by a bounds check to eliminate values outside of the allowed range.

    if (x > 8)
        goto L0;
    unsigned int t = 1 << x;
    // cases 1, 4, and 8
    if ((t & 274) != 0)
        goto L1;
    // cases 0, 2, and 5
    if ((t & 37) != 0)
        goto L2;
    goto L0;

    cmpl $8, %eax
    ja L0
    movl %eax, %ecx
    movl $1, %eax
    sall %cl, %eax
    testl $274, %eax
    jne L1
    testb $37, %al
    jne L2

The performance of bit testing can be improved by testing the most frequent bit patterns (masks) first. If all cases are equally probable, this equates to testing the masks that have more bits set before those that have a few bits (or just a single one) set.
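The masks in the bit-test example can be derived mechanically from the case values. The following sketch, in which the helper names `case_mask` and `bit_test_dispatch` are hypothetical and labels are represented by small integers rather than jump targets, builds a mask by setting one bit per case value and then dispatches with the same bounds check and mask order as the example above.

```c
#include <stdint.h>

/* Build the bit mask for a set of case values; every value is assumed
   to be smaller than 32 so that each case owns one bit of a 32-bit word. */
static uint32_t case_mask(const unsigned *values, int count)
{
    uint32_t mask = 0;
    for (int i = 0; i < count; i++)
        mask |= UINT32_C(1) << values[i];
    return mask;
}

/* Returns the label number (0 stands for the default label L0) for the
   example switch: cases 1, 4, 8 -> L1 and cases 0, 2, 5 -> L2. */
static int bit_test_dispatch(unsigned x)
{
    if (x > 8u)
        return 0;                     /* bounds check keeps the shift defined */
    uint32_t t = UINT32_C(1) << x;
    if ((t & UINT32_C(274)) != 0)     /* bits 1, 4 and 8 */
        return 1;
    if ((t & UINT32_C(37)) != 0)      /* bits 0, 2 and 5 */
        return 2;
    return 0;
}
```

Both masks have three bits set, so under the equal-probability assumption neither ordering of the two tests is preferred in this particular example.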
3.7 Tabular Methods

For completeness, we briefly describe the applicability of table-driven methods for implementing switch statements. In these techniques, the task of implementing a multiway branch is reduced to a static search problem. A good review of tabular methods, including linear search, binary search, and multiplicative binary search, is given by Spuler [Spuler94]. The major utility of tabular approaches is in reducing code size, especially when the table search implementation can be shared between multiple switches and placed in the compiler’s run-time library. In the Java virtual machine, the lookupswitch bytecode provides a tabular search implementation.

    static const int T1[4] =
        { 4, 6, 9, 11 };
    static const void *T2[4] =
        { &&L1, &&L2, &&L3, &&L4 };
    for (int i = 0; i < 4; i++)
        if (x == T1[i])
            goto *T2[i];

3.8 Decrement Chains

An interesting implementation technique implemented by some compilers, but not previously described in the literature, is the use of decrement (or subtraction) chains. On some architectures, a decrement instruction followed by a test for zero is more convenient than a comparison against an arbitrary constant. On i386 this leads to dense code, and on some RISC architectures this provides a useful instruction for the branch delay slot.

    x--;
    if (x == 0)
        goto L1;
    x--;
    if (x == 0)
        goto L2;
    x--;
    if (x == 0)
        goto L3;

    decl %eax
    je L1
    decl %eax
    je L2
    decl %eax
    je L3

On MIPS, comparisons against one and zero are cheaper than other values that require loading of integer constants into registers. Hence on this target, it is convenient for dense sequential cases to place the constant two into a register, then repeatedly compare against zero and one between subtractions of two.

This strategy may potentially benefit from the use of decrement and branch instructions on those architectures that support them.

    movl %eax, %ecx
    loop LT1
    jmp L1
    LT1: loop LT2
    jmp L2
    LT2: loop L0
    jmp L3

3.9 Index Mapping

Index mapping is a table-based technique similar to jump tables but with a wider range of applicability. Instead of indexing directly into a sparse table of labels (typically each four bytes long), this method uses a narrower lookup table of bytes to first transform the index variable into a dense zero-based range.

Switch statements with up to 256 unique labels can be handled by a byte-wide look-up table. This technique can also be extended to switch statements with even more unique labels by using two-byte shorts instead.

    unsigned int t = x - 10;
    if (t > 7)
        goto L0;
    static const char T1[8] =
        { 0, 1, 1, 2, 0, 2, 2, 1 };
    t = (unsigned char)T1[t];
    if (t == 0)
        goto L1;
    if (t == 1)
        goto L2;

    movzbl T1(%eax), %eax

Often index mapping is immediately followed by a regular table jump. This idiom, called double dispatch, has the advantage that the range of values in the first table
is known and therefore the usual bounds test on the following table jump can be omitted. On the SPARC architecture, which usually needs to multiply the table jump index by four to access the correct destination word, this multiplication can be precomputed and stored in the byte map (provided that the switch statement has fewer than 64 unique destinations).

Mathematically, index mapping provides a minimal perfect hash function when each target label is unique and there are no ranges.

On many CISC processors, such as the x86 family, binary index mapping (i.e. when there is only a single non-default label and the byte array contains only ones and zeros) can take advantage of instructions that compare directly against memory. On the i386, for example, the cmpb $0, T1(%eax) instruction avoids the usual load followed by a compare or test.

3.10 Condition Codes

Depending upon the architecture, and upon the frequency of case values, it may be beneficial to manipulate explicit Boolean expressions rather than the more common ‘short-circuit evaluation’ conditional branches.

    char t1 = (x == 6);
    char t2 = (x == 11);
    if (t1 | t2)
        goto L1;

    cmpl $6, %eax
    sete %dl
    cmpl $11, %eax
    sete %al
    orb %al, %dl
    jne L1

In the example above, the i386 architecture manipulates the condition codes as bytes, but on many RISC architectures with multiple condition code or predicate registers, they can be manipulated directly. As an example, the PowerPC architecture provides a suitable cror instruction.

3.11 Imperfect Hashing

Hashing functions provide a method of handling sparse switch statements. By quickly calculating an auxiliary value and branching or multiway branching on its value, we can ‘divide and conquer’ the original switch. A review of suitable (i.e., cheap) hash functions is given by Dietz [Dietz92]. More recently, Erlingsson et al. have described a practical method for quickly selecting a suitable bitwise-and hash function called “Multiway Radix Search Trees (MRST)” [Erlingsson96].

Modern processor architectures contain a number of instructions that perform useful hashing functions. Examples include popcount, ffs (find first set), clz (count leading zeros), and ctz (count trailing zeros).

One simple use of imperfect hashing is handling switch statements where the index expression has a multi-word type (such as long long). In such cases, the implementation could be considered a switch of the high-word, dispatching to switches on the low-word. This is especially advantageous on narrow processors such as AVR.

The MRST scheme, which extracts consecutive bits from the index value for the hash (i.e. uses hash functions of the form t = (x >> C1) & C2), is particularly well suited to architectures that have bit-field extraction instructions, such as the ARM and the PowerPC. For the special case where the field is a single bit, i.e. C2 is one, the comparison can be performed using a bit-wise ‘and’ instruction, or special-purpose bit-test instruction where available.

3.12 Perfect Hashing

In the field of switch statement code generation, perfect hashing describes any transformation of the original integer index expression that is a bijection. A bijective integer operation has the property that every output value is generated by a unique input value, enabling the existence of an inverse. Such an operation just permutes the integer values, but as mentioned above, these permuted case values may allow a more efficient implementation than the original, isomorphic form.

Examples of suitable bijective operations (permutations) on K-bit words include logical inversion (~x), integer negation (-x), addition and subtraction of an integer constant (modulo 2^K), bitwise-exclusive-or (xor)
of an integer constant, byte-swap, bitwise rotation, and even multiplication by any odd number (proof left to the reader).

    switch (x) {
    case 0:
    case 4192257:
    case 8384514:
    case 12576771:
    case 16769028:
    case 20961285:
        goto L1;
    }

    unsigned int t = x * 2049;
    if (t < 6)
        goto L1;

    movl %eax, %edx
    sall $11, %eax
    addl %edx, %eax
    cmpl $5, %eax
    jbe L1

A far more frequent application of perfect hashing is the use of rotation (instead of subtraction) to reduce the range of target case values.

    switch (x) {
    case 16: goto L1;
    case 32: goto L2;
    case 48: goto L3;
    case 64: goto L4;
    }

    int t = (x >> 4) | (x << 28);
    switch (t) {
    case 1: goto L1;
    case 2: goto L2;
    case 3: goto L3;
    case 4: goto L4;
    }

    rorl $4, %eax

Instruction set support for multimedia functions can be a rich source of suitable bijective instructions. For example, the IA-64’s mux1.rev, mux1.mix, mux1.shuf, and mux1.alt can all potentially be used to improve the performance of a switch statement implementation.

It should be clear that the modulo-2^K subtractions, used in range tests as bounds checks for table jumps and bit tests and functionally in decrement chains, are just a special case of the use of subtraction as a perfect hash function.

3.13 Safe Hashing

Somewhere between perfect (bijective) hashing and imperfect hashing is safe hashing. These are many-to-one index transformations with the fortunate property that all index values that map to the same output value have the same outcome (target label).

Examples of suitable safe hashing functions are division, bitwise logical and arithmetic shifts, bitwise-and, and bitwise-or.

    switch (x) {
    case 0:
    case 1: goto L1;
    case 2:
    case 3: goto L2;
    case 4:
    case 5: goto L3;
    }

    unsigned int t = x >> 1;
    switch (t) {
    case 0: goto L1;
    case 1: goto L2;
    case 2: goto L3;
    }

    shrl %eax

Another example of the potential benefit of safe hashing is the classic Has30Days function.

    switch (x) {
    case 4:  // April
    case 6:  // June
    case 9:  // September
    case 11: // November
        return true;
    }

    unsigned int t = x | 2;
    switch (t) {
    case 6:
    case 11:
        return true;
    }
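Whether a candidate hash is safe or perfect for a given switch can be checked by brute force. The sketch below is an illustration only: the function names are hypothetical, and the search limit passed to `count_mismatches` is an assumption of the example. It compares the hashed Has30Days against the original switch, and also evaluates the multiplication-by-2049 hash from Section 3.12 using 32-bit modular arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Original form of the Has30Days switch. */
static bool has30days_switch(unsigned x)
{
    switch (x) {
    case 4: case 6: case 9: case 11:
        return true;
    default:
        return false;
    }
}

/* Safe-hashed form: x | 2 folds 4 onto 6 and 9 onto 11. */
static bool has30days_hashed(unsigned x)
{
    unsigned t = x | 2u;
    return t == 6u || t == 11u;
}

/* Counts index values below `limit` on which the two forms disagree. */
static unsigned count_mismatches(unsigned limit)
{
    unsigned bad = 0;
    for (unsigned x = 0; x < limit; x++)
        if (has30days_switch(x) != has30days_hashed(x))
            bad++;
    return bad;
}

/* Perfect hash from Section 3.12: multiplication by the odd constant 2049
   modulo 2^32 maps the six widely spaced case values onto 0..5. */
static bool mul_hash_hits(uint32_t x)
{
    return (uint32_t)(x * UINT32_C(2049)) < 6u;
}
```

A zero result from `count_mismatches` over the chosen limit indicates that the bitwise-or transformation did not merge any value with a different outcome within that range.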
3.14 Conditional Moves

Modern processors can often avoid the penalty of conditional branching by the use of conditional move instructions.

    void *t = &&L0;
    if (x == 1)
        t = &&L1;
    if (x == 2)
        t = &&L2;
    if (x == 3)
        t = &&L3;
    goto *t;

    movl %eax, %ecx
    movl $L0, %edx
    movl $L1, %eax
    cmpl $1, %ecx
    cmove %eax, %edx
    movl $L2, %eax
    cmpl $2, %ecx
    cmove %eax, %edx
    movl $L3, %eax
    cmpl $3, %ecx
    cmove %eax, %edx
    jmp *%edx

4 Superoptimization

The term superoptimization was first coined by Henry Massalin in [Massalin87] and describes the task of finding the optimal code sequence for a single, loop-free sequence of instructions. Whilst most compiler optimizations attempt to improve code by applying heuristic transformations, superoptimization performs an exhaustive search in the space of valid instruction sequences. An outstanding example of superoptimization is the GNU superoptimizer [Granlund92]. A less rigorous but practical application of superoptimization is the task of generating instruction sequences for multiplication by integer constants, known in GCC as the routine synth_mult [Bernstein86, Lefevre01].

Unlike most traditional superoptimizers, the superoptimizer implementation described in this work needs to handle explicit control flow instructions. In addition to the strict linear instruction sequences enumerated in previous efforts, a representation of conditional branches introduces instructions with two successor instructions; one for a taken branch and one for a not-taken branch. In this work, we restrict the representation of control flow graphs to binary trees. Looping constructs (such as those in inlined tabular implementation methods) are represented by a single “macro instruction” hiding the internal backward edges from the superoptimizer. Although there may be some benefit in extending the current enumeration strategy to general directed acyclic graphs (DAGs), the restriction to binary trees helps limit the combinatorial explosion associated with exhaustive search.

The multiway branch superoptimizer works by a divide-and-conquer approach, repeatedly simplifying or transforming the remaining switch cases that need to be implemented. At each node in the search tree, the problem state consists of the set of case nodes that remain to be considered, and the set of potential values the index expression may have. At the start of the recursive search, the root of the tree, the set of case values are those of the original switch, and the set of potential values are all 2^K values representable in the index type. The terminal nodes at the leaves of the tree are unconditional jump nodes or table jump nodes. In theory, the value range propagation pass of a compiler may be able to refine the initial set of potential index values.

Some example solution trees enumerated by the superoptimizer are given in Figure 1. These depict some of the possible balanced binary tree implementations for a switch statement with four unique case values.

5 Cost Model

An important aspect of superoptimization is the objective function that is used to assess the merit of each solution. When optimizing for code size, a size in bytes can be associated with each node in a solution tree, and the total code size is simply the sum of each node size in the tree. When optimizing for performance, a cycle count can be associated with each edge in the tree. For conditional branches, different timings can be associated with the taken and the not-taken edges. The worst-case performance is calculated by determining the maximum total edge cost along any path from the root to a leaf node. By associating a frequency with each original case value, we can determine the average or expected cost of a multiway branch. The expected average cost of a switch statement is the sum over all leaves of the probability of the leaf multiplied by the cost of the leaf. In practice, the use of floating point probabilities can be avoided by using integer
frequencies. These frequencies can be taken from profiling information or assumed to be equal when profile information is not available.

Figure 1: Four alternate balanced binary tree solutions for implementing a unique switch statement with case values one, two, three, and four. Digits represent comparison instructions; relational operators represent conditional branch instructions with the taken edge below to the left and the not-taken edge below to the right.

In practice, rather than purely minimize average performance, worst-case performance, or code size, it is often more useful to use Pareto frontier methods to perform multiobjective optimization. If two implementations have the same performance, then the shorter solution is preferable. Often when optimizing for size, we’re not interested in the absolute shortest code if the performance is terrible. Likewise, when optimizing for performance, solutions that use an unrealistic amount of memory may be inappropriate. The restriction of jump tables based on density can be seen as an instance of this. Likewise, if the training data used to generate profile edge data doesn’t cover all of the cases, the performance of cases with a zero frequency is not used in estimating the average case performance. Whenever two implementations have the same average case behavior, we can avoid potential issues with pathological input values by preferring the one with the better worst case performance.

An interesting additional cost metric not considered in the present study is the optimization of code to reduce power consumption [Tiwari94]. Much like microbenchmarking, the cost of each node could be parameterized by chip-level simulation.

6 Microbenchmarking

In order to both parameterize the cost model described previously and confirm the predictions made by the superoptimizer, it is possible to measure the performance of possible code sequences on real hardware.

The use of microbenchmarking is even applicable to targets where real hardware isn’t available. The methodology can be used with virtual machines, such as a JVM, or with simulators, such as GNU sim or SIMH. One application of this technique was to evaluate the benefits of the VAX’s casel instruction by running VAX NetBSD on SIMH.

The advanced branch prediction and memory caching strategies performed by modern processors can often significantly skew and bias the results of microbenchmarks, where memory (for jump tables) is often held in cache, and where hot branches can often be predicted from history tables. To avoid (or more accurately to account for) these effects, explicit cache flushing can be used [Whaley08].

7 Benchmark Corpus

To evaluate whether the advanced switch statement implementations described above are applicable on real-world code, a survey of existing switch statement usage was undertaken. By using a specially instrumented version of GCC, the switch statements were extracted from
112 • A Superoptimizer Analysis of Multiway Branch Code Generation

the source code of 5,953 packages of the Debian Linux distribution.

These 5,953 Debian packages contained, or more accurately compiled,
522,681 switch statements. These switch statements contained 3,024,896
case ranges, giving a mean of 5.79 case ranges per switch statement. A
case range is defined as a consecutive sequence of case values that map
to the same non-default target label. The largest switch statement
contained 4,541 case ranges. Table 1 summarizes the distribution of
switch sizes, as measured by the number of case ranges.

    Range      Count   Fraction
    1          51804     9.91%
    2          62757    12.01%
    3          70489    13.49%
    4          49311     9.43%
    5–8        80043    15.31%
    9–16       66019    12.63%
    17–32      47947     9.17%
    33–64      31998     6.12%
    > 64       62313    11.92%

    Table 2: Distribution of Switch Ranges
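As a minimal sketch of the case range definition above, the following C function counts the case ranges in a list of case values that is sorted by value. The `case_entry` type, the `count_case_ranges` name, and the use of small integers as label identifiers are illustrative assumptions; they are not taken from the instrumented GCC used in the survey.

```c
#include <stddef.h>

/* Hypothetical representation: one entry per non-default case value. */
struct case_entry {
    long value;   /* the case value */
    int  label;   /* identifier of the non-default target label */
};

/* Count case ranges: maximal runs of consecutive values that share a
   label. Entries must be sorted by value and contain no duplicates. */
size_t count_case_ranges(const struct case_entry *e, size_t n)
{
    size_t ranges = 0;
    for (size_t i = 0; i < n; i++) {
        /* An entry extends the previous range only if it is the very
           next integer and branches to the same label. */
        int extends_previous = i > 0
            && e[i].value == e[i - 1].value + 1
            && e[i].label == e[i - 1].label;
        if (!extends_previous)
            ranges++;
    }
    return ranges;
}
```

Under this definition, a switch whose cases 20, 21, 23 and 24 all branch to one label has two case ranges (20–21 and 23–24), because the missing value 22 breaks the run.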

    # Cases    Count   Fraction
    1          61193    11.71%
    2         115980    22.19%
    3         103623    19.83%
    4          65671    12.56%
    5–8       102898    19.69%
    9–16       48575     9.29%
    17–32      17462     3.34%
    > 32        7279     1.39%

    Table 1: Distribution of Switch Sizes

As shown by these figures, the vast majority of switch statements
contain relatively few case ranges. Over a third have only one or two
cases, and fewer than two percent have more than 32 cases. However, the
distribution has a long tail with 19 switch statements having more than
a thousand case ranges.

Of the 522,681 switch statements, 424,600 (81.24%) are unique, where
every case range branches to a distinct destination label. These
‘unique’ switch statements don’t benefit from bit tests or safe hashing
methods. There were 78,925 (15.10%) binary switch statements that had a
single destination label other than the default. There were 94,523
(18.08%) switch statements that had non-singleton ranges, i.e.
consecutive case values that branched to the same non-default
destination.

Table 2 is a summary of the ranges or spans of the observed switch
statements. These figures show that over eighty percent of all switch
statements span a range of less than 32 values. This reveals that
techniques such as index mapping and bit testing should frequently be
applicable.

Table 3 provides an overview of switch statement density. Rather than
the more usual (floating point) measure of density, calculated as n/r
where r is the range of values and n the number of values, we instead
report its integral reciprocal sparsity, which is defined as ⌈r/n⌉.
Using this measure, all switch statements with a density greater than
33⅓% have a sparsity less than or equal to 3.

    Sparsity    Count   Fraction
    1          286427    54.80%
    2           85380    16.34%
    3           24429     4.67%
    4           16701     3.20%
    5           16081     3.08%
    6           10374     1.98%
    7            7509     1.44%
    8            5301     1.01%
    9            5459     1.04%
    10           3271     0.63%
    > 10        61749    11.81%

    Table 3: Distribution of Switch Densities

These figures show that extending the range of applicability of jump
tables by decreasing the density threshold has rapidly diminishing
returns. GCC currently uses a density threshold of 10% when optimizing
for speed, and 33% when optimizing for size. LLVM currently uses a more
conservative density threshold of 40%.

8 Results

An example x86 solution found by the described superoptimizer is given
below.

    switch (x) {
    case 20:
    case 21:
    case 23:
    case 24: goto L1;
    }

        cmpl $22, %eax      # 22 is the only hole in 20..24
        je   L0             # L0 is the default label
        subl $20, %eax
        cmpl $4, %eax       # unsigned compare: (x - 20) > 4
        ja   L0
        jmp  L1

A more interesting example is taken from compiling the Linux kernel on
IA-64. A frequent idiom is to switch on powers of two.

    switch (x) {
    case 1: goto L1;
    case 2: goto L2;
    case 4: goto L3;
    case 8: goto L4;
    ...
    case 16384: goto L15;
    case 32768: goto L16;
    }

The fastest solution found uses count-trailing-zeros as an imperfect
hash function, which in turn makes use of the Itanium’s popcount
instruction.

    /* Note: __builtin_ctz(0) is undefined in GNU C. */
    int t = __builtin_ctz(x);
    static const void *T1[32] = {
      &&LT1, &&LT2, &&LT3, ...
      &&L0, &&L0, &&L0, &&L0 };
    goto *T1[t];
    LT1:
      if (x == 1)
        goto L1;
      goto L0;
    LT2:
      if (x == 2)
        goto L2;
      goto L0;
    ...

The switch statement lowering code in the current version of LLVM
differs from that used by the current version of GCC in several minor
respects. Instead of making the decision whether to use a table jump or
sequence of bit tests once for the whole switch statement, LLVM uses a
divide-and-conquer approach, allowing a sequence of binary comparisons
to lead to leaf nodes that perform table jumps or bit tests to
implement the remaining comparisons. This allows more than one jump
table to be used to implement a single switch statement. The threshold
for switching from comparisons to jump tables is slightly lower for
LLVM, at a maximum of three comparisons vs. four for GCC. LLVM also has
a stricter density requirement for jump tables than GCC.

    Method                 Count   Fraction
    binary trees          202633    38.77%
      (without ranges)    163548    31.29%
      (with ranges)        39085     7.48%
    table jumps           151893    29.06%
      (without delta)      99932    19.12%
      (with delta)         51961     9.94%
    sequential tests      153211    29.31%
    bit tests              14944     2.86%
      (without delta)      11575     2.21%
      (with delta)          3369     0.64%

    Table 4: GCC Implementation Statistics

Of the 522,681 switch statements in the survey corpus, 5,135 have two
singleton case values that branch to the same label, and of these,
1,564 (0.30%) have case values that differ by a single bit and are
therefore suitable for ior safe hashing.

Of the GCC table jumps, 1,056 (0.20%) are binary and have a single
non-default label and could therefore be more efficiently implemented
with a Boolean index map.

When bit testing, when the number of bits set in the immediate constant
is more than half of the available bits, on some architectures it may
be cheaper to test the complement bit pattern and reverse the sense of
the conditional branch. Analysis of the Debian data set reveals that
this occurs extremely infrequently (62 times for binary bit tests, and
516 times for multiway bit tests).

Of the switch statements implemented by GCC using binary trees, 29,155
(5.58% of the total) have more than four case ranges, but failed to be
implemented by a table jump for density reasons. We shall refer to this
set as the difficult subset. The largest difficult switch statement,
found in the conversions.c source file of Debian’s msort-8.42 package,
had 1,080 case ranges, covering 1,202 case values branching to 320
unique destination labels.
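The criteria that define the difficult subset can be sketched in C as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the thresholds are the figures quoted in this paper (a 10% density threshold when optimizing for speed, treated as one third when optimizing for size, and more than four case ranges before a jump table is considered), and the sketch ignores bit tests and any further conditions in the real GCC logic.

```c
#include <stdbool.h>

/* Sparsity as defined in Section 7: ceil(r / n), where r is the range
   spanned by the case values (max - min + 1) and n is the number of
   case values. n must be non-zero. */
unsigned long sparsity(unsigned long r, unsigned long n)
{
    return r / n + (r % n != 0);
}

/* Assumed thresholds: density >= 10% (sparsity <= 10) for speed,
   density >= 1/3 (sparsity <= 3) for size. */
bool dense_enough(unsigned long r, unsigned long n, bool optimize_size)
{
    unsigned long max_sparsity = optimize_size ? 3 : 10;
    return sparsity(r, n) <= max_sparsity;
}

/* A switch is "difficult" in the sense used above when it has more
   than four case ranges yet is too sparse for a single jump table. */
bool is_difficult(unsigned long n_ranges, unsigned long r,
                  unsigned long n, bool optimize_size)
{
    return n_ranges > 4 && !dense_enough(r, n, optimize_size);
}
```

For the powers-of-two switch shown earlier (16 case values spanning 1 to 32768), the sparsity is 32768/16 = 2048, so the sketch classifies it as difficult; the four-case x86 example (values 20 to 24) has sparsity ⌈5/4⌉ = 2 and is not.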
Of this difficult subset, 1,806 (0.35%) were “powers of two” where each
case value had only a single bit set. There were also 3,083 (0.59%)
switch statements where every case value had trailing zero bits, and in
2,066 (0.40%) of these, rotating by the number of common trailing zero
bits reduced the range (and therefore increased the density) enough to
be implemented by a table jump.

Index mapping turns out to be a particularly effective implementation
strategy for dealing with these 29,155 difficult cases. Using GCC’s
current 10% density criterion, which permits using up to 40 bytes of
memory per case value (on a 32-bit machine), index mapping can extend
the applicability of table jumps to another 15,346 (2.94%) switch
statements (or over half of the difficult cases). Additionally, of the
151,893 switch statements currently implemented by table jumps, 77,958
(14.92%) would have reduced memory consumption using “double dispatch”
index mapping. Likewise, 19,790 (3.79%) of the current table jumps have
fewer than 5 unique destinations, allowing the table jump to be
eliminated completely by index mapping followed by sequential
comparisons or a binary search tree.

One approach to improving GCC’s implementations of the difficult subset
is to use a binary tree of comparisons to split/subset the original
switch statement into partitions that are dense enough to be
implemented by table jumps. Analysis of the difficult set reveals that
15,131 (2.89%) contain one (or more) subranges that contain enough case
ranges and are dense enough to satisfy GCC’s table jump criteria.
Table 5 gives the histogram of the number of table-jump clusters in the
difficult subset. The maximum number of dense clusters observed in a
switch statement is 28, found both in the inventor_2.1.5 package’s
SoFieldConvertors.c++ and in the netrik_1.15.3 package’s
parse-syntax.c.

    Clusters    Count   Fraction
    0           14024    48.10%
    1           12599    43.21%
    2            2005     6.88%
    3             356     1.22%
    4              67     0.23%
    5–8            85     0.29%
    9–16           10     0.03%
    > 16            9     0.03%

    Table 5: Dense Clusters in Difficult Subset

An area of active research in switch statement lowering is how best to
select the pivot element to partition switch statements to take
advantage of these dense clusters. An easy-to-implement approach is to
use the existing mechanism of selecting the median case value (or
frequency-weighted median) as is traditionally done for constructing
binary search trees. This has the unwanted effect of occasionally
splitting dense clusters, reducing the number of potential jump tables.
Bernstein [Bernstein85] suggested splitting at the largest gap between
case values. Korobeynikov [Korobeynikov07] implemented a
density-balancing metric used to select a reasonable pivot element in
LLVM. A common failing with these two “greedy” approaches is that they
inadvertently skew binary search trees even when there are no dense
clusters to be found. These improve the performance of the rare (15,131
or 2.89%) binary trees that contain dense clusters at the expense of
the far more common (187,502 or 35.87%) binary trees that should
attempt to be suitably balanced independently of their case values’
integer values. A better global approach has been proposed by Delorie
[Delorie04] that initially identifies the dense clusters and treats
these much like case ranges, as leaves (or potentially internal nodes)
when building a balanced binary search tree. These remain heuristics,
as optimal switch statement lowering may occasionally require
intentional splitting of dense clusters, much like testing frequent
cases prior to table jumps can improve average case performance.

9 Future Work

The C# programming language extends the switch statement to allow
switching on strings, where each case contains a constant string
literal. The implementation of string switches is analogous to the
classic computer science problem of keyword recognition. Additional
strategies include perfect and imperfect hashing (as used by the GNU
utility gperf), Patricia tries, or finite state machines (as used by
the UNIX utilities flex and lex).

In this work, the implementation of a switch statement has been equated
with that of a multiway branch. However, for many uses of the switch
statement in real code, it is possible to avoid branching altogether
and replace the switch with one or more table look-ups. For example,
the Has30Days example presented earlier can be implemented as the
following:
    if ((unsigned)x > 11)
      return 0;
    static const int T[12] =
      { 0, 0, 0, 0, 1, 0,
        1, 0, 0, 1, 0, 1 };
    return T[x];

This mechanism can be extended to more diverse targets through the use
of unification.

    switch (x) {
    case 0: foo(2,4); break;
    case 1: foo(8,3); break;
    default: foo(5,9); break;
    }

. . . can be implemented as. . .

    int t1, t2;
    /* The unsigned compare also sends negative x to the default. */
    if ((unsigned)x < 2) {
      static const int T1[2] = { 2, 8 };
      static const int T2[2] = { 4, 3 };
      t1 = T1[x];
      t2 = T2[x];
    } else {
      t1 = 5;
      t2 = 9;
    }
    foo(t1,t2);

Another possible area of research is to combine the tabular
implementation methods described earlier with dynamic data structures
such as splay-trees, self-balancing binary search trees, or
move-to-front lists. The advantage of such a hybrid technique is that
the performance of the implementation could dynamically adapt to the
frequency distribution of the index expression.

Finally, to borrow from Newton’s third law, “For every optimization,
there is an equal and opposite pessimization.” For each of the
transformations described in this article, it would be nice to
recognize the original higher-level switch statement from its optimized
(lowered) equivalent. Programmers by necessity often attempt to
hand-optimize code to the current generation of computer hardware. The
ability of a compiler to automatically undo these (premature)
optimizations and then re-lower them using the target’s optimizer could
potentially avoid the performance problems frequently associated with
legacy code.

10 Acknowledgments

The author would like to thank OpenEye Scientific Software for the
freedom to investigate code generation issues as a hobby; IBM, HP, SGI,
Compaq, Apple, SUN, and Intel for providing hardware; and the GCC
community for inspiring this research. Particularly, Jan Hubička, Andi
Kleen, DJ Delorie, Martin Jambor, and Kazu Hirata have discussed the
issue of GCC’s switch statement generation. The author is especially
grateful to Martin Michlmayr for actually running the instrumented GCC
over the Debian Linux distribution’s source code repository.

References

[Bernstein85] Robert L. Bernstein, “Producing Good Code for the Case
Statement,” Software – Practice and Experience, Vol. 15, No. 10,
pp. 1021–1024, October 1985.

[Bernstein86] Robert L. Bernstein, “Multiplication by Integer
Constants,” Software – Practice and Experience, Vol. 16, No. 7,
pp. 641–652, July 1986.

[Delorie04] DJ Delorie, “Case Clustering,”
gcc-patches:2004-07/msg01234, July 2004.

[Dietz92] H.G. Dietz, “Coding Multiway Branches Using Customized Hash
Functions,” ECE Technical Report, School of Electrical Engineering,
Purdue University, 1992.

[Erlingsson96] Ulfar Erlingsson, Mukkai Krishnamoorthy and T.V. Raman,
“Efficient Multiway Radix Search Trees,” Information Processing
Letters, Vol. 60, No. 3, pp. 115–120, November 1996.

[Granlund92] Torbjörn Granlund and Richard Kenner, “Eliminating
Branches using a Superoptimizer and the GNU C Compiler,” Proceedings of
the Conference on Programming Language Design and Implementation
(PLDI), ACM SIGPLAN Notices, Vol. 27, No. 7, pp. 341–352, July 1992.

[Hennessy82] J.L. Hennessy and N. Mendelsohn, “Compilation of the
Pascal Case Statement,” Software – Practice and Experience, Vol. 12,
pp. 879–882, September 1982.
[Lefevre01] Vincent Lefèvre, “Multiplication by an Integer Constant,”
citeseer:491491.html, 2001.

[Kannan94] Sampath Kannan and Todd A. Proebsting, “Correction to
‘Producing Good Code for the Case Statement’,” Software – Practice and
Experience, Short Communication, Vol. 24, No. 2, pp. 233, February
1994.

[Korobeynikov07] Anton Korobeynikov, “Improved Switch Lowering for the
LLVM Compiler System,” Proceedings of the 2007 Spring Young Researchers
Colloquium on Software Engineering (SYRCoSE’2007), Moscow, Russia, May
2007.

[Massalin87] Henry Massalin, “Superoptimizer: A Look at the Smallest
Program,” ACM SIGPLAN Notices, Vol. 22, No. 10, pp. 122–126, October
1987.

[Mueller02] Frank Mueller and David B. Whalley, “Avoiding Unconditional
Jumps by Code Replication,” Proceedings of the Conference on
Programming Language Design and Implementation (PLDI), ACM SIGPLAN
Notices, Vol. 27, No. 7, pp. 322–330, July 1992.

[Sale81] Arthur Sale, “The Implementation of Case Statements in
Pascal,” Software – Practice and Experience, Vol. 11, pp. 929–942,
1981.

[Sayle01] Roger A. Sayle, “Optimize Tablejumps for Switch Statements,”
gcc-patches:2001-10/msg01234, October 2001.

[Sayle03a] Roger A. Sayle, “Implement Switch Statements with Bit
Tests,” gcc-patches:2003-01/msg01733, January 2003.

[Sayle03b] Roger A. Sayle, “Tune Jump Tables for -Os,”
gcc-patches:2003-08/msg02054, August 2003.

[Spuler94] David A. Spuler, “Compiler Code Generation for Multiway
Branch Statements as a Static Search Problem,” Technical Report,
Department of Computer Science, James Cook University, Australia,
January 1994.

[Tiwari94] Vivek Tiwari, Sharad Malik and Andrew Wolfe, “Power Analysis
of Embedded Software: A First Step Towards Software Power
Minimization,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, Vol. 2, No. 4, pp. 437–445, 1994.

[Warren02] Henry S. Warren, Jr., “Hacker’s Delight,” Addison-Wesley
Publishers, 2002.

[Whaley08] R. Clint Whaley and Anthony M. Castaldo, “Achieving Accurate
and Context-Sensitive Timing for Code Optimization,” Software –
Practice and Experience, Accepted for publication, 2008.

[Wienskoski06] Edmar Wienskoski, “Switch Statement Case Reordering
FDO,” GCC Summit 2006, Ottawa, Canada, pp. 235–241, June 2006.
