Proceedings of the
GCC Developers’ Summit
Review Committee
Andrew J. Hutton, Steamballoon, Inc., Linux Symposium,
Thin Lines Mountaineering
Ben Elliston, IBM
Janis Johnson, IBM
Mark Mitchell, CodeSourcery
Toshi Morita
Diego Novillo, Google
Gerald Pfeifer, Novell
Ian Lance Taylor, Google
C. Craig Ross, Linux Symposium
Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights
to all as a condition of submission.
A Superoptimizer Analysis of Multiway Branch Code Generation
Roger Anthony Sayle
OpenEye Scientific Software
roger@eyesopen.com
We shall refer to the number of case values, |FV|, and the number of case ranges, |FR|.

3 Switch Lowering

This section lists several possible compilation strategies for implementing a multiway branch. The first few are well known and commonly employed by existing compilers, whilst the later strategies are perhaps more novel. For each implementation strategy, the high-level C (possibly using GCC extensions) and the equivalent Intel x86 assembler are given. We use the conventions that the index variable is x, the default label is called L0, and the case target labels are L1 ... Ln. In this work, we assume the target labels are opaque, and have no useful mathematical relationships (though some early compilers could rearrange code to simplify the task of switch implementation).

3.1 Unconditional Branch

The simplest of switch statement implementations and one of the fundamental building blocks is the unconditional jump. This is used for switch statements that contain only a default outcome, or during switch lowering after all other cases have been handled.

    cmpl $1, %eax
    je L1
    testl %eax, %eax
    je L2

On many architectures, the performance of integer comparisons may be dependent on the value being compared. As shown in the assembly example above, the i386 provides special instructions for comparison against zero. On MIPS, comparisons against zero and one are cheaper than comparisons against signed 16-bit constants, which are cheaper than general 32-bit constants.

3.3 Jump Tables

A common method for implementing a switch statement is via an indirect jump (also known as a table jump). This uses the switch variable to index an array of destination addresses, and then branch to that location.

In the general case, to keep the size of the jump table reasonable, the minimum case value needs to be subtracted, and the result compared against the table range (as an array bounds check) before indexing the jump table. Fortunately, the minimum case value is often one or two allowing the subtraction to be omitted for a minor increase in the size of the jump table [Sayle01].
2008 GCC Developers’ Summit • 105
    unsigned int t = x - 10;
    if (t > 5)
        goto L0;
    static const void *T1[6] =
        { &&L1, &&L2, &&L3,
          &&L4, &&L5, &&L6 };
    goto *T1[t];

    subl $10, %eax
    cmpl $5, %eax
    ja L0
    jmp *T1(,%eax,4)
    T1: .long L1, L2, L3
        .long L4, L5, L6

An obvious free parameter and difference between compilers is the density threshold used to decide whether to use a table jump or not. When the case values are sequentially consecutive values that each jump to distinct labels the decision to use a table jump is obvious. However, as the density (approximately the number of case values divided by the difference between their upper and lower bounds) decreases, the memory to performance trade-off eventually reaches a critical limit where the additional cost in bytes provides insufficient benefit in clock cycles.

A less obvious difference is the lack of uniformity in the way that switch table density is calculated. One obvious approach is to use the number of case values, |FV|, as the numerator. However, as mentioned in the next section, ranges of consecutive case values that branch to the same label can be efficiently implemented as range tests that are independent of the number of case values they consider. Hence, density may correlate better with the number of case ranges than with the number of case values. The approach currently used by GCC is to count non-singleton case ranges as twice as expensive as singleton case ranges. This provides a better measure of density for comparison against other implementation strategies. In practice, the real non-singleton to singleton cost ratio is less than two, even approaching one for switches consisting of consecutive adjoining ranges.

3.4 Range Tests

When two or more consecutive integer values branch to the same target label, it is possible to test for a range of values rather than each value independently. One obvious possibility is via a pair of conditional branches such as (x>=lo) && (x<=hi) where lo and hi are the lower and upper bounds of the range respectively. An often more efficient way of doing this is to subtract the lower bound, and then perform an unsigned comparison against the integer constant hi-lo which achieves the same thing with a single comparison. Obviously, if lo is zero the subtraction can be omitted. Warren [Warren02] also mentions the efficient use of shifts to perform range tests for suitable high and low bounds.

    if (x < 8)
        goto L1;

    unsigned int t = x >> 3;
    if (t == 0)
        goto L1;

3.5 Balanced Binary Trees

A more efficient use of compare and branch comparisons than the sequential testing described above is to perform a binary search on the target case values. By comparing against a (median) pivot value, the case values can be partitioned into those values less than or greater than the pivot. This reduces the time to perform a multiway branch from O(n) with the sequential method to O(log n), where n is the number of case values.

Each comparison in such a binary search refines the upper or lower bounds of the remaining partitions. Whilst many compilers perform only signed or only unsigned comparisons depending upon the type of the switch expression, it is potentially beneficial to use both forms in a single search. Another more commonly implemented improvement is to use the established upper and lower bounds to avoid comparisons. For example, if a binary search has already confirmed that the index value is greater than three, but less than five, there is no point in explicitly comparing against the value four. A naïve implementation of binary tree lowering using these bounds may inadvertently emit conditional branches to unconditional branches and similar inefficient idioms. These may be avoided either by using the usual CFG jump optimizations as a clean-up step, or by improvements in the binary tree lowering code.

Depending upon the architecture, it may be beneficial (or not) to perform three-way branches at each partition step, reusing the condition codes (or register) from a single comparison against an integer constant for both
an equality test and an ordering test. Currently, GCC always performs three-way branches, whilst LLVM always performs two-way (binary) branching. Even with three-way branching there is the decision of whether to perform the equality/inequality comparison before or after the ordering comparison. Unless the pivot value occurs frequently, it is theoretically better to perform the partitioning first and incur the cost of the equality comparison on only one of the following paths.

On many processors there is a significant asymmetry in the cycle counts for taken vs. not-taken branches (occasionally even when the outcome is correctly predicted). This difference means that a perfectly balanced binary tree will not have the best worst-case behavior. Instead appropriately skewed binary trees can lead to better performance. This is one reason why the sequential testing strategy described above is sometimes competitive for switches with four or more values.

    switch (x) {
    case 100: goto L1;
    case 200: goto L2;
    case 300: goto L3;
    case 400: goto L4;
    }

    if (x == 200)
        goto L2;
    if (x > 200) {
        if (x == 300)
            goto L3;
        if (x == 400)
            goto L4;
    }
    else {
        if (x == 100)
            goto L1;
    }

    cmpl $200, %eax
    je L2
    jbe LT1
    cmpl $300, %eax
    je L3
    cmpl $400, %eax
    je L4
    jmp L0
    LT1: cmpl $100, %eax
    je L1

An often overlooked aspect of performing binary comparisons both in sequential comparisons and in binary search trees is that it can sometimes be advantageous to compare against integer values (and ranges) not in the case set, V, i.e. those that map to the default label md. For example, when ranges on both adjoining sides of a ‘gap’ branch to the same destination, testing for the gap allows the adjoining ranges to be merged. Likewise for binary switch tables, it can occasionally be easier to identify the (complementary) default values than the (positive) case values.

3.6 Bit Tests

A relatively recent innovation in switch statement implementation is the use of bit tests [Sayle03a]. This technique allows several case values that share the same label to be tested simultaneously by representing each case by its own bit.

This technique requires that the range of case values not be larger than the number of bits in a machine word. Like jump tables, it is often necessary to perform a subtraction followed by a bounds check to eliminate values outside of the allowed range.

    if (x > 8)
        goto L0;
    unsigned int t = 1 << x;
    // cases 1, 4, and 8
    if ((t & 274) != 0)
        goto L1;
    // cases 0, 2, and 5
    if ((t & 37) != 0)
        goto L2;
    goto L0;

    cmpl $8, %eax
    ja L0
    movl %eax, %ecx
    movl $1, %eax
    sall %cl, %eax
    testl $274, %eax
    jne L1
    testb $37, %al
    jne L2

The performance of bit testing can be improved by testing the most frequent bit patterns (masks) first. If all cases are equally probable, this equates to testing the masks that have more bits set before those that have only a few bits (or just a single one) set.
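The bit-test example above can also be written as a small self-contained C function, which makes the bounds check and the two masks easy to exercise. This is only an illustrative sketch: the name bit_test_dispatch and the integer return codes (standing in for the labels L0, L1 and L2) are assumptions made for this example, not code taken from GCC. The masks 274 and 37 are the ones used in the example above.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 for cases {1, 4, 8}, 2 for cases
   {0, 2, 5}, and 0 (standing in for the default label L0) otherwise. */
int bit_test_dispatch(uint32_t x)
{
    /* Bounds check first: only bits 0..8 are meaningful, and a shift
       by 32 or more would be undefined behaviour in C. */
    if (x > 8)
        return 0;

    uint32_t t = UINT32_C(1) << x;

    /* 274 = (1 << 1) | (1 << 4) | (1 << 8): cases 1, 4 and 8. */
    if (t & UINT32_C(274))
        return 1;

    /* 37 = (1 << 0) | (1 << 2) | (1 << 5): cases 0, 2 and 5. */
    if (t & UINT32_C(37))
        return 2;

    return 0;
}
```

Both masks here have three bits set, so under the equal-probability assumption either test order costs the same on average; the ordering remark only matters when the masks differ in population count or the case frequencies are known to be uneven.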
is known and therefore the usual bounds test on the following table jump can be omitted. On the SPARC architecture, which usually needs to multiply the table jump index by four to access the correct destination word, this multiplication can be precomputed and stored in the byte map (provided that the switch statement has fewer than 64 unique destinations).

Mathematically, index mapping provides a minimal perfect hash function when each target label is unique and there are no ranges.

On many CISC processors, such as the x86 family, binary index mapping (i.e. when there is only a single non-default label and the byte array contains only ones and zeros) can take advantage of instructions that compare directly against memory. On the i386, for example, the cmpb $0, T1(%eax) instruction avoids the usual load followed by a compare or test.

3.10 Condition Codes

Depending upon the architecture, and upon the frequency of case values, it may be beneficial to manipulate explicit Boolean expressions rather than the more common ‘short-circuit evaluation’ conditional branches.

    char t1 = (x == 6);
    char t2 = (x == 11);
    if (t1 | t2)
        goto L1;

    cmpl $6, %eax
    sete %dl
    cmpl $11, %eax
    sete %al
    orb %al, %dl
    jne L1

In the example above, the i386 architecture manipulates the condition codes as bytes, but on many RISC architectures with multiple condition code or predicate registers, they can be manipulated directly. As an example, the PowerPC architecture provides a suitable cror instruction.

3.11 Imperfect Hashing

Hashing functions provide a method of handling sparse switch statements. By quickly calculating an auxiliary value and branching or multiway branching on its value, we can ‘divide and conquer’ the original switch. A review of suitable (i.e., cheap) hash functions is given by Dietz [Dietz92]. More recently, Erlingsson et al. have described a practical method for quickly selecting a suitable bitwise-and hash function called “Multiway Radix Search Trees (MRST)” [Erlingsson96].

Modern processor architectures contain a number of instructions that perform useful hashing functions. Examples include popcount, ffs (find first set), clz (count leading zeros), and ctz (count trailing zeros).

One simple use of imperfect hashing is handling switch statements where the index expression has a multi-word type (such as long long). In such cases, the implementation could be considered a switch of the high-word, dispatching to switches on the low-word. This is especially advantageous on narrow processors such as AVR.

The MRST scheme that extracts consecutive bits from the index value for the hash (i.e. hash functions of the form t = (x >> C1) & C2) is particularly well suited to architectures that have bit-field extraction instructions, such as the ARM and the PowerPC. For the special case where the field is a single bit, i.e. C2 is one, the comparison can be performed using a bit-wise ‘and’ instruction, or special-purpose bit-test instruction where available.

3.12 Perfect Hashing

In the field of switch statement code generation, perfect hashing describes any transformation of the original integer index expression that is a bijection. A bijective integer operation has the property that every output value is generated by a unique input value, enabling the existence of an inverse. Such an operation just permutes the integer values, but as mentioned above, these permuted case values may allow a more efficient implementation than its original isomorphism.

Examples of suitable bijective operations (permutations) on K-bit words include logical inversion (~x), integer negation (-x), addition and subtraction of an integer constant (modulo 2^K), bitwise-exclusive-or (xor)
3.14 Conditional Moves

Modern processors can often avoid the penalty of conditional branching by the use of conditional move instructions.

    void *t = &&L0;
    if (x == 1)
        t = &&L1;
    if (x == 2)
        t = &&L2;
    if (x == 3)
        t = &&L3;
    goto *t;

    movl %eax, %ecx
    movl $L0, %edx
    movl $L1, %eax
    cmpl $1, %ecx
    cmove %eax, %edx
    movl $L2, %eax
    cmpl $2, %ecx
    cmove %eax, %edx
    movl $L3, %eax
    cmpl $3, %ecx
    cmove %eax, %edx
    jmp *%edx

4 Superoptimization

The term superoptimization was first coined by Henry Massalin in [Massalin87] and describes the task of finding the optimal code sequence for a single, loop-free sequence of instructions. Whilst most compiler optimizations attempt to improve code by applying heuristic transformations, superoptimization performs an exhaustive search in the space of valid instruction sequences. An outstanding example of superoptimization is the GNU superoptimizer [Granlund92]. A less rigorous but practical application of superoptimization is the task of generating instruction sequences for multiplication by integer constants, known in GCC as the routine synth_mult [Bernstein86, Lefevre01].

Unlike most traditional superoptimizers, the superoptimizer implementation described in this work needs to handle explicit control flow instructions. In addition to the strict linear instruction sequences enumerated in previous efforts, a representation of conditional branches introduces instructions with two successor instructions; one for a taken branch and one for a not-taken branch. In this work, we restrict the representation of control flow graphs to binary trees. Looping constructs (such as those in inlined tabular implementation methods) are represented by a single “macro instruction” hiding the internal backward edges from the superoptimizer. Although there may be some benefit in extending the current enumeration strategy to general directed acyclic graphs (DAGs), the restriction to binary trees helps limit the combinatorial explosion associated with exhaustive search.

The multiway branch superoptimizer works by a divide-and-conquer approach, repeatedly simplifying or transforming the remaining switch cases that need to be implemented. At each node in the search tree, the problem state consists of the set of case nodes that remain to be considered, and the set of potential values the index expression may have. At the start of the recursive search, the root of the tree, the set of case values are those of the original switch, and the set of potential values are all 2^K values representable in the index type. The terminal nodes at the leaves of the tree are unconditional jump nodes or table jump nodes. In theory, the value range propagation pass of a compiler may be able to refine the initial set of potential index values.

Some example solution trees enumerated by the superoptimizer are given in Figure 1. These depict some of the possible balanced binary tree implementations for a switch statement with four unique case values.

5 Cost Model

An important aspect of superoptimization is the objective function that is used to assess the merit of each solution. When optimizing for code size, a size in bytes can be associated with each node in a solution tree, and the total code size is simply the sum of each node size in the tree. When optimizing for performance, a cycle count can be associated with each edge in the tree. For conditional branches, different timings can be associated with the taken and the not-taken edges. The worst-case performance is calculated by determining the maximum cumulative edge cost from the root to any leaf node. By associating a frequency with each original case value, we can determine the average or expected cost of a multiway branch. The expected average cost of a switch statement is the sum over all leaves of the probability of the leaf multiplied by the cost of the leaf. In practice, the use of floating point probabilities can be avoided by using integer
Figure 1: Four alternate balanced binary tree solutions for implementing a unique switch statement with case values one, two, three, and four. Digits represent comparison instructions; relational operators represent conditional branch instructions with the taken edge below to the left and the not-taken edge below to the right.
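The performance side of the cost model in Section 5 can be sketched in C as two small recursive walks over a solution tree. This is a minimal illustration under stated assumptions, not the superoptimizer's own code: the struct node layout, the function names worst_case_cost and weighted_cost, and the use of integer execution counts on the leaves in place of probabilities are all hypothetical choices made for this example. The cost of a leaf is taken here to be the sum of the edge costs on the path from the root.

```c
#include <stddef.h>

/* Hypothetical solution-tree node.  A leaf has both children NULL. */
struct node {
    const struct node *taken;      /* successor when the branch is taken */
    const struct node *not_taken;  /* successor when it falls through */
    unsigned taken_cost;           /* cycles charged to the taken edge */
    unsigned not_taken_cost;       /* cycles charged to the not-taken edge */
    unsigned freq;                 /* leaves only: integer execution count */
};

/* Worst case: the largest sum of edge costs on any root-to-leaf path. */
unsigned worst_case_cost(const struct node *n)
{
    unsigned a = 0, b = 0;
    if (n->taken != NULL)
        a = n->taken_cost + worst_case_cost(n->taken);
    if (n->not_taken != NULL)
        b = n->not_taken_cost + worst_case_cost(n->not_taken);
    return a > b ? a : b;
}

/* Sum over all leaves of (leaf frequency * path cost to that leaf).
   Dividing the result by the total frequency gives the expected cost. */
unsigned long weighted_cost(const struct node *n, unsigned path_cost)
{
    if (n->taken == NULL && n->not_taken == NULL)
        return (unsigned long)n->freq * path_cost;

    unsigned long sum = 0;
    if (n->taken != NULL)
        sum += weighted_cost(n->taken, path_cost + n->taken_cost);
    if (n->not_taken != NULL)
        sum += weighted_cost(n->not_taken, path_cost + n->not_taken_cost);
    return sum;
}
```

Calling weighted_cost(root, 0) yields an integer that ranks candidate trees for the same switch identically to the expected cost, since the total frequency is the same for every candidate.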
An interesting additional cost metric not considered in the present study is the optimization of code to reduce power consumption [Tiwari94]. Much like microbenchmarking, the cost of each node could be parameterized by chip-level simulation.

To evaluate whether the advanced switch statement implementations described above are applicable on real-world code, a survey of existing switch statement usage was undertaken. By using a specially instrumented version of GCC, the switch statements were extracted from the source code of 5,953 packages of the Debian Linux distribution.

These 5,953 Debian packages contained, or more accurately compiled, 522,681 switch statements. These switch statements contained 3,024,896 case ranges, giving a mean of 5.79 case ranges per switch statement. A case range is defined as a consecutive sequence of case values that map to the same non-default target label. The largest switch statement contained 4,541 case ranges. Table 1 summarizes the distribution of switch sizes, as measured by the number of case ranges.

    Range     Count    Fraction
    1         51804      9.91%
    2         62757     12.01%
    3         70489     13.49%
    4         49311      9.43%
    5–8       80043     15.31%
    9–16      66019     12.63%
    17–32     47947      9.17%
    33–64     31998      6.12%
    > 64      62313     11.92%

Table 2: Distribution of Switch Ranges
A more interesting example is taken from compiling the Linux kernel on IA-64. A frequent idiom is to switch on powers of two.

    switch (x) {
    case 1: goto L1;
    case 2: goto L2;
    case 4: goto L3;
    case 8: goto L4;
    ...
    case 16384: goto L15;
    case 32768: goto L16;
    }

The fastest solution found uses count-trailing-zeros as an imperfect hash function, which in turn makes use of the Itanium’s popcount instruction.

    int t = __builtin_ctz(x);
    static const void *T1[32] = {
        &&LT1, &&LT2, &&LT3, ...
        &&L0, &&L0, &&L0, &&L0 };
    goto *T1[t];
    LT1:
        if (x == 1)
            goto L1;
        goto L0;
    LT2:
        if (x == 2)
            goto L2;
        goto L0;
    ...

The switch statement lowering code in the current version of LLVM differs from that used by the current version of GCC in several minor respects. Instead of making the decision whether to use a table jump or sequence of bit tests once for the whole switch statement, LLVM uses a divide-and-conquer approach, allowing a sequence of binary comparisons to lead to leaf nodes that perform table jumps or bit tests to implement the remaining comparisons. This allows more than one jump table to be used to implement a single switch statement. The threshold for switching from comparisons to jump tables is slightly lower for LLVM, at a maximum of three comparisons vs. four for GCC. LLVM also has a stricter density requirement for jump tables than GCC.

Table 4: GCC Implementation Statistics

Of the 522,681 switch statements in the survey corpus, 5,135 have two singleton case values that branch to the same label, and of these, 1,564 (0.30%) have case values that differ by a single bit and are therefore suitable for ior safe hashing.

Of the GCC table jumps, 1,056 (0.20%) are binary and have a single non-default label and could therefore be more efficiently implemented with a Boolean index map.

When bit testing, if the number of bits set in the immediate constant is more than half of the available bits, on some architectures it may be cheaper to test the complement bit pattern and reverse the sense of the conditional branch. Analysis of the Debian data set reveals that this occurs extremely infrequently (62 times for binary bit tests, and 516 times for multiway bit tests).

Of the switch statements implemented by GCC using binary trees, 29,155 (5.58% of the total) have more than four case ranges, but failed to be implemented by a table jump for density reasons. We shall refer to this set as the difficult subset. The largest difficult switch statement, found in the conversions.c source file of Debian’s msort-8.42 package, had 1,080 case ranges, covering 1,202 case values branching to 320 unique destination labels.
Of this difficult subset, 1,806 (0.35%) were “powers of two” where each case value had only a single bit set. There were also 3,083 (0.59%) switch statements where every case value had trailing zero bits, and in 2,066 (0.40%) of these, rotating by the number of common trailing zero bits reduced the range (and therefore raised the density) enough to be implemented by a table jump.

Index mapping turns out to be a particularly effective implementation strategy for dealing with these 29,155 difficult cases. Using GCC’s current 10% density criterion, which permits using up to 40 bytes of memory per case value (on a 32-bit machine), index mapping can extend the applicability of table jumps to another 15,346 (2.94%) switch statements (or over half of the difficult cases). Additionally, of the 155,893 switch statements currently implemented by table jumps, 77,958 (14.92%) would have reduced memory consumption using “double dispatch” index mapping. Likewise, 19,790 (3.79%) of the current table jumps have fewer than 5 unique destinations, allowing the table jump to be eliminated completely by index mapping followed by sequential comparisons or a binary search tree.

One approach to improving GCC’s implementations of the difficult subset is to use a binary tree of comparisons to split/subset the original switch statement into partitions that are dense enough to be implemented by table jumps. Analysis of the difficult set reveals that 15,131 (2.89%) contain one (or more) subranges that contain enough case ranges and are dense enough to satisfy GCC’s table jump criteria. Table 5 gives the histogram of the number of table-jump clusters in the difficult subset. The maximum number of dense clusters observed in a switch statement is 28, found both in the inventor_2.1.5 package’s SoFieldConvertors.c++ and in the netrik_1.15.3 package’s parse-syntax.c.

    Clusters    Count    Fraction
    0           14024     48.10%
    1           12599     43.21%
    2            2005      6.88%
    3             356      1.22%
    4              67      0.23%
    5–8            85      0.29%
    9–16           10      0.03%
    > 16            9      0.03%

Table 5: Dense Clusters in Difficult Subset

An area of active research in switch statement lowering is how best to select the pivot element to partition switch statements to take advantage of these dense clusters. An easy-to-implement approach is to use the existing mechanism of selecting the median case value (or frequency-weighted median) as is traditionally done for constructing binary search trees. This has the unwanted effect of occasionally splitting dense clusters, reducing the number of potential jump tables. Bernstein [Bernstein85] suggested splitting at the largest gap between case values. Korobeynikov [Korobeynikov07] implemented a density-balancing metric used to select a reasonable pivot element in LLVM. A common failing with these two “greedy” approaches is that they inadvertently skew binary search trees even when there are no dense clusters to be found. These improve the performance of the rare (15,131 or 2.89%) binary trees that contain dense clusters at the expense of the far more common (187,502 or 35.87%) binary trees that should attempt to be suitably balanced independently of their case values’ integer values. A better global approach has been proposed by Delorie [Delorie04] that initially identifies the dense clusters and treats these much like case ranges, as leaves (or potentially internal nodes) when building a balanced binary search tree. These remain heuristics, as optimal switch statement lowering may occasionally require intentional splitting of dense clusters, much like testing frequent cases prior to table jumps can improve average case performance.

9 Future Work

The C# programming language extends the switch statement to allow switching on strings, where each case contains a constant string literal. The implementation of string switches is analogous to the classic computer science problem of keyword recognition. Additional strategies include perfect and imperfect hashing (as used by the GNU utility gperf), Patricia tries, or finite state machines (as used by the UNIX utilities flex and lex).

In this work, the implementation of a switch statement has been equated with that of a multiway branch. However, for many uses of the switch statement in real code, it is possible to avoid branching altogether and replace the switch with one or more table look-ups. For example, the Has30Days example presented earlier can be implemented as the following:
[Lefevre01] Vincent Lefèvre, “Multiplication by an Integer Constant,” citeseer:491491.html, 2001.

[Kannan94] Sampath Kannan and Todd A. Proebsting, “Correction to ‘Producing Good Code for the Case Statement’,” Software – Practice and Experience, Short Communication, Vol. 24, No. 2, pp. 233, February 1994.

[Korobeynikov07] Anton Korobeynikov, “Improved Switch Lowering for the LLVM Compiler System,” Proceedings of the 2007 Spring Young Researchers Colloquium on Software Engineering (SYRCoSE’2007), Moscow, Russia, May 2007.

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 2, No. 4, pp. 437–445, 1994.

[Warren02] Henry S. Warren, Jr., “Hacker’s Delight,” Addison-Wesley Publishers, 2002.

[Whaley08] R. Clint Whaley and Anthony M. Castaldo, “Achieving Accurate and Context-Sensitive Timing for Code Optimization,” Software – Practice and Experience, Accepted for publication, 2008.

[Wienskoski06] Edmar Wienskoski, “Switch Statement Case Reordering FDO,” GCC Summit 2006, Ottawa, Canada, pp. 235–241, June 2006.