
24

Bit Hacks for Games

Eric Lengyel
Terathon Software

Game programmers have long been known for coming up with clever tricks that allow various short calculations to be performed more efficiently. These tricks are often applied inside tight loops of code, where even a tiny savings in CPU clock cycles can add up to a significant boost in speed overall. The techniques usually employ some kind of logical bit manipulation, or “bit twiddling,” to obtain a result in a roundabout way with the goal of reducing the number of instructions, eliminating expensive instructions like divisions, or removing costly branches. This chapter describes a variety of interesting bit hacks that are likely to be applicable to game engine codebases.

Many of the techniques we describe require knowledge of the number of bits used to represent an integer value. The most efficient implementations of these techniques typically operate on integers whose size is equal to the native register width of the CPU running the code. This chapter is written for integer registers that are 32 bits wide, and that is the size assumed for the int type in C/C++. All of the techniques can be adapted to CPUs having different native register widths (most commonly, 64 bits) by simply changing occurrences of the constants 31 and 32 in the code listings and using the appropriately sized data type.

24.1 Integer Sign Manipulation

We begin with a group of techniques that can be used to extract or modify information about the sign of an integer value. These techniques, and many more throughout this chapter, rely on the assumption that a signed right shift using the >> operator preserves the sign of the input by smearing the original sign bit across the high bits opened by the shift operation. Strictly speaking, shifting a negative number to the right produces implementation-defined results according to the C++ standard¹, but all compilers likely to be encountered in game development give the expected outcome: if the bits are shifted to the right by n bit positions, then the value of the highest bit of the input is replicated to the highest n bits of the result.

¹ See Section 5.8, Paragraph 3 of the C++ standard.

Absolute Value

Most CPU instruction sets do not include an integer absolute value operation. The most straightforward way of calculating an absolute value involves comparing the input to zero and branching around a single instruction that negates the input if it happens to be less than zero. These kinds of code sequences execute with poor performance because the branch prevents good instruction scheduling and pollutes the branch history table used by the hardware for dynamic branch prediction.

A better solution makes clever use of the relationship

$$-x = {\sim}x + 1, \tag{24.1}$$

where the unary operator ~ represents the bitwise NOT operation that inverts each bit in x. In addition to using the ~ operator in C/C++, the NOT operation can be performed by taking the exclusive OR between an input value and a value whose bits are all 1s, which is the representation of the integer value -1. So we can rewrite Equation (24.1) in the form

$$-x = (x \mathbin{\text{\textasciicircum}} {-1}) - (-1), \tag{24.2}$$

where the reason for subtracting -1 at the end will become clear in a moment.

If we shift a signed integer right by 31 bits, then the result is a value that is all ones for any negative integer and all zeros for everything else. Let m be the value of x shifted right by 31 bits. Then a formula for the absolute value of x is given by

$$|x| = (x \mathbin{\text{\textasciicircum}} m) - m. \tag{24.3}$$

If $x < 0$, then $m = -1$ and this is equivalent to Equation (24.2). If $x \ge 0$, then $m = 0$, and the right-hand side of Equation (24.3) becomes a no-op because performing an exclusive OR with zero does nothing and subtracting zero does nothing. So now we have a simple formula that calculates the absolute value in three instructions, as shown in Listing 24.1. The negative absolute value operation is also
shown, and it is achieved by simply reversing the subtraction on the right-hand side of Equation (24.3).

Since there is no branching in Listing 24.1 (and the functions are inlined), the basic block sizes are larger in the calling code, and the compiler is free to schedule other instructions among the instructions used for the absolute value. This results in higher instruction throughput, and thus faster code.

    inline int Abs(int x)
    {
        int m = x >> 31;    // m = (x < 0) ? -1 : 0
        return ((x ^ m) - m);
    }

    inline int Nabs(int x)
    {
        int m = x >> 31;    // m = (x < 0) ? -1 : 0
        return (m - (x ^ m));
    }

Listing 24.1. These functions calculate the absolute value and negative absolute value.

Note that the absolute value function breaks down if the input value is 0x80000000. This is technically considered a negative number because the most significant bit is a one, but there is no way to represent its negation in 32 bits. The value 0x80000000 often behaves as though it were the opposite of zero, and like zero, is neither positive nor negative. Several other bit hacks discussed in this chapter also fail for this particular value, but in practice, no problems typically arise as a result.

Sign Function

The sign function sgn(x) is defined as

$$\operatorname{sgn}(x) = \begin{cases} 1, & \text{if } x > 0; \\ 0, & \text{if } x = 0; \\ -1, & \text{if } x < 0. \end{cases} \tag{24.4}$$

This function can be calculated efficiently without branching by realizing that for a nonzero input x, either x >> 31 is all ones or -x >> 31 is all ones, but not both. If x is zero, then both shifts also produce the value zero. This leads us to the code shown in Listing 24.2, which requires four instructions. Note that on PowerPC processors, the sign function can be evaluated in three instructions, but that sequence makes use of the carry bit, which is inaccessible in C/C++. (See [Hoxey et al. 1996] for details.)

    inline int Sgn(int x)
    {
        return ((x >> 31) - (-x >> 31));
    }

Listing 24.2. This function calculates the sign function given by Equation (24.4).

Sign Extension

Processors typically have native instructions that can extend the sign of an 8-bit or 16-bit integer quantity to the full width of a register. For quantities of other bit sizes, a sign extension can be achieved with two shift instructions, as shown in Listing 24.3. An n-bit integer is first shifted left by $32 - n$ bits so that the value occupies the n most significant bits of a register. (Note that this destroys the bits of the original value that are shifted out, so the state of those bits can be ignored in cases when an n-bit quantity is being extracted from a larger data word.) The result is shifted right by the same number of bits, causing the sign bit to be smeared.

On some PowerPC processors, it’s important that the value of n in Listing 24.3 be a compile-time constant because shifts by register values are microcoded and cause a pipeline flush.

    template <int n> inline int ExtendSign(int x)
    {
        return (x << (32 - n) >> (32 - n));
    }

Listing 24.3. This function extends the sign of an n-bit integer to a full 32 bits.

24.2 Predicates

We have seen that the expression x >> 31 can be used to produce a value of 0 or -1 depending on whether x is less than zero. There may also be times when we
want to produce a value of 0 or 1, and we might also want to produce these values based on different conditions. In general, there are six comparisons that we can make against zero, and an expression generating a value based on these comparisons is called a predicate.

Table 24.1 lists the six predicates and the branchless C/C++ code that can be used to generate a 0 or 1 value based on the boolean result of each comparison. Table 24.2 lists negations of the same predicates and the code that can be used to generate a mask of all 0s or all 1s (or a value of 0 or -1) based on the result of each comparison. The only difference between the code shown in Tables 24.1 and 24.2 is that the code in the first table uses unsigned shifts (a.k.a. logical shifts), and the second table uses signed shifts (a.k.a. arithmetic or algebraic shifts).

Predicate          Code                                 Instructions
x = (a == 0);      x = (unsigned) ~(a | -a) >> 31;      4
x = (a != 0);      x = (unsigned) (a | -a) >> 31;       3
x = (a > 0);       x = (unsigned) -a >> 31;             2
x = (a < 0);       x = (unsigned) a >> 31;              1
x = (a >= 0);      x = (unsigned) ~a >> 31;             2
x = (a <= 0);      x = (unsigned) (a - 1) >> 31;        2

Table 24.1. For each predicate, the code generates the value 1 if the condition is true and generates the value 0 if the condition is false. The type of a and x is signed integer.

Predicate          Code                       Instructions
x = -(a == 0);     x = ~(a | -a) >> 31;       4
x = -(a != 0);     x = (a | -a) >> 31;        3
x = -(a > 0);      x = -a >> 31;              2
x = -(a < 0);      x = a >> 31;               1
x = -(a >= 0);     x = ~a >> 31;              2
x = -(a <= 0);     x = (a - 1) >> 31;         2

Table 24.2. For each predicate, the code generates the value -1 if the condition is true and generates the value 0 if the condition is false. The type of a and x is signed integer.

On PowerPC processors, the cntlzw (count leading zeros word) instruction can be used to evaluate the first expression in Table 24.1, (a == 0), by calculating cntlzw(a) >> 5. This works because the cntlzw instruction produces the value 32 when its input is zero and a lower value for any other input. When shifted right five bits, the value 32 becomes 1, and all other values are completely shifted out to produce 0. A similar instruction called BSR (bit scan reverse) exists on x86 processors, but it produces undefined results when the input is zero, so it cannot achieve the same result without a branch to handle the zero case.

Predicates can be used to perform a variety of conditional operations. The expressions shown in Table 24.1 are typically used to change some other value by one (or a power of two with the proper left shift), and the expressions shown in Table 24.2 are typically used as masks that conditionally preserve or clear some other value. We look at several examples in the remainder of this section.

Conditional Increment and Decrement

To perform conditional increment or decrement operations, the expressions shown in Table 24.1 can simply be added to or subtracted from another value, respectively. For example, the conditional statement

    if (a >= 0) x++;

can be replaced by the following non-branching code:

    x += (unsigned) ~a >> 31;

Conditional Addition and Subtraction

Conditional addition and subtraction can be performed by using the expressions shown in Table 24.2 to mask the operations. For example, the conditional statement

    if (a >= 0) x += y;

can be replaced by the following non-branching code:

    x += y & (~a >> 31);

Increment or Decrement Modulo N

The mask for the predicate (a < 0) can be used to implement increment and decrement operations modulo a number n. Incrementing modulo 3 is particularly common in game programming because it’s used to iterate over the vertices of a
triangle or to iterate over the columns of a $3 \times 3$ matrix from an arbitrary starting index in the range $[0, 2]$.

To increment a number modulo n, we can subtract $n - 1$ from it and compare against zero. If the result is negative, then we keep the new value; otherwise, we wrap around to zero. For the decrement operation, we subtract one and then compare against zero. If the result is negative, then we add the modulus n. These operations are shown in Listing 24.4, and both generate four instructions when n is a compile-time constant.

    template <int n> inline int IncMod(int x)
    {
        return ((x + 1) & ((x - (n - 1)) >> 31));
    }

    template <int n> inline int DecMod(int x)
    {
        x--;
        return (x + ((x >> 31) & n));
    }

Listing 24.4. These functions increment and decrement the input value modulo n.

Clamping to Zero

Another use of masks is clamping against zero. The minimum and maximum functions shown in Listing 24.5 take a single input and clamp to a minimum of zero or a maximum of zero. On processors that have a logical AND with complement instruction, like the PowerPC, both of these functions generate only two instructions.

    inline int MinZero(int x)
    {
        return (x & (x >> 31));
    }

    inline int MaxZero(int x)
    {
        return (x & ~(x >> 31));
    }

Listing 24.5. These functions take the minimum and maximum of the input with zero.

Minimum and Maximum

Branchless minimum and maximum operations are difficult to achieve when they must work for the entire range of 32-bit integers. (However, see [1] for some examples that use special instructions.) They become much easier when we can assume that the difference between the two input operands doesn’t underflow or overflow when one is subtracted from the other. Another way of putting this is to say that the input operands always have two bits of sign or that they are always in the range $[-2^{30}, 2^{30} - 1]$. When this is the case, we can compare the difference to zero in order to produce a mask used to choose the minimum or maximum value. The code is shown in Listing 24.6. Both functions generate four instructions if a logical AND with complement is available; otherwise, the Min() function generates five instructions.

    inline int Min(int x, int y)
    {
        int a = x - y;
        return (x - (a & ~(a >> 31)));
    }

    inline int Max(int x, int y)
    {
        int a = x - y;
        return (x - (a & (a >> 31)));
    }

Listing 24.6. These functions return the minimum and maximum of a pair of integers when we can assume two bits of sign.

24.3 Miscellaneous Tricks

This section presents several miscellaneous tricks that can be used to optimize code. Some of the tricks are generic and can be applied to many different situations, while others are meant for specific tasks.

Clear the Least Significant 1 Bit

The least significant 1 bit of any value x can be cleared by logically ANDing it with $x - 1$. This property can be used to count the number of 1 bits in a value by repeatedly clearing the least significant 1 bit until the value becomes zero. (See [Anderson 2005] for more efficient methods, however.)
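
The bit-counting idea mentioned under “Clear the Least Significant 1 Bit” can be sketched as follows. This is only an illustration, not one of the chapter's listings: the function name CountOnes is hypothetical, and an unsigned 32-bit type is assumed so that the subtraction wraps in a well-defined way.

```cpp
#include <cstdint>

// Hypothetical helper (not from the chapter): counts the 1 bits in x by
// repeatedly clearing the least significant 1 bit with x & (x - 1).
inline int CountOnes(std::uint32_t x)
{
    int count = 0;
    while (x != 0)
    {
        x &= x - 1;    // Clear the least significant 1 bit.
        ++count;       // One more 1 bit has been removed.
    }
    return count;
}
```

The loop runs once per 1 bit rather than once per bit position, so sparse values finish quickly, but it still contains a data-dependent branch.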

Test for Power of Two

If we clear the least significant 1 bit and find that the result is zero, then we know that the original value was either zero or was a power of two because only a single bit was set. Functions that test whether a value is a power of two, with and without returning a positive result for zero, are shown in Listing 24.7.

    inline bool PowerOfTwo(int x)
    {
        int y = x - 1;    // y is negative only if x == 0.
        return ((x & y) - (y >> 31) == 0);
    }

    inline bool PowerOfTwoOrZero(int x)
    {
        return ((x & (x - 1)) == 0);
    }

Listing 24.7. These functions test whether a value is a power of two.

Test for Power of Two Minus One

The bits to the right of the least significant 0 bit of any value x can be cleared by logically ANDing it with $x + 1$. If the result is zero, then that means that the value was composed of a contiguous string of n 1 bits with no trailing 0 bits, which is the representation of $2^n - 1$. Note that this test gives a positive result for zero, which is correct because zero is one less than $2^0$.

Determine Whether a Voxel Contains Triangles

In the marching cubes algorithm, an 8-bit code is determined for every cell in a voxel grid, and this code maps to a set of triangulation cases that tell how many vertices and triangles are necessary to extract the isosurface within the cell. The codes 0x00 and 0xFF correspond to cells containing no triangles, and such cells are skipped during the mesh generation process. We would like to avoid making two comparisons and instead make only one so that empty cells are skipped more efficiently.

Suppose that voxel values are signed 8-bit integers. We form the case code by shifting the sign bits from the voxels at each of the eight cell corners into a different position in an 8-bit byte, as shown in Listing 24.8. We then exclusive OR the case code with any one of the voxel values shifted right seven bits. The result is zero for exactly the cases 0x00 and 0xFF, and it’s nonzero for everything else.

    unsigned long caseCode = ((corner[0] >> 7) & 0x01)
                           | ((corner[1] >> 6) & 0x02)
                           | ((corner[2] >> 5) & 0x04)
                           | ((corner[3] >> 4) & 0x08)
                           | ((corner[4] >> 3) & 0x10)
                           | ((corner[5] >> 2) & 0x20)
                           | ((corner[6] >> 1) & 0x40)
                           | (corner[7] & 0x80);

    if ((caseCode ^ ((corner[7] >> 7) & 0xFF)) != 0)
    {
        // Cell has a nontrivial triangulation.
    }

Listing 24.8. The case code for a cell is constructed by shifting the sign bits from the eight corner voxel values into specific bit positions. One of the voxel values is shifted seven bits right to produce a mask of all 0s or all 1s, and it is then exclusive ORed with the case code to determine whether a cell contains triangles.

Determine the Index of the Greatest Value in a Set of Three

Given a set of three values (which could be floating-point), we sometimes need to determine the index of the largest value in the set, in the range $[0, 2]$. In particular, this arises when finding the support point for a triangle in a given direction for the Gilbert-Johnson-Keerthi (GJK) algorithm. It’s easy to perform a few comparisons and return different results based on the outcome, but we would like to eliminate some of the branches involved in doing that.

For a set of values $\{v_0, v_1, v_2\}$, Table 24.3 enumerates the six possible combinations of the truth values for the comparisons $v_1 > v_0$, $v_2 > v_0$, and $v_2 > v_1$, which we label as $b_0$, $b_1$, and $b_2$, respectively. As it turns out, the sum of b0 | b1 and b1 & b2 produces the correct index for all possible cases. This leads us to the code shown in Listing 24.9.
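
The claim that (b0 | b1) + (b1 & b2) always yields the index of the largest value can be checked by brute force. The sketch below is an illustration only: the names IndexFromComparisons, IndexReference, and FormulaMatchesReference are hypothetical, and the check assumes three distinct values, so it says nothing about how ties are resolved.

```cpp
#include <algorithm>
#include <array>

// Index formula described in the text, written against three comparisons.
template <typename T> int IndexFromComparisons(const T *value)
{
    bool b0 = (value[1] > value[0]);
    bool b1 = (value[2] > value[0]);
    bool b2 = (value[2] > value[1]);
    return ((b0 | b1) + (b1 & b2));
}

// Hypothetical reference: index of the largest element via the standard library.
template <typename T> int IndexReference(const T *value)
{
    return static_cast<int>(std::max_element(value, value + 3) - value);
}

// Returns true if the formula agrees with the reference for all six
// orderings of three distinct values.
inline bool FormulaMatchesReference()
{
    std::array<int, 3> v = {1, 2, 3};
    do
    {
        if (IndexFromComparisons(v.data()) != IndexReference(v.data()))
        {
            return false;
        }
    } while (std::next_permutation(v.begin(), v.end()));

    return true;
}
```

Because only the relative order of the three values matters, the six permutations of {1, 2, 3} cover every case with distinct inputs.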

Case          b0 = (v1 > v0)   b1 = (v2 > v0)   b2 = (v2 > v1)   b0 | b1   b1 & b2   Sum
v0 largest    0                0                0                0         0         0
              0                0                1                0         0         0
v1 largest    1                0                0                1         0         1
              1                1                0                1         0         1
v2 largest    0                1                1                1         1         2
              1                1                1                1         1         2
Impossible    0                1                0                1         0         1
              1                0                1                1         0         1

Table 24.3. This table lists all possible combinations for the truth values $b_0$, $b_1$, and $b_2$ relating the values $v_0$, $v_1$, and $v_2$. The sum of b0 | b1 and b1 & b2 gives the index of the largest value.

    template <typename T> int GetGreatestValueIndex(const T *value)
    {
        bool b0 = (value[1] > value[0]);
        bool b1 = (value[2] > value[0]);
        bool b2 = (value[2] > value[1]);
        return ((b0 | b1) + (b1 & b2));
    }

Listing 24.9. This function returns the index, in the range $[0, 2]$, corresponding to the largest value in a set of three.

24.4 Logic Formulas

We end this chapter with Table 24.4, which lists several simple logic formulas and their effect on the binary representation of a signed integer. (See also [Ericson 2008].) With the exception of the last entry in the table, all of the formulas can be calculated using two instructions on the PowerPC processor due to the availability of the andc, orc, and eqv instructions. Other processors may require up to three instructions.

Formula         Operation / Effect                                        Notes
x & (x - 1)     Clear lowest 1 bit.                                       If result is 0, then x is $2^n$.
x | (x + 1)     Set lowest 0 bit.
x | (x - 1)     Set all bits to right of lowest 1 bit.
x & (x + 1)     Clear all bits to right of lowest 0 bit.                  If result is 0, then x is $2^n - 1$.
x & -x          Extract lowest 1 bit.
~x & (x + 1)    Extract lowest 0 bit (as a 1 bit).
~x | (x - 1)    Create mask for bits other than lowest 1 bit.
x | ~(x + 1)    Create mask for bits other than lowest 0 bit.
x | -x          Create mask for bits left of lowest 1 bit, inclusive.
x ^ -x          Create mask for bits left of lowest 1 bit, exclusive.
~x | (x + 1)    Create mask for bits left of lowest 0 bit, inclusive.
~x ^ (x + 1)    Create mask for bits left of lowest 0 bit, exclusive.     Also x ≡ (x + 1).
x ^ (x - 1)     Create mask for bits right of lowest 1 bit, inclusive.    0 becomes -1.
~x & (x - 1)    Create mask for bits right of lowest 1 bit, exclusive.    0 becomes -1.
x ^ (x + 1)     Create mask for bits right of lowest 0 bit, inclusive.    -1 remains -1.
x & (~x - 1)    Create mask for bits right of lowest 0 bit, exclusive.    -1 remains -1.

Table 24.4. Logic formulas and their effect on the binary representation of a signed integer.

References

[Anderson 2005] Sean Eron Anderson. “Bit Twiddling Hacks.” 2005. Available at http://graphics.stanford.edu/~seander/bithacks.html.

[Ericson 2008] Christer Ericson. “Advanced Bit Manipulation-fu.” realtimecollisiondetection.net - the blog, August 24, 2008. Available at http://realtimecollisiondetection.net/blog/?p=78.

[Hoxey et al. 1996] Steve Hoxey, Faraydon Karim, Bill Hay, and Hank Warren, eds. The PowerPC Compiler Writer’s Guide. Palo Alto, CA: Warthman Associates, 1996.
