Bit Twiddling Hacks

By Sean Eron Anderson seander@cs.stanford.edu
Individually, the code snippets here are in the public domain (unless otherwise noted) — feel free to use them however you please. The aggregate collection and descriptions are © 1997-2005 Sean Eron Anderson. The code and descriptions are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY and without even the implied warranty of merchantability or fitness for a particular purpose. As of May 5, 2005, all the code has been tested thoroughly. Thousands of people have read it. Moreover, Professor Randal Bryant, the Dean of Computer Science at Carnegie Mellon University, has personally tested almost everything with his Uclid code verification system. What he hasn't tested, I have checked against all possible (32-bit) inputs. To the first person to inform me of a legitimate bug in the code, I'll pay a bounty of US$10 (by check or Paypal).

Contents
• About the operation counting methodology
• Compute the sign of an integer
• Compute the integer absolute value (abs) without branching
• Compute the minimum (min) or maximum (max) of two integers without branching
• Determining if an integer is a power of 2
• Sign extending
  o Sign extending from a constant bit width
  o Sign extending from a variable bit-width
  o Sign extending from a variable bit-width in 3 operations
• Conditionally set or clear bits without branching
• Merge bits from two values according to a mask
• Counting bits set
  o Counting bits set, naive way
  o Counting bits set by lookup table
  o Counting bits set, Brian Kernighan's way
  o Counting bits set in 14, 24, or 32-bit words using 64-bit instructions
  o Counting bits set, in parallel
• Computing parity (1 if an odd number of bits set, 0 otherwise)
  o Compute parity of a word the naive way
  o Compute parity by lookup table
  o Compute parity of a byte using 64-bit multiply and modulus division
  o Compute parity in parallel
• Swapping Values
  o Swapping values with XOR
  o Swapping individual bits with XOR
• Reversing bit sequences
  o Reverse bits the obvious way
  o Reverse bits in word by lookup table
  o Reverse the bits in a byte with 3 operations (64-bit multiply and modulus division)
  o Reverse the bits in a byte with 4 operations (64-bit multiply, no division)
  o Reverse the bits in a byte with 7 operations (no 64-bit, only 32)
  o Reverse an N-bit quantity in parallel with 5 * lg(N) operations
• Modulus division (aka computing remainders)
  o Computing modulus division by 1 << s without a division operation (obvious)
  o Computing modulus division by (1 << s) - 1 without a division operation
  o Computing modulus division by (1 << s) - 1 in parallel without a division operation
• Finding integer log base 2 of an integer (aka the position of the highest bit set)
  o Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)
  o Find the integer log base 2 of an integer with an 64-bit IEEE float
  o Find the log base 2 of an integer with a lookup table
  o Find the log base 2 of an N-bit integer in O(lg(N)) operations
  o Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup
• Find integer log base 10 of an integer
• Find integer log base 2 of a 32-bit IEEE float
• Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for unsigned integer r)
• Counting consecutive trailing zero bits (or finding bit indices)
  o Count the consecutive zero bits (trailing) on the right in parallel
  o Count the consecutive zero bits (trailing) on the right by binary search
  o Count the consecutive zero bits (trailing) on the right by casting to a float
  o Count the consecutive zero bits (trailing) on the right with modulus division and lookup
  o Count the consecutive zero bits (trailing) on the right with multiply and lookup
• Round up to the next highest power of 2 by float casting
• Round up to the next highest power of 2
• Interleaving bits (aka computing Morton Numbers)
  o Interleave bits the obvious way
  o Interleave bits by table lookup
  o Interleave bits with 64-bit multiply
  o Interleave bits by Binary Magic Numbers
• Testing for ranges of bytes in a word (and counting occurrences found)
  o Determine if a word has a zero byte
  o Determine if a word has a byte less than n
  o Determine if a word has a byte greater than n
  o Determine if a word has a byte between m and n
About the operation counting methodology
When totaling the number of operations for algorithms here, any C operator is counted as one operation. Intermediate assignments, which need not be written to RAM, are not counted. Of course, this operation counting approach only serves as an approximation of the actual number of machine instructions and CPU time. All operations are assumed to take the same amount of time, which is not true in reality, but CPUs have been heading increasingly in this direction over time. There are many nuances that determine how fast a system will run a given sample of code, such as cache sizes, memory bandwidths, instruction sets, etc. In the end, benchmarking is the best way to determine whether one method is really faster than another, so consider the techniques below as possibilities to test on your target architecture.

Compute the sign of an integer
int v;      // we want to find the sign of v
int sign;   // the result goes here

// CHAR_BIT is the number of bits per byte (normally 8).

sign = -(v < 0);  // if v < 0 then -1, else 0.
// or, to avoid branching on CPUs with flag registers (IA32):
sign = -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1));
// or, for one less instruction (but not portable):
sign = v >> (sizeof(int) * CHAR_BIT - 1);

The last expression above evaluates to sign = v >> 31 for 32-bit integers. This is one operation faster than the obvious way, sign = -(v < 0). This trick works because when signed integers are shifted right, the value of the far left bit is copied to the other bits. The far left bit is 1 when the value is negative and 0 otherwise; all 1 bits gives -1. Unfortunately, this behavior is architecture-specific. Alternatively, if you prefer the result be either -1 or +1, then use:
sign = +1 | (v >> (sizeof(int) * CHAR_BIT - 1));  // if v < 0 then -1, else +1

Alternatively, if you prefer the result be either -1, 0, or +1, then use:
sign = (v != 0) | -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1));
// Or, for more speed but less portability:
sign = (v != 0) | (v >> (sizeof(int) * CHAR_BIT - 1)); // -1, 0, or +1
// Or, for portability, brevity, and (perhaps) speed:
sign = (v > 0) - (v < 0); // -1, 0, or +1
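For illustration, here is a small self-contained check (the helper names sign_portable and sign_reference are only for this example) that confirms the portable -1/0/+1 form agrees with a straightforward conditional version:

#include <assert.h>
#include <limits.h>

static int sign_portable(int v)  { return (v > 0) - (v < 0); }            // -1, 0, or +1
static int sign_reference(int v) { return v < 0 ? -1 : (v > 0 ? 1 : 0); } // obvious version

int main(void)
{
  int tests[] = { INT_MIN, -42, -1, 0, 1, 42, INT_MAX };
  for (unsigned i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
  {
    assert(sign_portable(tests[i]) == sign_reference(tests[i]));
  }
  return 0;
}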

Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work. For greater portability, Toby Speight suggested on September 28, 2005 that CHAR_BIT be used here and throughout rather than assuming bytes were 8 bits long. Angus recommended the more portable versions above, involving casting, on March 4, 2006.

Compute the integer absolute value (abs) without branching

int v;           // we want to find the absolute value of v
int r;           // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;

r = (v + mask) ^ mask;

Patented variation:

r = (v ^ mask) - mask;

Some CPUs don't have an integer absolute value instruction (or the compiler fails to use them). On machines where branching is expensive, the above expression can be faster than the obvious approach, r = (v < 0) ? -v : v, even though the number of operations is the same.

Caveats: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work. I've read that ANSI C does not require values to be represented as two's complement, so it may not work for that reason as well (on a diminishingly small number of old machines that still use one's complement). On March 14, 2004, Keith H. Duggar sent me the patented variation above; it is superior to the one I initially came up with, r = (+1 | (v >> (sizeof(int) * CHAR_BIT - 1))) * v, because a multiply is not used. Unfortunately, this method has been patented in the USA on June 6, 2000 by Vladimir Yu Volkonsky and assigned to Sun Microsystems. On August 13, 2006, Yuriy Kaminskiy told me that the patent is likely invalid because the method was published well before the patent was even filed, such as in How to Optimize for the Pentium Processor by Agner Fog, dated November 9, 1996, which Vladimir could have read. Yuriy also mentioned that this document was translated to Russian in 1997; moreover, the Internet Archive also has an old link to it. On January 30, 2007, Peter Kankowski shared with me an abs version he discovered that was inspired by Microsoft's Visual C++ compiler output. It is featured here as the primary solution.

Compute the minimum (min) or maximum (max) of two integers without branching

int x;  // we want to find the minimum of x and y
int y;
int r;  // the result goes here

r = y + ((x - y) & -(x < y)); // min(x, y)

On machines where branching is expensive, the above expression can be faster than the obvious approach, r = (x < y) ? x : y, even though it involves two more instructions. It works because if x < y, then -(x < y) will be all ones, so r = y + (x - y) & ~0 = y + x - y = x. Otherwise, if x >= y, then -(x < y) will be all zeros, so r = y + ((x - y) & 0) = y. On some machines, evaluating (x < y) as 0 or 1 requires a branch instruction, so there may be no advantage.

To find the maximum, use:

r = x - ((x - y) & -(x < y)); // max(x, y)

Quick and dirty versions:

If you know that INT_MIN <= x - y <= INT_MAX, then you can use the following, which are faster because (x - y) only needs to be evaluated once.

r = y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
r = x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)

(Note that the 1989 ANSI C specification doesn't specify the result of signed right-shift, so these aren't portable.)

On March 7, 2003, Angus Duggan pointed out the right-shift portability issue. On May 3, 2005, Randal E. Bryant alerted me to the need for the precondition, INT_MIN <= x - y <= INT_MAX, and suggested the non-quick and dirty version as a fix. Both of these issues concern only the quick and dirty version. Nigel Horspoon pointed out on July 6, 2005 that gcc produced the same code on a Pentium as the obvious solution because of how it evaluates (x < y).
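As a quick worked check (the function names min_branchless and max_branchless are only for this example), the following verifies the branchless expressions against the obvious ternary forms for a few pairs that respect the INT_MIN <= x - y <= INT_MAX precondition:

#include <assert.h>
#include <limits.h>

static int min_branchless(int x, int y) { return y + ((x - y) & -(x < y)); }
static int max_branchless(int x, int y) { return x - ((x - y) & -(x < y)); }

int main(void)
{
  int pairs[][2] = { {3, 7}, {7, 3}, {-5, 5}, {0, 0}, {-100, -1} };
  for (unsigned i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++)
  {
    int x = pairs[i][0], y = pairs[i][1];
    assert(min_branchless(x, y) == (x < y ? x : y));
    assert(max_branchless(x, y) == (x > y ? x : y));
  }
  return 0;
}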

Determining if an integer is a power of 2

unsigned int v; // we want to see if v is a power of 2
bool f;         // the result goes here

f = (v & (v - 1)) == 0;

Note that 0 is incorrectly considered a power of 2 here. To remedy this, use:

f = v && !(v & (v - 1));

Sign extending from a constant bit width

Sign extension is automatic for built-in types, such as chars and ints. But suppose you have a signed two's complement number, x, that is stored using only b bits. Moreover, suppose you want to convert x to an int, which has more than b bits. A simple copy will work if x is positive, but if negative, the sign must be extended. For example, if we have only 4 bits to store a number, then -3 is represented as 1101 in binary. If we have 8 bits, then -3 is 11111101. The most significant bit of the 4-bit representation is replicated sinistrally to fill in the destination when we convert to a representation with more bits; this is sign extending. In C, sign extension from a constant bit width is trivial, since bit fields may be specified in structs or unions. For example, to convert from 5 bits to a full integer:

int x; // convert this from using 5 bits to a full int
int r; // resulting sign extended number goes here
struct {signed int x:5;} s;
r = s.x = x;

The following is a C++ template function that uses the same language feature to convert from B bits in one operation (though the compiler is generating more, of course).

template <typename T, unsigned B>
inline T signextend(const T x)
{
  struct {T x:B;} s;
  return s.x = x;
}

int r = signextend<signed int,5>(x);  // sign extend 5 bit number x to r

John Byrd caught a typo in the code (attributed to html formatting) on May 2, 2005. On March 4, 2006, Pat Wood pointed out that the ANSI C standard requires that the bitfield have the keyword "signed" to be signed; otherwise, the sign is undefined.

Sign extending from a variable bit-width

Sometimes we need to extend the sign of a number but we don't know a priori the number of bits, b, in which it is represented. (Or we could be programming in Java, which lacks bitfields.)

unsigned b; // number of bits representing the number in x
int x;      // sign extend this b-bit number to r
int r;      // resulting sign-extended number
int const m = 1 << (b - 1); // mask can be pre-computed if b is fixed

r = -(x & m) | x;

The code above requires five operations. Sean A. Irvine suggested that I add sign extension methods to this page on June 13, 2004, and he provided m = (1 << (b - 1)) - 1; r = -(x & ~m) | x; as a starting point from which I optimized.

Sign extending from a variable bit-width in 3 operations

The following may be slow on some machines, due to the effort required for multiplication and division. This version is 4 operations. If you know that your initial bit width, b, is greater than 1, you might do this type of sign extension in 3 operations by using r = (x * multipliers[b]) / multipliers[b], which requires only one array lookup.

unsigned b; // number of bits representing the number in x
int x;      // sign extend this b-bit number to r
int r;      // resulting sign-extended number
#define M(B) (1 << ((sizeof(x) * CHAR_BIT) - B)) // CHAR_BIT=bits/byte
int const multipliers[] =
{
  0,     M(1),  M(2),  M(3),  M(4),  M(5),  M(6),  M(7),
  M(8),  M(9),  M(10), M(11), M(12), M(13), M(14), M(15),
  M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),
  M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31),
  M(32)
}; // (add more if using more than 64 bits)
int const divisors[] =
{
  1,    ~M(1), M(2),  M(3),  M(4),  M(5),  M(6),  M(7),
  M(8),  M(9), M(10), M(11), M(12), M(13), M(14), M(15),
  M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),
  M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31),
  M(32)
}; // (add more for 64 bits)
#undef M
r = (x * multipliers[b]) / divisors[b];

The following variation is not portable, but on architectures that employ an arithmetic right-shift, maintaining the sign, it should be fast:

const int s = -b; // OR:  sizeof(x) * CHAR_BIT - b;
r = (x << s) >> s;

Randal E. Bryant pointed out a bug on May 3, 2005 in an earlier version (that used multipliers[] for divisors[]), where it failed on the case of x=1 and b=1.
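For a concrete check of the variable bit-width formula (the helper name signextend_var is only for this example), the following compares r = -(x & m) | x against the bitfield trick for every 5-bit pattern, assuming the bits of x above position b-1 are already zero:

#include <assert.h>

// Sign extend the low b bits of x (1 < b <= 31), assuming higher bits of x are zero.
static int signextend_var(int x, unsigned b)
{
  int const m = 1 << (b - 1);   // sign bit of the b-bit field
  return -(x & m) | x;
}

int main(void)
{
  struct { signed int x:5; } s;   // reference: the bitfield trick shown earlier
  for (int v = 0; v < 32; v++)    // every 5-bit pattern
  {
    s.x = v;
    assert(signextend_var(v, 5) == s.x);
  }
  return 0;
}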

Conditionally set or clear bits without branching

bool f;         // conditional flag
unsigned int m; // the bit mask
unsigned int w; // the word to modify:  if (f) w |= m; else w &= ~m;

w ^= (-f ^ w) & m;

On some architectures, the lack of branching can more than make up for what appears to be twice as many operations. For instance, informal speed tests on an AMD Athlon™ XP 2100+ indicated it was 5-10% faster. Glenn Slayden informed me of this expression on December 11, 2003.

Merge bits from two values according to a mask

unsigned int a;    // value to merge in non-masked bits
unsigned int b;    // value to merge in masked bits
unsigned int mask; // 1 where bits from b should be selected; 0 where from a.
unsigned int r;    // result of (a & ~mask) | (b & mask) goes here

r = a ^ ((a ^ b) & mask);

This shaves one operation from the obvious way of combining two sets of bits according to a bit mask. If the mask is a constant, then there may be no advantage. Ron Jeffery sent this to me on February 9, 2006.
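A small worked example (the constants are only illustrative) showing that the one-operation-cheaper merge agrees with the obvious form, here blending alternate bytes from two words:

#include <assert.h>

int main(void)
{
  unsigned int a    = 0x12345678;   // supplies bits where mask is 0
  unsigned int b    = 0xABCDEF01;   // supplies bits where mask is 1
  unsigned int mask = 0x00FF00FF;   // select b's bits in every other byte

  unsigned int r = a ^ ((a ^ b) & mask);     // the cheaper form
  assert(r == ((a & ~mask) | (b & mask)));   // the obvious form agrees
  assert(r == 0x12CD5601);                   // bytes taken alternately from a and b
  return 0;
}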

// Option 2: unsigned char * p = (unsigned char *) &v. So on a 32bit word with only the high set. 3. 4. 3. 5. 5. 4. 5. 3. 3. 5. 4. 3. 6. 1. i < 256. 5. 3. 6. 2. 4. 5. 6. 4. 7. 6. 5. 5. 4. 6. 3. 6. 4. }. 6. 7. 4. 4. 3. 3. 2. 2. 5. 4. 6. 3. 3. 2. 4. 5. 4. 3. 5. v. 4. 5. 6. 2. 2. 5. 1. 3. 4. 6. 2. 5. 3. 3. 2. 4. 4. 2. 3. 6. 4. 4. 4. 2. 2. 4. 7. 5. 3. 2. 5. 2. v >>= 1) { c += v & 1. 4. 4. 3. 5. 5. 4. 6. 5. 4. 6. 4. 5. 4. 5. 3. 3. 3. 6. 4. 1. 4. 3. 4. 5. 6. 2. } The naive approach requires one iteration per bit. 4. 6. 6. 4. 5. 3. 3. 3. 3. 5. 5. 4. 3. 5. 6. 5. 5. 1. 5. it will go through 32 iterations. 5. 4. 4. 5. 5. 4. 3. 5. 3. 3. 6. 2. 4. 6. 1. 5. 7. 5. 4. 3. 5. 4. 2. 4. 4. 3. 7. 4. 4. 3. 2. 7. 4. 4. 2. 4. until no more bits are set. 5. // c is the total bits set in v // Option 1: c = BitsSetTable256[v & 0xff] + BitsSetTable256[(v >> 8) & 0xff] + BitsSetTable256[(v >> 16) & 0xff] + BitsSetTable256[v >> 24]. 6. 4. 2. 8 unsigned int v. 6. 3. 3. 4. 3. 5. 4. 3. 3. 3. 3. 2. 5. // count the number of bits set in 32-bit value v unsigned int c. 5. 2. 5. 3. 5. 4. Counting bits set by lookup table const unsigned char { 0. 4. 4. 5. 5. 6. 3. 2. 4. 6. 5. 6. 3. 4. 2. 4. 4. 4. 3.Counting bits set (naive way) unsigned int v. 6. 2. 5. 4. 3. 3. 2. 2. 2. // To initially generate the table algorithmically: BitsSetTable256[0] = 0. 5. 4. 4. 3. 3. 4. 5. 3. 4. 6. i++) . BitsSetTable256[] = 2. 3. 4. 7. // c accumulates the total bits set in v for (c = 0. 1. 4. 5. 4. 5. 4. 3. 7. 5. 4. 4. // count the number of bits set in v unsigned int c. 3. 6. 5. 5. 3. 4. 5. 3. 5. 1. for (int i = 0. 3. 4. 1. 5. 4. c = BitsSetTable256[p[0]] + BitsSetTable256[p[1]] + BitsSetTable256[p[2]] + BitsSetTable256[p[3]]. 3.

(by Brian W. c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f. Gosper. Bryant offered a couple bug fixes . or 32-bit words using 64-bit instructions unsigned int v. and Schroeppel. So if we have a 32-bit word with only the high bit set. 1972.. // count the number of bits set in v unsigned int c.. c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f. Published in 1988. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach. Brian Kernighan's way unsigned unsigned for (c = { v &= v } int v. Ritchie) mentions this in exercise 2-9. 24. R. // c accumulates the total bits set in v 0. 29. MIT AI Memo 239. HAKMEM.)" Counting bits set in 14. } Counting bits set. see the Programming Hacks section of Beeler. On April 19. // option 3. similiar to option 1. Feb. His method was the inspiration for the variants above.1. Randal E. the second option takes 10.{ BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2]. W. The first option takes only 3 operations. devised by Sean Anderson. Rich Schroeppel originally created a 9-bit version. then it will only go once through the loop. for at most 32-bit values in v: c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f. R. // clear the least significant bit set Brian Kernighan's method goes through as many iterations as there are set bits. for at most 14-bit values in v: c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf. c++) . and the third option takes 15. // c accumulates the total bits set in v // option 1. Kernighan and Dennis M. v. This method requires a 64-bit CPU with fast modulus division to be efficient. the C Programming Language 2nd Ed. for at most 24-bit values in v: c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f. 322. c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f. 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960). // count the number of bits set in v int c. M. // option 2.
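To show Brian Kernighan's method in a runnable form (the wrapper name popcount_kernighan is only for this example), note that the loop body executes once per set bit, so a word with a single bit set takes a single iteration:

#include <assert.h>

// Clears the lowest set bit once per iteration, so the loop runs only as
// many times as there are 1 bits in v.
static unsigned int popcount_kernighan(unsigned int v)
{
  unsigned int c;
  for (c = 0; v; c++)
  {
    v &= v - 1;   // clear the least significant bit set
  }
  return c;
}

int main(void)
{
  assert(popcount_kernighan(0) == 0);
  assert(popcount_kernighan(0x80000000u) == 1);   // only the high bit: one iteration
  assert(popcount_kernighan(0xFFu) == 8);
  assert(popcount_kernighan(0xF0F0F0F0u) == 16);
  return 0;
}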

2007. (v + (v >> 4)) & (T)~(T)0/255*15. >> S[3]) + c) & B[3]. 0x0F0F0F0F. though it doesn't use 64-bit instructions. // temp c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24. A generalization of the best bit counting method to integers of bit widths upto 128 (parameterized by type T) is this: v v v c = = = = v .((v >> 1) & 0x55555555). ((v >> 1) & B[0]). It is a hybrid between the purely parallel method above and the earlier methods using multiplies (in the section on counting bits with 64-bit instructions). = = = = = 01010101 00110011 00001111 00000000 00000000 01010101 00110011 00001111 11111111 00000000 01010101 00110011 00001111 00000000 11111111 01010101 00110011 00001111 11111111 11111111 The B array. (v & (T)~(T)0/15*3) + ((v >> 2) & (T)~(T)0/15*3). Bruce Dawson tweaked what had been a 12-bit version and made it suitable for 14 bits using the same number of operations on Feburary 1. The counts of bits set in the bytes is done in parallel.on May 3. const int S[] = const int B[] = 0x0000FFFF}. and we must compute the same number of expressions for c as S or B are long. is: B[0] B[1] B[2] B[3] B[4] 0x55555555 0x33333333 0x0F0F0F0F 0x00FF00FF 0x0000FFFF We can adjust the method for larger integer sizes by continuing with the patterns for the Binary Magic Numbers. 0x33333333. 16 operations are used. 2. // reuse input as temporary v = (v & 0x33333333) + ((v >> 2) & 0x33333333). >> S[2]) + c) & B[2]. 8. // Magic Binary Numbers {0x55555555. 16}. which is the same as the lookuptable method. // // // // temp temp temp count . >> S[4]) + c) & B[4]. but avoids the memory and potential cache misses of a table. c c c c c c = = = = = = v. expressed as binary. unsigned int c.((v >> 1) & (T)~(T)0/3). The best method for counting bits in a 32-bit integer v is the following: v = v . If there are k bits. in parallel unsigned int v.1) * CHAR_BIT. B and S. Counting bits set. // count The best bit counting method takes only 12 operations. v ((c ((c ((c ((c = = = = = // count bits set in this (32-bit value) // store the total here {1. 0x00FF00FF. (T)(v * ((T)~(T)0/255)) >> (sizeof(v) . 4. 2005. For a 32-bit v. and the sum total of the bits set in the bytes is computed by multiplying by 0x1010101 and shifting right 24 bits. then we need the arrays S and B to be ceil(lg(k)) elements long. >> S[1]) & B[1]) + (c & B[1]).
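To make the type-parameterized generalization concrete, here is how it comes out when T is a 64-bit unsigned word (a sketch; the function name popcount64 is only for this example):

#include <assert.h>
#include <limits.h>
#include <stdint.h>

static unsigned int popcount64(uint64_t v)
{
  v = v - ((v >> 1) & (uint64_t)~(uint64_t)0 / 3);                    // count pairs
  v = (v & (uint64_t)~(uint64_t)0 / 15 * 3) +
      ((v >> 2) & (uint64_t)~(uint64_t)0 / 15 * 3);                   // count nibbles
  v = (v + (v >> 4)) & (uint64_t)~(uint64_t)0 / 255 * 15;             // count bytes
  return (uint64_t)(v * ((uint64_t)~(uint64_t)0 / 255)) >>
         (sizeof(uint64_t) - 1) * CHAR_BIT;                           // sum the byte counts
}

int main(void)
{
  assert(popcount64(0) == 0);
  assert(popcount64(0xFFFFFFFFFFFFFFFFULL) == 64);
  assert(popcount64(0x8000000000000001ULL) == 2);
  assert(popcount64(0x0123456789ABCDEFULL) == 32);
  return 0;
}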

The method above takes around 4 operations. The best bit counting method was brought to my attention on October 5. v = v & (v . Computing parity the naive way unsigned int v. I made a typo with Don's suggestion that Eric Cole spotted on January 8. bool parity = false. } // word value to compute the parity of // parity will be the parity of b The above code uses an approach like Brian Kernigan's bit counting. v ^= v >> 8. but only works on bytes.". v &= 0xf. // byte value to compute the parity of bool parity = (((b * 0x0101010101010101ULL) & 0x8040201008040201ULL) % 0x1FF) & 1. he found it in pages 187-188 of Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors. 2005. He also pointed out that if rotate operations were available. Charlie Gordon suggested a way to shave off one operation from the purely parallel version on December 14. The time it takes is proportional to the number of bits set. the binary number . and works for 32-bit words.1). Eric later suggested the arbitrary bit-width generalization to the best method on November 17. while (v) { parity = !parity. 2006. 2005 by Andrew Shapira. v ^= v >> 4. // word value to compute the parity of v ^= v >> 16. changing the shifts to rotates would eliminate the need for the AND masks because the overcounted sum total could instead be right-shifted 3 at the end. Compute parity in parallel unsigned int v. The method above takes around 9 operations. 2005. 2006. return (0x6996 >> v) & 1. leaving the result in the lowest nibble of v. The method first shifts and XORs the eight nibbles of the 32bit value together.See Ian Ashdown's nice newsgroup post for more information on counting the number of bits set (also known as sideways addition). above. It may be optimized to work just on bytes in 5 operations by removing the two lines immediately following "unsigned int v. Compute parity of a byte using 64-bit multiply and modulus division unsigned char b. Next. and Don Clugston trimmed three more from it on December 30.

1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 0. 0. 0. v ^= v >> 8. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. and he received a $10 bug bounty. 0. 1. 0. 2006. 1. 1. 1. 0. Fabrice Bellard suggested the 32-bit variations above. 0. 1. 1. The result has the parity of v in bit 1. // byte value to compute the parity of bool parity = ParityTable[b]. 1. 0. 1. the previous version had four lookups (one per byte) and were slower. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. which require only one table lookup. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. That optimization shaves two operations off using only shifting and XORing to find the parity. 0. 2005. 0. 2002. 0. 1. 1. 0. 1. 0. 1. 0. 2005. 1. 0. Bruce Rawles found a typo in an instance of the table variable's name on September 27. 0. 0. 1. Thanks to Mathew Hendry for pointing out the shift-lookup idea at the end on Dec. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 0. bool parity = ParityTable[v & 0xff]. Byrant encouraged the addition of the (admittedly) obvious last variation with variable p on May 3. 0. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. // OR. 0. 0. 0. 0. 0. }. 0. ParityTable[] = 0. 0. 1. On October 9. 1. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. This number is like a miniature 16-bit parity-table indexed by the low four bits in v. parity = ParityTable[p[0] ^ p[1] ^ p[2] ^ p[3]]. 0. 0. 15. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0.0110 1001 1001 0110 (0x6996 in hex) is shifted to the right by the value represented in the lowest nibble of v. 0. 0. 1. 0. 1. 0. Compute parity by lookup table const bool { 0. 1. 0. 0. 1. 1. 1. 1. Randal E. 0. Swapping values with XOR . 1. 1. 1. 0. 1. 1. 0. for 32-bit words: v ^= v >> 16. 1. 1. 1. // Variation: unsigned char * p = (unsigned char *) &v. 0. which is masked and returned. 0. 0. 1. 1. 0. 0 unsigned char b. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0.
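The parallel parity method packaged as a function, for reference (a sketch; the name parity32 is only for this example), returning 1 when v has an odd number of set bits:

#include <assert.h>

static unsigned int parity32(unsigned int v)
{
  v ^= v >> 16;
  v ^= v >> 8;
  v ^= v >> 4;
  v &= 0xf;
  return (0x6996 >> v) & 1;   // 0x6996 acts as a 16-entry parity table held in a word
}

int main(void)
{
  assert(parity32(0) == 0);
  assert(parity32(1) == 1);
  assert(parity32(0x7) == 1);          // three bits set
  assert(parity32(0xFFFFFFFFu) == 0);  // 32 bits set
  return 0;
}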

Reverse bits the obvious way unsigned int v. . // shift when v's highest bits are zero On October 15. As an example of swapping ranges of bits suppose we have have b = 00101111 (expressed in binary) and we want to swap the n = 3 consecutive bits starting at i = 1 (the second bit from the right) with the 3 consecutive bits starting at j = 5. such as SWAP(a[i]. 2005. Michael Hoisie pointed out a bug in the original version. b) (((a) ^= (b)). first get LSB of v int s = sizeof(v) * CHAR_BIT . So if that may occur. s--. Behdad Esfabod suggested a slight change that eliminated one iteration of the loop on May 18. This method of swapping is similar to the general purpose XOR swap trick. // r will be reversed bits of v. 2004. // extra shift needed at end for (v >>= 1. the result is undefined if the sequences overlap. Swapping individual bits with XOR unsigned unsigned unsigned unsigned int int int int i. // positions of bit sequences to swap n. a[j]) with i == j. r |= v & 1. // number of consecutive bits in each sequence b. Iain A. ((a) ^= (b)))). The variable x stores the result of XORing the pairs of bit values we want to swap. // XOR temporary r = b ^ ((x << i) | (x << j)).1). Randal E.#define SWAP(a. ((b) ^= (a)). v >>= 1) { r <<= 1. // bit-swapped result goes here int x = ((b >> i) ^ (b >> j)) & ((1 << n) . } r <<= s. v. 2005. but intended for operating on individual bits. 2005. the result would be r = 11100011 (binary). Byrant suggested removing an extra operation on May 3. // input bits to be reversed unsigned int r = v. consider defining the macro as (((a) == (b)) || (((a) ^= (b)). ((a) ^= (b))) This is an old trick to exchange the values of the variables a and b without using extra space for a temporary variable. Of course. ((b) ^= (a)). On January 20. and then the bits are set to the result of themselves XORed with x. j. Fleming pointed out that the macro above doesn't work when you swap with the same memory location. // bits to swap reside in b r.1.
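To make the bit-range swap example above concrete, here is a tiny check (the constants are taken from the worked example in the text) that plugs b = 00101111, n = 3, i = 1, j = 5 into the expressions and verifies the stated result of 11100011:

#include <assert.h>

int main(void)
{
  unsigned int b = 0x2F;   // 00101111 in binary
  unsigned int i = 1, j = 5, n = 3;

  unsigned int x = ((b >> i) ^ (b >> j)) & ((1 << n) - 1);  // XOR of the two bit fields
  unsigned int r = b ^ ((x << i) | (x << j));               // flip both fields by x

  assert(r == 0xE3);       // 11100011 in binary, as stated above
  return 0;
}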

0xFF }. 0x6D. 0xA0. 0x29. 0x9A. 0x88. 0x32. 0x54. 8 bits at time unsigned int c. 0x2F. 0x61. 0xC3. 0xB3. 0x55. 0x04. 0x3F. 0x80. 0xA7. 0x12. 0x1C. 0xC8. 0xA2. 0x56. 0x60. 0x33. 0xAB. 0x42. 0x0B. 0x0E. 0x11. 0x98. 0x5D. 0x2C. 0x4C. 0x01. 0xF1. 0xE2. 0x22. 0x1B. 0x5E. 0x71. 0xE8. 0xFD. 0x7C. 0xF8. 0x6C. 0x2D. 0xEF. 0xAE. 0x96. 0xC1. 0xAA. 0xCD. 0x35. 0x75. 0x97. 0x48. 0xDA. 0x2B. so rather than iterating over all bits it stops early. 0xE6. 2007. 0x90. 0xBC. 0x0F. 0x94. 0xBD. 0xD3. 0x30. 0xE7. 0xF9. 0x27. 0x10. 0xBF. Reverse bits in word by lookup table const unsigned char BitReverseTable256[] = { 0x00. 0xD5. 0x82. 0x0D. 0xA6. 0xA3. 0xFE. 0x79. 0x58. 0x7D. 0x3D. 0xE3. 0xF3. 0xA8. Liyong Zhou suggested a better version that loops while v is not 0. 0x46. 0xE0. 0xDC. 0x8A. 0x05. 0x91. 0xEB. 0xF7. 0xA4. 0x6A. 0xB7. on February 6. . 0x45. 0x6F. 0xE9. 0xD8. 0x1D. 0x40. 0xCE. 0x9E. 0x7A. 0xAD. 0x49. 0x67. 0xCA. 0x4D. // c will get v reversed // Option 1: c = (BitReverseTable256[v & 0xff] << 24) | (BitReverseTable256[(v >> 8) & 0xff] << 16) | (BitReverseTable256[(v >> 16) & 0xff] << 8) | (BitReverseTable256[(v >> 24) & 0xff]). 0x4B. 0xFA. 0xC2. 0x50. 0x4E. 0x07. 0x8F. 0x4A. 0xBB. 0x77. 0x5C. 0x3E. 0x8E. 0x3A. 0xFC. 0x16. 0x7E. 0xDD. 0x13. 0x4F. 0xB6. 0xE5. unsigned char * q = (unsigned char *) &c. 0xDE. 0xB8. 0x53. 0xB9. 0x08. 0xD0. 0x9B. unsigned int v. 0x43. 0x1E. 0xF4. 0x8B. 0x62. 0x03. 0xC4. 0x92. 0x5B. 0x24. 0xD2. 0x38. 0x74. 0xE1. 0x26. 0xDB. 0xAF. 0x59. 0xCB. 0x76. 0xE4. 0x83. 0x8D. 0x5F. // Option 2: unsigned char * p = (unsigned char *) &v. 0xEE. 0x87. 0xBA. 0xAC. 0x6B. 0xD4. 0x5A. 0x89. 0xF6. 0x25. 0x19. 0x72. 0xB5. 0xCC. 0x0A. 0x85. 0xF2. 0xEA. 0x23. 0x70. 0x51. 0x93. 0x73. 0x2A. 0xC6. 0x21. 0x7B. 0xED. 0x18. 0xB0. 0xB2. 0xA1. 0x78. 0xB4. 0x52. 0x68. 0xC9. 0x84. 0xA5. 0x66. 0x31. 0x95. 0xCF. 0x06. 0xF5. 0xD1. 0x28. 0x20. 0x99. 0x65. 0x7F. 0x14. 0xD7. 0x6E. 0x41. 0x0C. 0xB1. 0x86. 0x02. 0xC0. 0xD6. 0xC5. 0xC7. 0x3B. 0x15. 0xDF. 0x3C. 0x47.Then. 0x9C. 0x8C. 0xFB. 0x63. 0xD9. 0x44. 0x81. 0x09. 0x37. 0x34. 0x57. 0xEC. 0x36. 0x9F. 0x64. 0xF0. 0x39. 0xA9. 0x2E. 0x69. 0x17. 0x1A. // reverse 32-bit value. 0x1F. 0x9D. 0xBE.

BitReverseTable256[p[3]]. // reverse this byte b = ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32. The following shows the flow of the bit values with the boolean variables a. The last step. This method was attributed to Rich Schroeppel in the Programming Hacks section of Beeler. M. The AND operation selects the bits that are in the correct (reversed) positions. The first method takes about 17 operations. which comprise an 8-bit byte. has the effect of merging together each set of 10 bits (from positions 0-9. MIT AI Memo 239. which involves modulus division by 2^10 . e. 1972. d.. f..q[3] q[2] q[1] q[0] = = = = BitReverseTable256[p[0]]. HAKMEM.. and h. The multiply operation creates five separate copies of the 8-bit byte pattern to fan-out into a 64-bit value. . assuming your CPU can load and store bytes easily. The multiply and the AND operations copy the bits from the original byte so they each appear in only one of the 10-bit sets.. R. 29. and the second takes about 12. Reverse the bits in a byte with 4 operations (64-bit multiply. BitReverseTable256[p[1]]. g. Notice how the first multiply fans out the bit pattern to multiple copies. so the addition steps underlying the modulus division behave like or operations. while the last multiply combines them in the fifth byte from the right. The reversed positions of the bits from the original byte coincide with their relative positions within any 10-bit set. c.) in the 64-bit value. b. abcd efgh (-> hgfe dcba) * 1000 0000 0010 0000 0000 1000 0000 0010 (0x80200802) -----------------------------------------------------------------------------------------------0abc defg h00a bcde fgh0 0abc defg h00a bcde fgh0 & 0000 1000 1000 0100 0100 0010 0010 0001 0001 0000 (0x0884422110) . 10-19.1. Gosper. Reverse the bits in a byte with 3 operations (64-bit multiply and modulus division): unsigned char b. // reverse this (8-bit) byte b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023. W. BitReverseTable256[p[2]]. They do not overlap. relative to each 10-bit groups of bits. R. 20-29. no division): unsigned char b. Feb. and Schroeppel.
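As a quick sanity check of the 4-operation, multiply-based byte reversal (the wrapper names reverse8_mul and reverse8_slow are only for this example), the following verifies it against an obvious bit-at-a-time reversal for all 256 byte values:

#include <assert.h>
#include <stdint.h>

// The 4-operation, 64-bit multiply version from above, wrapped in a function.
static unsigned char reverse8_mul(unsigned char b)
{
  return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}

// Obvious bit-at-a-time reversal, used only as a reference.
static unsigned char reverse8_slow(unsigned char b)
{
  unsigned char r = 0;
  for (int i = 0; i < 8; i++)
  {
    r = (r << 1) | ((b >> i) & 1);
  }
  return r;
}

int main(void)
{
  for (unsigned int v = 0; v < 256; v++)
  {
    assert(reverse8_mul((unsigned char)v) == reverse8_slow((unsigned char)v));
  }
  return 0;
}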

Reverse the bits in a byte with 7 operations (no 64-bit): b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16 Devised by Sean Anderson. Thus.-----------------------------------------------------------------------------------------------0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 * 0000 0001 0000 0001 0000 0001 0000 0001 0000 0001 (0x0101010101) -----------------------------------------------------------------------------------------------0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 -----------------------------------------------------------------------------------------------0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba hgfe 0cba 0gfe 00ba 00fe 000a 000e 0000 >> 32 -----------------------------------------------------------------------------------------------0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba & 1111 1111 -----------------------------------------------------------------------------------------------hgfe dcba Note that the last two steps can be combined on some processors because the registers can be accessed as bytes. January 3. Reverse an N-bit quantity in parallel in 5 * lg(N) operations: unsigned int v. 2002. Devised by Sean Anderson. Typo spotted and correction supplied by Mike Keith. . 2001. 2001. July 13. July 13. it may take only 6 operations. just multiply so that a register stores the upper 32 bits of the result and the take the low byte. // 32 bit word to reverse bit order // swap odd and even bits v = ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1).

4. // const unsigned int d = (1 << s) .. . but it was included for the sake of completeness. . } } numerator s > 0 so d is either 1. 15. The second variation was suggested by Ken Raeburn on September 13. >> 4) & 0x0F0F0F0F) bytes >> 8) & 0x00FF00FF) 2-byte long pairs >> 16 ) | ((v & 0x33333333) << 2). 8. n % d goes here. 2005. n.1 without a division operator unsigned int n..// swap v = ((v // swap v = ((v // swap v = ((v // swap v = ( v consecutive pairs >> 2) & 0x33333333) nibbles . | ( v << 16).1. 7. unsigned int m. 2006. const unsigned int d = 1 << s. 32. n > d. // const unsigned int s. 31. however it requires more operations to reverse v. The following variation is also O(lg(N)). n = m) { for (m = 0.. n >>= s) { m += n & d. // numerator const unsigned int s. } These methods above are best suited to situations where N is large. // So d will be one of: 1.. 2. 16. | ((v & 0x0F0F0F0F) << 4). Its virtue is in taking less slightly memory by computing the constants on the fly. v = ((v >> s) & mask) | ((v << s) & ~mask). unsigned int s = sizeof(v) * CHAR_BIT. Compute modulus division by (1 << s) . while ((s >>= 1) > 0) { mask ^= (mask << s). Veldmeijer mentioned that the first version could do without ANDS in the last line on March 19. Dobb's Journal 1983. // . Compute modulus division by 1 << s without a division operator const unsigned int n. unsigned int m..). // for (m = n. Most programmers learn this trick early. | ((v & 0x00FF00FF) << 8).. must be power of 2 unsigned int mask = ~0. Edwin Freed's article on Binary Magic Numbers for more information. // m will be n % d m = n & (d . 3. See Dr.1). // bit size.

where N is the number of bits in the numerator. 5. 6. 23. {23. {15. 17}. 8. 19}. 7. Before Sean A. 0x1fffffff. 21. 2001. 4. 26. m = m == d ? 0 : m. 0x0f0f0f0f. 18. {10. 24. 6. 29. Irvine corrected me on June 17. 1}. In other words. 22. 0xf01fc07f. {26. but since with modulus division // we want m to be 0 when it is d. 28. 0x00ffffff. 7. {15. 20. {12. 27}. 17. 25}. 8. . 8. {28. 26}.1. 15}. 24. 21. Compute modulus division by (1 << s) . {18. 20. 13. 0x3fffffff. 0x55555555. 28. 6. Michael Miller spotted a typo in the code April 25. 28}. 8. 11}. 15. 0x7fffffff int Q[][6] = 0. I mistakenly commented that we could alternatively assign m = ((m + 1) & d) . {29. 5. 14}. 7. 0}. 18. 4. 0x00ff00ff. {16. 16. 0x0fffffff. 7}. 24. 0x01ffffff. 0x3ff003ff. 19. { 9. 0xc71c71c7. 0x0000ffff. 15. {19. 0x0003ffff. 24}. 11. 23}. 0x001fffff. at the end. 0xc1f07c1f. 29. 3}. 12. 22. 0. 22. const unsigned { { 0. 25. 13. 25. 13. 19. 28. {22. 12}. 23. 22. 13. 17. 27.// Now m is a value from 0 to d. {13. {15. {16. 14. 12. {16. 26. 0. This method of modulus division by an integer that is one less than a power of 2 takes at most 5 + (4 + 5 * ceil(N / s)) * ceil(lg(N / s)) operations. 20}. 11. 0x03ffffff. 2. 7. 26. 10}.1 in parallel without a division operator // The following is for a word size of 32 bits! const unsigned int M[] = { 0x00000000. 14. 9}. 0x000fffff. 0xc0007fff. 5. 6}. 20. 4}. 18}. }. 0. 4. {21. 2004. 0x0007ffff. 0x007fffff. 10. 17. 4. 16. 11. 9. 0x07fc01ff. Devised by Sean Anderson. 28. 2. 8. 25. 16. 22}. 12. 23. {12. {27. {20. 27. {11. 19. 15. 0x0001ffff. 10. 3. 29. 26. 0x3f03f03f. 10. 0xf0003fff. 8}. 24. it takes at most O(N * lg(N)) time. 17. {17. {14. August 15. 9. {14. 16}. 27. 4. {16. 9. 2. 8. 21. 11. 3. 18. 18. 0x07ffffff. 25. 3. 14. {16. 13}. 10. 0xffc007ff. 20. 6. 16. 2005. 8. 27. 12. 15. 29}. 21}. 0x003fffff. 5}. 9. 29. 1. 0x33333333. 2}. 14. 5. {25. 21. 23. 0xfc001fff. 6 . 0xff000fff. 19. {24.

0x0000001f. {0x0000ffff. 0x0001ffff. 0x00000000. 0x0000007f. 0x00007fff. 0x00000000. 0x000003ff. 0x000001ff. {0x007fffff. 0x0000001f. 0x001fffff. {0x000fffff. {31. 30}. {0x0000ffff. 0x00000003. 0x0000003f. 0x00000003. 0x003fffff. 0x01ffffff. 0x0001ffff. 0x00ffffff. 0x00003fff. 0x001fffff. 0x0003ffff. {0x00001fff. 0x0000ffff. 0x0003ffff}. 0x007fffff. 0x000001ff. 0x0000000f. 0x003fffff. 0x00000fff. 0x00000fff. 0x0007ffff. 0x0000ffff. 0x00ffffff. 0x01ffffff}. 0x00ffffff}. {0x00007fff. 0x000000ff. . 0x000000ff}. {0x0000ffff. {0x0003ffff. 0x000001ff}. 0x00000003}. {0x00007fff. 0x00007fff. 0x0000001f. {0x000007ff. 0x0007ffff. 0x0007ffff}. {0x00007fff. 0x0001ffff. {0x0007ffff. 0x0003ffff. 0x0007ffff. 0x0007ffff. 0x0000007f. 0x000003ff. 31. {0x00003fff. 0x00003fff}. 0x003fffff}. 0x00000007. 31} }. 0x0000000f. 0x00000007. 30. {0x00000fff. 0x0003ffff. 0x007fffff. 0x0000000f. int R[][6] = 0x00000000. 0x00001fff. 0x00000003. 0x00000000. 0x00ffffff. 0x00000007. {0x0000ffff. 0x000007ff. 0x0000007f. const unsigned { {0x00000000. 0x00007fff}. 0x000001ff. 0x00ffffff. 0x0001ffff. 0x007fffff}. 0x000003ff}. 0x00007fff. 0x00000007}. 0x000000ff. 30. 0x000003ff. 0x00000001. 0x00001fff}. 0x0000003f. 0x0003ffff. 0x001fffff. 0x0000003f. 0x000000ff. 0x0000003f}. 0x000fffff. 0x000007ff. 0x01ffffff. 0x00000001}. {0x0000ffff. {0x001fffff. 0x01ffffff. 0x003fffff. 0x000000ff. 0x000fffff}. 0x00000fff. {0x00003fff. 0x0001ffff}. 0x0000007f}. {0x000001ff. {0x0001ffff. 0x000007ff. 30. 0x0000ffff}. 0x007fffff. 0x000fffff. 0x00003fff. 0x0000ffff. 0x00007fff. 31. 0x0000000f. 31. 0x01ffffff. 0x000000ff. 0x003fffff. 0x00000000}. 0x0000003f. 0x000001ff. 0x0000001f}. 0x0000ffff. {0x000003ff. 0x0000003f. {0x00000fff. 0x0000000f. 0x000fffff. 0x007fffff. {0x00ffffff. {0x01ffffff. 0x000007ff. 0x00000fff. 0x001fffff}. 0x000000ff. 0x0000001f. 0x00001fff. 0x0000007f. 0x00003fff.{30. 0x0000000f}. 0x00001fff. 0x00000fff}. 0x000007ff}. 30. 0x000003ff. 0x00001fff. 0x00003fff. 0x001fffff. 0x000fffff. 0x000000ff. {0x003fffff. 31.

Imagine that the result is written on a piece of paper. and Don Knuth corrected me on April 19. 0x3fffffff. This method of finding modulus division by an integer that is one less than a power of 2 takes at most O(lg(N)) time.1. 0x7fffffff. I mistakenly commented that we could alternatively assign m = ((m + 1) & d) . {0x07ffffff. 0x03ffffff. 0x3fffffff. // s > 0 const unsigned int d = (1 << s) . until you cannot cut further. . 2005 (after pasting the code. 0x03ffffff. m = (n & M[s]) + ((n & ~M[s]) >> s). 0x0fffffff. 15. } m = m == d ? 0 : m. 0x0fffffff. less portably: m = m & -((signed)(m . August 20. so that half the values are on each cut piece. 0x07ffffff. at the end. for the code above). where N is the number of bits in the numerator (32 bits. 0x03ffffff}.d) >> s). q++. // numerator const unsigned int s. A typo was spotted by Randy E. 0x07ffffff. As in the previous hack.. .. // so d is either 1. 0x3fffffff}.d) >> s). 0x0fffffff. * r = &R[s][0]. 0x07ffffff. {0x1fffffff. I had later added "unsinged" to a variable declaration). 0x03ffffff. 0x03ffffff. r++) { m = (m >> *q) + (m & *r). Bryant on May 3. It may be easily extended to more bits. m > d.1. 0x7fffffff. First every other base (1 << s) value is added to the previous one. {0x7fffffff. // n % d goes here. 31. 0x0fffffff. 0x1fffffff}. 0x7fffffff. unsigned int m. 0x7fffffff. 2001. {0x0fffffff. for (const unsigned int * q = &Q[s][0]. // OR. 0x07ffffff. 0x07ffffff}. 0x7fffffff} }. 2006 and suggested m = m & -((signed)(m . After performing lg(N/s/2) cuts. Cut the paper in half. while there are at least two s-bit values.). 0x1fffffff. Align the values and sum them onto a new piece of paper. 0x0fffffff}. Devised by Sean Anderson. The tables may be removed if you know the denominator at compile time. 3.{0x03ffffff. The number of operations is at most 12 + 9 * ceil(lg(N)). 0x3fffffff. we cut no more. It finds the result by summing the values in base (1 << s) in parallel. just extract the few relevent entries and unroll the loop. {0x3fffffff. unsigned int n. 0x1fffffff. 0x1fffffff. 0x3fffffff. Repeat by cutting this paper in half (which will be a quarter of the size of the previous one) and summing. just continue to add the values and put the result onto a new piece of paper as before. 7. 0x1fffffff.
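A direct rendering of this digit-summing idea as a standalone function (the name mod_pow2_minus1 is only for this example): to compute n % d where d = (1 << s) - 1, repeatedly add the base-(1 << s) digits of n until only one digit remains, then fix up the case where the sum equals d.

#include <assert.h>

static unsigned int mod_pow2_minus1(unsigned int n, unsigned int s)
{
  const unsigned int d = (1u << s) - 1;
  unsigned int m;
  for (m = n; n > d; n = m)
  {
    for (m = 0; n; n >>= s)
    {
      m += n & d;     // add the next base-(1 << s) digit
    }
  }
  return m == d ? 0 : m;
}

int main(void)
{
  for (unsigned int n = 0; n < 1000; n++)
  {
    assert(mod_pow2_minus1(n, 3) == n % 7);    // d = 7
    assert(mod_pow2_minus1(n, 5) == n % 31);   // d = 31
  }
  return 0;
}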

3. 6. } The log base 2 of an integer is the same as the position of the highest bit set (or most significant bit set. All that is left is shifting the exponent bits into position (20 bits right) and subtracting the bias. 6. 5. 5. 7. 5. The following log base 2 methods are faster than this one. The code above loads a 64-bit (IEEE-754 floating-point) double with a 32 bit integer by storing the integer in the mantissa while the exponent is set to 252. 6. 5.. 6. // 32-bit word to find the log base 2 of unsigned r = 0. 6. 7. 6. 5. 7. 7. 6. Evan Felix pointed out a typo on April 4. 7. 3. 7. 6. 6. 6. 6. 7. 6. 3. 5. 6. 6. 7. 5. 4. 6. 4. Find the integer log base 2 of an integer with an 64-bit IEEE float int v. 6. 7. From this newly minted double. 6. 6. 4. 7. 6. // result of log_2(v) goes here union { unsigned int u[2]. 6. 3.u[BYTE_ORDER==LITTLE_ENDIAN] >> 20) . 6. 6. 4. 6. 5. 5. 6. 5. 7. 6. but many CPUs are slow at manipulating doubles. 4. 6. 5. = 3. 3. 6. 6. 6. 6. 6. 5. 6.. 5. 7. 5. 2006. 7. // 32-bit integer to find the log base 2 of int r. 5. 7. double d. 2. 3. 7. 1. 6. 5. 4. 6. 5. 4. 4. 6. 6. 6. 6. 2006. 6. 5. 5. } t. 6. 7. 2. 6. 6. 7. 4. 4. 4. 0. 6. 6. 6. 7. 5. 6. 7. 6. 6. 6. 6. Find the log base 2 of an integer with a lookup table static const char LogTable256[] { 0. 5. 6. 5. 1. 6. t. 6.d -= 4503599627370496. Eric Cole sent me this on January 15.u[BYTE_ORDER!=LITTLE_ENDIAN] = v. 6. . 5. 4. 7. which sets the resulting exponent to the log base 2 of the input value. 7. 7. 4. // r will be lg(v) while (v >>= 1) // unroll for more speed. 5. 7. 4.Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way) unsigned int v. 6. 7. 5. 7. 6. r = (t. v. 6. This technique only takes 5 operations. 7. 7. 6. 252 (expressed as a double) is subtracted. 6.u[BYTE_ORDER==LITTLE_ENDIAN] = 0x43300000. 5.0. 5. 6. 4. 5. 7. 5. 5. 2. // temp t.0x3FF. 2. 6. 6. { r++. 7. MSB). 7. t. 5. 7. 0x3FF (which is 1023 decimal). and the endianess of the architecture must be accommodated. 6. 3. 7. 5. 6. 4.

7. 7. Using int table elements may be faster. 7. i--) // unroll for speed. 7. 7. 7. 7. 7. 7. 7. 7. 7. // result of log2(v) will go here for (i = 4. } else { r = (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v]. 7. 4. 7. 7. 7. 7.. 7. tt. 7. 7. 7. 7. 7. 2005. 7. 7. 7. 7. 7. 7. 7. // 32-bit word to find the log of unsigned r. 7. 0xFFFF0000}. 7. with the possible additions incorporated into each. 7. { if (v & b[i]) { v >>= S[i]. Another operation can be trimmed off by using four tables. i++) { LogTable256[i] = 1 + LogTable256[i / 2]. 7. 7. 7.7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. int i. 7. // 32-bit value to find the log2 of const unsigned int b[] = {0x2. 2006 by Emanuel Hoogeveen. register unsigned int r = 0. 7. 8. 0xF0. 7. } Behdad Esfahbod and I shaved off a fraction of an operation (on average) on May 18. 7. 7. 7. // temporaries if (tt = v >> 16) { r = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt]. 7. 7. 7. r |= S[i]. for (int i = 2. 7. If extended for 64-bit quantities. 7. 7. 7. 7. 7. }. } The lookup table method takes only about 7 operations to find the log of a 32-bit value. 7. 7. 7. 7. 7. i >= 0. 7. 7. 7. 7. 7. Yet another fraction of an operation was removed on November 14. 7. 7. 7. } } // OR (IF YOUR CPU BRANCHES SLOWLY): . const unsigned int S[] = {1. it would take roughly 9 operations. 0xFF00. 7. 7. depending on your architecture. 7. 7. 7. 2. 16}. 7. 7. 7. 7. 7. 7. // r will be lg(v) register unsigned int t. 7. i < 256. 7 unsigned int v. 7.. 0xC. Find the log base 2 of an N-bit integer in O(lg(N)) operations unsigned int v. // To initially generate the log table algorithmically: LogTable256[0] = LogTable256[1] = 0. 7. 7. 7. 7. 7. 7.

but not for the special-case version below it. shift. where v is a power of 2. 8. we would append another element.g. it's a good choice. but if you don't want big table or your architecture is slow to access memory.to 64-bit number. Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup unsigned int v. to b. 2006. 3. to extend the code to find the log of a 33. shift. // OR (IF YOU KNOW v IS A POWER OF 2): const unsigned int b[] = {0xAAAAAAAA. shift. shift. shift. 19. Ken Raeburn suggested improving the general case by using smaller numbers for b[]. 14. 0xFFFF0000}. shift. 26. 28. 17. then only one load instruction may be needed). i > 0. 1) + 1. Glenn Slayden brought this oversight to my attention on December 12. for (i = 4. 6. 1. 0xFF00FF00. 9 }. 2. and loop from 5 to 0. 25. 8. 15.register unsigned int shift. . but it is only suitable when the input is known to be a power of 2. These values work for the general version. 3. 18. 10. which load faster on some architectures (for instance if the word size is 16 bits. The second variation involves more operations. 29. 16. 5. 24. shift = ( ( v & 0xFFFF0000 ) shift = ( ( v & 0xFF00 ) shift = ( ( v & 0xF0 ) shift = ( ( v & 0xC ) shift = ( ( v & 0x2 ) != != != != != 0 0 0 0 0 ) ) ) ) ) << << << << << 4. 21. // find the log base 2 of 32-bit v int r. register unsigned int r = (v & b[0]) != 0. append 32 to S. 22. 2003. 31. 4. This method is much slower than the earlier table-lookup version. { r |= ((v & b[i]) != 0) << i. 2. 16. } Of course. 0xF0F0F0F0. 7. 13. 2002. 12. 2003. 1. i--) // unroll for speed. 0xCCCCCCCC. On May 25. 30. v v v v v v |= v |= v |= v |= v |= v = (v >> >> >> >> >> >> 1. shift. 23. // first round down to power of 2 2. 20. but it may be faster on machines with high branch costs (e.. PowerPC). 27. r r r r r |= |= |= |= |= shift. shift. The third variation was suggested to me by John Owens on April 24. it's faster. 0.. // result goes here static const int MultiplyDeBruijnBitPosition[32] = { 0. v v v v v >>= >>= >>= >>= >>= shift. 11. 0xFFFFFFFF00000000. and it was sent to me by Eric Cole on January 7. 4.

// (use a lg2 method from above) r = t . then only the last line is needed (3 operations). If v is known to be a power of 2. 2006 after reading about the entry below to round up to a power of 2 and the method below for computing the number of trailing bits with a multiply and lookup using a DeBruijn sequence. Finally. Eric Cole suggested I add a version of this on January 7. 1000000. 10000000. or 1233 followed by a right shift of 12. compared to (up to) 20 for the previous method. and shift). assuming 4 tables were used (one for each byte of v). Eric Cole devised this January 8. . the exact value is found by subtracting the result of v < PowersOf10[t]. 2006. // result goes here int t. since the value t is only an approximation that may be off by one. 1000. Doing so would require a total of only 9 operations to find the log base 10. but this offers a reasonable compromise between table size and speed. It requires only 15 operations. 100000000. pre-add. // find int(log2(v)).r = MultiplyDeBruijnBitPosition[(v * 0x077CB531UL) >> 27]. It may be sped up (on machines with fast memory access) by modifying the log base 2 table-lookup method above so that the entries hold what is computed for t (that is. t = (IntegerLogBase2(v) + 1) * 1233 >> 12. // non-zero 32-bit integer value to compute the log base 10 of int r.0 && finite(v) && isnormal(v) int c. where v > 0. 10. The integer log base 10 is computed by first using one of the techniques above for finding the log base 2. Find integer log base 2 of a 32-bit IEEE float const float v. we need to multiply it by 1/log2(10). 100000. // 32-bit int c gets the result. Adding one is needed because the IntegerLogBase2 rounds down. By the relationship log10(v) = log2(v) / log2(10). 10000.(v < PowersOf10[t]). -mulitply. The purely table-based method requires the fewest operations. The code above computes the log base 2 of a 32-bit integer with a small table lookup and multiply. Find integer log base 10 of an integer unsigned int v. 1000000000}. This method takes 6 more operations than IntegerLogBase2. which is approximately 1233/4096. 100. // temporary static unsigned int const PowersOf10[] = {1.
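A runnable sketch of the integer log base 10 method (the helper names here are only for this example; a simple shift loop stands in for the faster log base 2 methods on this page):

#include <assert.h>

static unsigned int IntegerLogBase2(unsigned int v)
{
  unsigned int r = 0;
  while (v >>= 1)
  {
    r++;
  }
  return r;
}

static unsigned int IntegerLogBase10(unsigned int v)   // requires v > 0
{
  static unsigned int const PowersOf10[] =
      {1, 10, 100, 1000, 10000, 100000,
       1000000, 10000000, 100000000, 1000000000};
  unsigned int t = (IntegerLogBase2(v) + 1) * 1233 >> 12;  // approximately log2(v) / log2(10)
  return t - (v < PowersOf10[t]);                          // correct the possible off-by-one
}

int main(void)
{
  assert(IntegerLogBase10(1) == 0);
  assert(IntegerLogBase10(9) == 0);
  assert(IntegerLogBase10(10) == 1);
  assert(IntegerLogBase10(99) == 1);
  assert(IntegerLogBase10(1000000000u) == 9);
  assert(IntegerLogBase10(4294967295u) == 9);
  return 0;
}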

133. const float v. } } // find int(log2(v)). Falk Hüffner pointed out that ISO C99 6. He proposed using memcpy for maximum portability or a union with a float and an int for better code generation than memcpy on some compilers. and the mantissa is not normalized. &v. To accomodate for subnormal numbers. for portability: memcpy(&c. // where isnormal(v) and v > 0 int c. / pow(2. c = x >> 23.0x3f800000) >> r) + 0x3f800000) >> 23) . register unsigned int t. On June 11.141 : LogTable256[x] . if (c) { c -= 127. // OR. These have the exponent bits set to zero (signifying pow(2. c). On June 20. finite(v) int c. where v > 0. &v.9% of C compilers.c = *(const int *) &v. Find integer log base 2 of the pow(2. &v.0 && // 32-bit int c gets the result.5/7 specified undefined behavior for the common type punning idiom *(int *)&. } else { // subnormal. for portability: memcpy(&x.127. 1. // OR. so it contains leading zeros and thus the log2 must be computed from the mantissa. sizeof x). r)))). Irvine suggested that I include code to handle subnormal numbers. // OR. c = *(const int *) &v. . c = ((((c . 2004. 2005.149. Sean A. } else { c = (t = x >> 8) ? LogTable256[t] . c = (c >> 23) . // 32-bit int c gets the result. use the following: const float v. sizeof c). // find int(log2(pow((double) v. int x = *(const int *) &v. r)-root of a 32-bit IEEE float (for unsigned integer r) const int r.149. though it has worked on 99. sizeof The above is fast. but IEEE 754-compliant architectures utilize subnormal (also called denormal) floating point numbers. so recompute using mantissa: c = intlog2(x) .127.-127)). // temporary // Note that LogTable256 was defined earlier if (t = x >> 16) { c = LogTable256[t] . for portability: memcpy(&c.

Falk Hüffner pointed out that ISO C99 6. // so if v is 1101000 (base 2). then we have c = int(log2(sqrt((double) v))). // 32-bit word input to count zero bits on right unsigned int c = 32. } } if (v) { c--. const unsigned int S[] = {1. // 32-bit word input to count zero bits on right unsigned int c = 0. If r is 2. // c will be the number of zero bits on the right. i >= 0. 2. we shift v up rather than down. We also have the additional step at the end. The number of operations is at most 4 * lg(N) + 2. 1. If r is 1. 0x00FF00FF. but the values of b are inverted (in order to count from the right rather than the left). On June 11. // Our Magic Binary Numbers for (int i = 4. and c starts at the maximum and is decreased. } Here. // so if v is 1101000 (base 2). roughly. . Count the consecutive zero bits (trailing) on the right by binary search unsigned int v. then c will be 3 const unsigned int B[] = {0x55555555. and he suggested using memcpy. if r is 0. then c will be 3 // NOTE: if 0 == v.5/7 left the type punning idiom *(int *)& undefined. if (v & 0x1) { // special case for odd v (assumed to happen half of the time) } else { if ((v & 0xffff) == 0) { v >>= 16. 0x33333333. then c = 31.So. // c will be the number of zero bits on the right. we are basically doing the same operations as finding the log base 2 in parallel. --i) // unroll for more speed { if (v & B[i]) { v <<= S[i]. 4. 8. 2005. then we have c = int(log2(pow((double) v. 0x0F0F0F0F. c -= S[i]. 0x0000FFFF}./4))). decrementing c if there is anything left in v. we have c = int(log2((double) v)). Count the consecutive zero bits (trailing) on the right in parallel unsigned int v. 16}. for example. for N bit words.

c += 4. it checks if the bottom 16 bits of v are zeros. which reduces the number of bits in v to consider by half. c += 8. Each of the subsequent conditional steps likewise halves the number of bits until there is only 1. Although this only takes about 6 operations. 2006. This method is faster than the last one (by about 33%) because the bodies of the if statements are executed less often. but it computes the number of trailing zeros by accumulating c in a manner akin to binary search. In the first step. the time to convert an integer to a float can be high on some machines. shifts v right 16 bits and adds 16 to c. then the result is -127. } if ((v & 0xf) == 0) { v >>= 4. } if ((v & 0x3) == 0) { v >>= 2. // cast the least significant bit in v to a float r = (*(unsigned int *)&f >> 23) . and the bias is subtracted to give the position of the least significant 1 bit set in v.c += 16. Count the consecutive zero bits (trailing) on the right with modulus division and lookup unsigned int v. The exponent of the 32-bit IEEE floating point representation is shifted down. // find the number of trailing zeros in v int r. Count the consecutive zero bits (trailing) on the right by casting to a float unsigned int v. } } The code above is similar to the previous method.0x7f. // find the number of trailing zeros in v . and if so. c += 2. Matt Whitlock suggested this on January 25. } if ((v & 0x1) == 0) { c++. If v is zero. } if ((v & 0xff) == 0) { v >>= 8. // the result goes here float f = (float)(v & -v).

26. 0. 1. 29. 12. 2. 13. 10. 17. 22. 18. 3. 30. 28. 7. r = Mod37BitPosition[(-v & v) % 37]. When there are no bits set. r = MultiplyDeBruijnBitPosition[((v & -v) * 0x077CB531UL) >> 27]. 27.int r. 8. More information can be found by reading the paper Using de Bruijn Sequences to Index 1 in a Computer Word by Charles E. 27. 15. 7. It makes use of the fact that the first 32 bit position values are relatively prime with 37. 9 }. 24. 2005 Andrew Shapira suggested I add this. 0. it returns 0. Leiserson. 5. // Round this 32-bit value to the next highest power of 2 unsigned int r. On October 8. and Keith H. These numbers may then be mapped to the number of zeros using a small lookup table. which produces a unique pattern of bits into the high 5 bits for each possible bit position that it is multiplied against. 10. The expression (v & -v) extracts the least significant 1 bit from v. 14. 13. // find the number of trailing zeros in 32-bit v int r. 24. (So v=3 -> r=4. // Put the result here. 6. It uses only 4 operations. 18 }. 21. I came up with this independently and then searched for a subsequence of the table values. 22. 20. 14. 0. // result goes here const int MultiplyDeBruijnBitPosition[32] = { 0. 0. Count the consecutive zero bits (trailing) on the right with multiply and lookup unsigned int v. 31. v=8 -> r=8) . however indexing into a table and performing modulus division may make it unsuitable for some situations. Round up to the next highest power of 2 by float casting unsigned int const v. It requires one more operation than the earlier one involving modulus division. Randall. 23. 20. 17. 16. 19. 19. 2. 29. 31. 0. 23. so performing a modulus division with 37 gives a unique number from 0 to 36 for each. 1. 16. 15. 11. and found it was invented earlier by Reiser. 25. 5. 8. 4. The constant 0x077CB531UL is a de Bruijn sequence. so binary 0100 would produce 2. 6. 12. 9. 26. 21. 28. according to Hacker's Delight. Converting bit vectors to indices of set bits is an example use for this. 3. // put the result in r const int Mod37BitPosition[] = // maps a bit value mod 37 to its position { 32. 25. but the multiply may be faster. 30. 4. The code above finds the number of zeros that are trailing on the right. Harald Prokof. 11.

but in some situations. // compute the next highest power of 2 of 32-bit v v--. but works on all v <= (1<<31). The result may be expressed by the formula 1 << (lg(v .0x7f).126). 2. 8. (On a Athlon™ XP 2100+ I've found the above shift-left and then OR code is as fast as using a single BSR assembly language instruction.if (v > 1) { float f = (float)v. It would be faster by 2 operations to use the formula and the log base 2 methed that uses a lookup table. If the original . 2005 Andi Smithers suggested I include a technique for casting to floats to find the lg of a number for rounding up to a power of 2. } The code above uses 8 operations. Some CPUs will fare better with it. it returns 0. Note that in the edge case where v is 0. which scans in reverse to find the highest set bit. which results in carries that set all of the lower bits to 0 and one bit beyond the highest set bit to 1. } else { r = 1. it is roughly three times slower than the technique below (which involves 12 operations) when benchmarked on an Athlon™ XP 2100+ CPU. On September 27. so the above code may be best. Similar to the quick and dirty version here. though. but it used one more operation. you might append the expression v += (v == 0) to remedy this if it matters.1) + 1). his version worked with values less than (1<<25). r = t << (t < v).) It works by copying the highest set bit to all of the lower bits. v v v v v >> >> >> >> >> 1. Round up to the next highest power of 2 unsigned int v. r = 1 << ((*(unsigned int*)(&f) >> 23) . 16. which isn't a power of 2. and then adding one. Quick and dirty version. v |= v |= v |= v |= v |= v++. In 12 operations. this code computes the next highest power of 2 for a 32-bit integer. for domain of 1 < v < (1<<25): float f = (float)(v . due to mantissa rounding. unsigned int const t = 1 << ((*(unsigned int *)&f >> 23) . Although the quick and dirty version only uses around 6 operations. lookup tables are not suitable.1). 4.

Interleave bits the obvious way

unsigned short x;   // Interleave bits of x and y, so that all of the
unsigned short y;   // bits of x are in the even positions and y in the odd;
unsigned int z = 0; // z gets the resulting Morton Number.

for (int i = 0; i < sizeof(x) * CHAR_BIT; i++) // unroll for more speed...
{
  z |= (x & 1 << i) << i | (y & 1 << i) << (i + 1);
}

Interleaved bits (aka Morton numbers) are useful for linearizing 2D integer coordinates, so x and y are combined into a single number that can be compared easily and has the property that a number is usually close to another if their x and y values are close.
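The obvious loop also makes a handy reference when checking the faster variants that follow; the sketch below wraps it in a function (the name interleave_obvious and the sample values are mine, not the document's) and prints one interleaved pair.

#include <stdio.h>
#include <limits.h>

/* Reference Morton-number routine built from the obvious loop above;
   the wrapper function and its name are illustrative. */
static unsigned int interleave_obvious(unsigned short x, unsigned short y)
{
  unsigned int z = 0;
  for (unsigned int i = 0; i < sizeof(x) * CHAR_BIT; i++)
    z |= (x & 1U << i) << i | (y & 1U << i) << (i + 1);
  return z;
}

int main(void)
{
  /* x = 0x0003 (binary 0011), y = 0x0001 (binary 0001): bits of x land in
     the even positions and bits of y in the odd positions. */
  printf("0x%08x\n", interleave_obvious(0x0003, 0x0001)); // prints 0x00000007
  return 0;
}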

0x5411. }. 0x5114. 0x5111. but almost doubling the memory required. 0x5115. 0x5440. 0x0F0F0F0F. 0x5500. 0x5454. 0x00FF00FF}. unsigned short z. // z gets the resulting 16-bit Morton Number. // Interleave bits of (8-bit) x and y. 0x5450. 0x5151. 0x5044. so that all of the unsigned short y. 0x5150. 0x5515. 0x5055. with two of them pre-shifted by 16 to the left of the previous two. as in the other versions). thus reducing the operations by two. 0x5455. 0x5441. 0x5504. unsigned char x. 0x5404. This second table could then be used for the y lookups. unsigned int z. 0x5544. but many of the operations are 64-bit multiplies so it isn't appropriate for all machines. 2. 0x5054. 0x5550. For more speed. 0x5110. four tables could be used. // bits of x are in the even positions and y in the odd. 0x5551. 0x5144. Interleave bits with 64-bit multiply In 11 operations. z = ((x * 0x0101010101010101ULL & 0x0102040810204081ULL >> 49) ((y * 0x0101010101010101ULL & 0x0102040810204081ULL >> 48) 0x8040201008040201ULL) * & 0x5555 | 0x8040201008040201ULL) * & 0xAAAA. this version interleaves bits of two bytes (rather than shorts. 0x5051. 0x33333333. Interleave bits by Binary Magic Numbers const unsigned int B[] = {0x55555555. 0x5401. 0x5514. 0x5045.0x5040. 0x5451. 0x5405. 0x5501. 0x5105. 0x5414. 0x5141. 4. so that all of the unsigned char y. 0x5510. z = MortonTable256[y MortonTable256[x MortonTable256[y MortonTable256[x >> 8] << 17 | >> 8] << 16 | & 0xFF] << 1 | & 0xFF]. 0x5050. 0x5445. 0x5154. // z gets the resulting 32-bit Morton Number. 8}. 0x5545. 0x5100. 0x5041. 0x5400. 0x5554. 2004 after reading the multiply-based bit reversals here. 0x5155. 0x5444. 0x5511. 0x5140. 0x5505. 0x5541. Holger Bettag was inspired to suggest this technique on October 10. 0x5410. so that we would only need 11 operations total. use an additional table with values that are MortonTable256 pre-shifted one bit to the left. // Interleave bits of x and y. 0x5415. 0x5104. const unsigned int S[] = {1. 0x5145. . Extending this same idea. 0x5555 unsigned short x. // bits of x are in the even positions and y in the odd. 0x5540. 0x5101.
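Since the pre-shifted-table refinement is easy to miss in prose, here is a minimal sketch of the two-table form. The table names MortonTableEven and MortonTableOdd and the build_tables() helper are my own illustration (the first table simply duplicates MortonTable256, the second holds the same values shifted left by one); the tables could just as well be written out as constants.

#include <stdio.h>

/* Two-table variant of the Morton lookup described above: the odd table is
   MortonTable256 pre-shifted one bit left, so the y lookups need no explicit
   shift, and sharing the single << 16 keeps the combine at 8 operations. */
static unsigned short MortonTableEven[256]; // bits of a byte spread to even positions
static unsigned short MortonTableOdd[256];  // the same values, shifted left by one

static void build_tables(void)
{
  for (unsigned int i = 0; i < 256; i++)
  {
    unsigned short spread = 0;
    for (unsigned int b = 0; b < 8; b++)
      spread |= (unsigned short)(((i >> b) & 1U) << (2 * b));
    MortonTableEven[i] = spread;
    MortonTableOdd[i]  = (unsigned short)(spread << 1);
  }
}

int main(void)
{
  build_tables();
  unsigned short x = 0x0003, y = 0x0001;
  unsigned int z = (unsigned int)(MortonTableOdd[y >> 8] | MortonTableEven[x >> 8]) << 16
                 | MortonTableOdd[y & 0xFF] | MortonTableEven[x & 0xFF];
  printf("0x%08x\n", z); // prints 0x00000007: x bits in even, y bits in odd positions
  return 0;
}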

Interleave bits with 64-bit multiply

unsigned char x;  // Interleave bits of (8-bit) x and y, so that all of the
unsigned char y;  // bits of x are in the even positions and y in the odd;
unsigned short z; // z gets the resulting 16-bit Morton Number.

z = ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) *
     0x0102040810204081ULL >> 49) & 0x5555 |
    ((y * 0x0101010101010101ULL & 0x8040201008040201ULL) *
     0x0102040810204081ULL >> 48) & 0xAAAA;

In 11 operations, this version interleaves bits of two bytes (rather than shorts, as in the other versions), but many of the operations are 64-bit multiplies so it isn't appropriate for all machines. Holger Bettag was inspired to suggest this technique on October 10, 2004 after reading the multiply-based bit reversals here.

Interleave bits by Binary Magic Numbers

const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF};
const unsigned int S[] = {1, 2, 4, 8};

unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x
unsigned int y; // are in the even positions and bits from y in the odd;
unsigned int z; // z gets the resulting 32-bit Morton Number.

x = (x | (x << S[3])) & B[3];
x = (x | (x << S[2])) & B[2];
x = (x | (x << S[1])) & B[1];
x = (x | (x << S[0])) & B[0];

y = (y | (y << S[3])) & B[3];
y = (y | (y << S[2])) & B[2];
y = (y | (y << S[1])) & B[1];
y = (y | (y << S[0])) & B[0];

z = x | (y << 1);

Determine if a word has a zero byte

// Fewer operations:
unsigned int v; // 32-bit word to check if any 8-bit byte in it is 0
bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);

The code above may be useful when doing a fast string copy in which a word is copied at a time; it uses 5 operations. On the other hand, testing for a null byte in the obvious ways (which follow) takes at least 7 operations (when counted in the most sparing way), and at most 12.

// More operations:
bool hasNoZeroByte = ((v & 0xff) && (v & 0xff00) && (v & 0xff0000) && (v & 0xff000000));
// OR:
unsigned char * p = (unsigned char *) &v;
bool hasNoZeroByte = *p && *(p + 1) && *(p + 2) && *(p + 3);

The code at the beginning of this section (labeled "Fewer operations") works by first zeroing the high bits of the 4 bytes in the word. Subsequently, it adds a number that will result in an overflow to the high bit of a byte if any of the low bits were initially set. Next the high bits of the original word are ORed with these values; thus, the high bit of a byte is set iff any bit in the byte was set. Finally, we determine if any of these high bits are zero by ORing with ones everywhere except the high bits and inverting the result. Extending to 64 bits is trivial; simply increase the constants to be 0x7F7F7F7F7F7F7F7F.

For an additional improvement, a fast pretest that requires only 4 operations may be performed to determine if the word may have a zero byte. The test also returns true if the high byte is 0x80, so there are occasional false positives, but the slower and more reliable version above may then be used on candidates for an overall increase in speed with correct output.

bool hasZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;
if (hasZeroByte) // or may just have 0x80 in the high byte
{
  hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
}

There is yet a faster method: use hasless(v, 1), which is defined below; it works in 4 operations and requires no subsequent verification. It simplifies to

bool hasZeroByte = (v - 0x01010101UL) & ~v & 0x80808080UL;

The subexpression (v - 0x01010101UL) evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.

Paul Messmer suggested the fast pretest improvement on October 2, 2004. Juha Järvi later suggested hasless(v, 1) on April 6, 2005, which he found on Paul Hsieh's Assembly Lab; previously it was written in a newsgroup post on April 27, 1987 by Alan Mycroft.

Determine if a word has a byte less than n

Test if a word x contains an unsigned byte with value < n. Specifically for n=1, it can be used to find a 0-byte by examining one long at a time, or any byte by XORing x with a mask first. Uses 4 arithmetic/logical operations when n is constant.

Requirements: x>=0; 0<=n<=128
#define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)

To count the number of bytes in x that are less than n in 7 operations, use

#define countless(x,n) \
(((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)

Juha Järvi sent this clever technique to me on April 6, 2005. The countless macro was added by Sean Anderson on April 10, 2005, inspired by Juha's countmore, below.
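To make the n=1 use concrete, here is a minimal sketch of a word-at-a-time scan for a terminating zero byte using hasless(word, 1). The wrapper name demo_strlen and the union-backed buffer are my own framing: the union keeps the word reads aligned, and the buffer is padded to a whole number of words so no read goes past the object. A production version would need to handle unaligned starts and arbitrary buffer ends.

#include <stdio.h>
#include <string.h>

#define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)

/* Examine one unsigned long at a time; hasless(word, 1) is nonzero exactly
   when the word contains a zero byte, after which a plain byte loop finds
   its position.  This is a sketch, not a drop-in strlen replacement. */
static size_t demo_strlen(const unsigned long *w)
{
  const unsigned long *start = w;
  while (!hasless(*w, 1))          // no zero byte in this word yet
    w++;
  const char *p = (const char *) w;
  while (*p)                       // locate the exact zero byte within the word
    p++;
  return (size_t)(p - (const char *) start);
}

int main(void)
{
  union { unsigned long w[4]; char c[4 * sizeof(unsigned long)]; } buf = {{0}};
  strcpy(buf.c, "bit twiddling");
  printf("%zu %zu\n", demo_strlen(buf.w), strlen(buf.c)); // both print 13
  return 0;
}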

Determine if a word has a byte greater than n

Test if a word x contains an unsigned byte with value > n. Uses 3 arithmetic/logical operations when n is constant.

Requirements: x>=0; 0<=n<=127
#define hasmore(x,n) (((x)+~0UL/255*(127-(n))|(x))&~0UL/255*128)

To count the number of bytes in x that are more than n in 6 operations, use:

#define countmore(x,n) \
(((((x)&~0UL/255*127)+~0UL/255*(127-(n))|(x))&~0UL/255*128)/128%255)

The macro hasmore was suggested by Juha Järvi on April 6, 2005, and he added countmore on April 8, 2005.

Determine if a word has a byte between m and n

When m < n, this technique tests if a word x contains an unsigned byte value, such that m < value < n. It uses 7 arithmetic/logical operations when n and m are constant. Note: Bytes that equal n can be reported by likelyhasbetween as false positives, so this should be checked by character if a certain result is needed.

Requirements: x>=0; 0<=m<=127; 0<=n<=128
#define likelyhasbetween(x,m,n) \
((((x)-~0UL/255*(n))&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)

This technique would be suitable for a fast pretest. A variation that takes one more operation (8 total for constant m and n) but provides the exact answer is:

#define hasbetween(x,m,n) \
((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)

To count the number of bytes in x that are between m and n (exclusive) in 10 operations, use:

#define countbetween(x,m,n) (hasbetween(x,m,n)/128%255)

Juha Järvi suggested likelyhasbetween on April 6, 2005. Sean Anderson created hasbetween and countbetween on April 10, 2005.
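As a small illustration of the m < value < n semantics (the surrounding program, the variable names, and the ASCII-digit example are mine, not from the original text), the following packs a few bytes into one word and asks whether any of them is a decimal digit, i.e. strictly between '0' - 1 and '9' + 1:

#include <stdio.h>
#include <string.h>

#define hasbetween(x,m,n) \
((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
#define countbetween(x,m,n) (hasbetween(x,m,n)/128%255)

int main(void)
{
  /* Copy four ASCII bytes into the low end of a word (the rest stays 0);
     all byte values are <= 127, as the macros require. */
  unsigned long w = 0;
  memcpy(&w, "a7c!", 4);
  printf("has digit: %s\n", hasbetween(w, '0' - 1, '9' + 1) ? "yes" : "no"); // yes
  printf("digit count: %lu\n", (unsigned long)countbetween(w, '0' - 1, '9' + 1)); // 1
  return 0;
}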