Eyas El-Qawasmeh and Wafa'a Al-Qarqaz
Computer Science Dept., Jordan University of Science and Technology, P.O. Box 3030, Irbid 22110, Jordan

ABSTRACT Bit-counting is the operation of counting the number of ones in a given computer word or binary vector. Several solutions to this problem exist; among them is the use of a lookup table. However, the lookup table cannot be used for large binary vectors or computer words. This paper presents a new lookup-table-based implementation of bit-counting. The advantage of the proposed algorithm is that it avoids the size limitation of the table lookup. This is achieved by exploiting the regular behavior of the number of set bits across all possible values of a computer word; this regular pattern enables us to reduce the size of the lookup table. Performance results show that the suggested techniques outperform other existing methods.
Keywords: Bit-Counting, Lookup Table, RemainingBits, NumberOfOnes, Comrade Group.

1. Introduction
Bit-counting, also called popcount, refers to the operation of counting the number of ones in a given computer word or binary vector. Bit-counting is used in many applications, including information retrieval systems, file processing systems, coding theory [Berkovich, et al., 2000], genetic algorithms, and game theory. For example, an information retrieval system may represent search results as a single-bit attribute matrix over tens or hundreds of thousands of documents, indicating whether each document satisfies one or more search criteria; the bit-counting operation is then used to determine the number of documents satisfying the search criteria. Likewise, file comparison routines may compare files having large numbers of elements, where a count of the number of matching elements is required to form an overall match metric. Genetic algorithms use the bit-counting

operation in many of their procedures [Goldberg, et al., 1992]. Currently, there are several software implementations of the bit-counting operation: a) sequential shifting, b) arithmetic logic counting (AL), c) emulated popcount, d) lookup table, e) Hamming distance bit vertical counter (HC), and f) frequency division [Berkovich et al., 2000] [Berkovich et al., 1998] [El-Qawasmeh, Hemidi, 2000] [Gutman, 2000] [Reingold et al., 1977]. These schemes compete in both time and space efficiency. For example, the lookup table algorithm beats the sequential shifting, arithmetic logic, and parallel methods in time efficiency; on the other hand, its space requirement becomes critical when the binary vector (or computer word) size grows large, since it occupies a table of size 2^vector size. This paper is concerned with the lookup table algorithm. The table lookup has a severe

problem presented by the limitation of its size: for example, we cannot, in general, use a lookup table of 2^32 entries. The main objective of this paper is to introduce a new enhancement to the lookup table algorithm by reducing the size of the lookup table and, in addition, gaining more control over the table size. The reduction takes advantage of the regular behavior of the number of ones in any four consecutive binary values whose smallest member is a multiple of four. Experimental results showed that the proposed scheme is the best among all known methods.

The organization of this paper is as follows. Section 2 describes some of the existing software methods. Section 3 describes the new suggested method. In Section 4, the performance of the suggested technique is analyzed and compared with other implementations of popcount. The conclusions are introduced in Section 5.

2. Current Related Work
Currently, there are many different software implementations of bit-counting that vary in their level of efficiency. The following is a listing of the most common methods.

2.1 Sequential Shifting
This method loops and checks each bit alone until the number becomes zero. The algorithm is shown in figure (1) below:

    Counter = 0
    While the Number ≠ 0 do
        If the lowest bit of Number is 1 then increase Counter by one
        Shift the Number to the right one bit
    End while

Figure (1): Sequential shifting method

The running time of sequential shifting is O(n), where n is the number of bits in the computer word.

2.2 Arithmetic Logic Counting (AL)
This method depends on applying the mask operation (AND) to a number with itself after subtracting one from it. The same logic operation is repeated as long as the number does not equal zero. The algorithm is shown in figure (2):

    Counter = 0
    While the Number ≠ 0 do
        Number = Number AND (Number - 1)    // AND is the bitwise logical operation
        Increase Counter by one
    End while

Figure (2): Arithmetic Logic Counting (AL) method

The operational time of this algorithm is proportional to the number of "1"s rather than to the length of the computer word; that is, it is O(ones(w)), where w is a computer word. This is due to the "While" loop, which executes a number of times equal to the number of "1"s in the given computer word. This means that its performance is better for binary vectors that are sparse in "1"s than for ones that are dense in "1"s [El-Qawasmeh, 2001].

2.3 Parallel Count
The parallel counting successively groups the bits into sub-groups of 2, 4, 8, 16, and 32, while maintaining a count of "1"s in each group. The algorithm can be summarized as follows:
1. Partition the register into groups of 2 bits.
2. Compute the population count of each 2-bit group and store the result in that 2-bit group. To do this simultaneously for all groups, we mask out the odd numbered groups, mask out the even numbered groups, and then add the odd numbered groups to the even numbered groups; the masking is required to prevent spilling from one group into the next lower group.
3. Add the population counts of adjacent 2-bit groups and store each sum in the 4-bit group formed by merging those adjacent 2-bit groups.
4. Add the adjacent 4-bit fields together and store the sums in 8-bit fields. At this point the value in each k-bit field is small enough that adding two k-bit fields yields a value that still fits in a k-bit field; the result is four 8-bit fields whose lower half holds the correct sum and whose upper half contains an incorrect value that has to be masked out.
5. Add the adjacent 8-bit fields together and store the sums in the 16-bit fields created by merging the adjacent 8-bit fields.
6. Add the values of the two 16-bit groups to produce the final result, which is stored in the least significant six bits.
The above steps apply to 32-bit machines; for 64-bit machines an extra step, similar to step 6, is required to add the values of the two adjacent 32-bit groups.
2.4 Lookup Table
The lookup table algorithm is based upon storing the number of ones of each possible word value in a lookup table. To get the popcount value of a given number, a single access to the lookup table, indexed by the number itself, directly returns the result. The lookup technique is fast: it runs in constant time, since there are no mathematical calculations or logical operations. However, it runs efficiently only for small computer words (8 or 16 bits); if the table size is large, the lookup table technique is not appropriate. For example, if the computer word is 32 bits, then we would need to store in RAM a table of size 2^32, which is not applicable these days. Therefore, an improvement to this critical point is needed.

To work against this badly-growing lookup table size, a previously suggested enhancement technique depended on splitting a 32-bit or 64-bit computer word into groups and then, instead of a single access to a table of size 2^word size, getting a value from a table of size 2^group size for each group. Figure (3) shows the algorithm of this enhancement:

    Create and fill the lookup table of size 2^group size
    NumberOfOnes = Table [No. AND 0xff]
                 + Table [(No. shifted right 8 bits) AND 0xff]
                 + Table [(No. shifted right 16 bits) AND 0xff]
                 + Table [No. shifted right 24 bits]

Figure (3): Enhanced lookup table method

Note that in figure (3), AND represents the logical bitwise AND operation, and the value 0xff is the hexadecimal representation of the binary value with ones in its least significant 8 bits (each f corresponds to 4 ones).

3. The Suggested Algorithm
In the proposed algorithm, we use a lookup table of size less than 2^word size while still performing only a single table access. The basis for this improvement is the regular behavior of the number of ones in any four consecutive binary values within the same comrade group. A comrade group is four consecutive binary values starting with a value that is a multiple of four. Within each comrade group, the pattern x - 1, x, x, x + 1 appears, where x is the number of ones in the second (or third) value of the group. This property is shown in figure (4).

Figure (4): The first 32 possible values for a computer word of size 8 bits; the popcount of each value is shown in the "Number of Ones" columns.

That is, instead of storing the bit-counting (popcount) value of every possible value of a computer word or binary vector of a specific size, we can store the bit-counting value of a single element of each comrade group and then solve the problem for the other elements of that group with the RemainingBits function. For each computer word (e.g., 32 bits), the most significant n bits are used as an index into a lookup table of 2^n entries, instead of a table of size 2^word size, with each entry in the table holding the x value of the corresponding comrade group (as shown in figure (4)). Making use of this property gives us more flexibility in constructing the lookup table.

Two versions of the algorithm will be introduced. The first explains the main point of the algorithm, with the gain of reducing the lookup table to one fourth of its size while keeping the constant running time (i.e., O(c)). The second gains more control over the table size, but with a larger running time complexity caused by a while loop that iterates over the sub-parts of the given vector. The RemainingBits function (figure (5-b) for version one, figure (7-b) for version two) is invoked either one time, if version one is used, or T times, if version two is used, where

    T = (word size - number of index bits) / 4

Note that the number of RemainingBits calls is therefore determined by the number of index bits.

3.1 Version One
Version one partitions the computer word into two parts. The first is the most significant K bits, where K is the memory word size minus 2; this part is used as an index into the lookup table, which returns the corresponding x value. The second part is the least significant 2 bits, which are sent to the RemainingBits function. The two results are then added together. The algorithm is shown in figure (5) below:

    Create and fill a lookup table of size 2^(word size - 2)
    NumberOfOnes = Table [number shifted right 2 bits] + RemainingBits(number AND 0x3)

Figure (5-a): The new suggested algorithm: Version one

The RemainingBits function is constructed from simple logical operations that determine the number of ones in the sub-parts of the problem. It is shown in figure (5-b) below:

    RemainingBits (number)
        Set counter = 0
        if ((number AND 3) equals 3) then counter = counter + 1    // AND: bitwise
        if ((number OR 0) equals 0) then counter = counter - 1
        return counter

Figure (5-b): First proposed algorithm: the RemainingBits function

Let us take an example to clarify this idea. For a computer word of 8 bits, a table of size 2^6 = 64 entries is constructed using the suggested approach; note that the table has 64 elements instead of 256, i.e., it is reduced to one fourth. Consider the number 16, represented in 8 bits:

    0 0 0 1 0 0 0 0

The most significant 6 bits are the index into the lookup table, and the remaining 2 bits are tested by the RemainingBits function. See figure (6).

Figure (6): Version one example

The lookup table returns the value 2 and the RemainingBits function returns -1. The total is 1, which is the bit-counting value of 16.

Going back to the table of figure (4) and examining it more carefully, we note that the property of comrade groups still holds across each run of four consecutive comrade groups (i.e., for comrade groups of comrade groups): taking the first entry of each comrade group, for any four consecutive groups starting at a multiple of 16, the property still holds. Taking more care over this note and making some changes to the code of version one gives us more control over the table size, by repeatedly reducing the table size to one fourth; on the other hand, the running time complexity becomes larger.

3.2 Version Two
In version two, the computer word is partitioned into two parts. The first part is the index bits, which are used as an index into the lookup table. The second part is the remaining bits of the computer word after excluding the index bits; the RemainingBits function is invoked with this second part and a true-valued flag as its parameters. Every two bits removed from the index reduce the table size by a factor of 1/4, at the cost of more invocations of the RemainingBits function. The code of version two is shown in figure (7-a) below:

    Create and fill a lookup table of size 2^(number of index bits)
    counter1 = Table [number shifted right by (word size - number of index bits) bits]
    number = number AND 0xfffff    // keep the remaining bits (0xfffff for 20 remaining bits)
    counter2 = 0
    flag = false                   // Boolean flag, initially set to false
    while (number not equal zero)
        counter1 = counter1 + RemainingBits(number AND 0xf, flag)
        if (counter2 > 1) then counter1 = counter1 + 1
        end if
        counter2 = counter2 + 1
        number = number shifted right 4 bits
    end while

Figure (7-a): The new suggested algorithm (Version Two)

since for large computer words, a lookup table of one fourth the original size is still not efficient. For example, for a 32-bit computer word, a table of 1073741824 (= 2^32 / 4) entries is still very large; reducing the space requirement to one fourth does not restore space efficiency for larger computer words. Version two is also useful for the enhanced version of the lookup table algorithm.

4. Performance Analysis
The RemainingBits function of version two is shown in figure (7-b) below:

    RemainingBits (number, flag)
        counter = 0
        if (number greater than 3) then
            return RemainingBits(number AND 0x3, flag)
                 + RemainingBits(number shifted right 2 bits, NOT flag)    // NOT: logical negation
        else
            if ((number AND 0x3) equals 3) then counter = counter + 1
            if ((number OR 0x0) equals 0) then counter = counter - 1
        if (flag equals TRUE) then counter = counter + 1
        return counter

Figure (7-b): The new suggested algorithm: RemainingBits function

In version one of the suggested algorithm, the constant running time complexity of the original lookup table algorithm is still achieved, with a single access to a lookup table one fourth the original size. For example, if the computer word is 8 bits, the table size is 64 entries instead of 256, and the algorithm still works in constant time. This version is more efficient when used with small computer words.

In version two, we can reduce the lookup table to a pre-specified number of entries and then calculate the complexity of resolving the remaining bits, i.e., those excluded from the index. The complexity now comes from the while loop, so the running time is no longer constant; it is related to the chosen table size. For example, if the computer word is 32 bits and the most significant 12 bits are taken as the index bits, the remaining 20 bits are sent to the RemainingBits function to be solved with recursive calls for each four bits. See figure (8).

    00010001101000101110000100110010

The most significant 12 bits index the lookup table; for each remaining 4 bits, the RemainingBits function is called.

Figure (8): Version two example

That is, for a number represented in 32 bits with a table of size 2^12, we construct a table of 2^12 elements and iterate in the while loop over the remaining 20 bits; the loop runs (n - 12) / 4 times for an n-bit number. The overall time complexity is thus five loop iterations for 32 bits, each with constant time complexity, where each iteration calls the RemainingBits function, which calls itself recursively two times for each 4 bits. Five iterations for 32 bits is considered a good running time, noting that the time complexity varies for different computer word sizes.

Version two is also applicable here; it reduces the table size again to one fourth (i.e., 16 entries) while keeping the running time efficient.

With a computer word of size 8 bits, version one is a good choice: the table size is 64 entries instead of 256, and the running time is constant. With a computer word of size 16 bits, we can apply version two, choosing a table size of 2^8 instead of 2^16 entries; with that choice the while loop iterates two times, once for each four bits of the remaining (n - 8) bits. With a computer word of size 64 bits, we can apply version two, choosing a table size of 2^12 instead of 2^64 entries; with that choice the while loop iterates thirteen times, once for each four bits of the remaining (n - 12) bits.

To measure the performance of version two on 32-bit binary vectors (or computer words), a 32-bit machine was simulated by randomly generating hundreds of thousands of 32-bit binary vectors according to some probability of ones, ranging from zero to one. A Pentium 4 machine was used to generate the values and execute the second version of the new algorithm on them. The execution was repeated several hundred times, and the average execution time was taken. The same was done for the sequential shifting algorithm, the arithmetic logic algorithm, the parallel algorithm, and the enhanced lookup table algorithm. A comparison between these methods is shown in figure (9).

Figure (9): Time comparisons between different bit-counting methods and the new suggested method (running time in ms versus probability of ones, from 0 to 1, for the BitCount, parallel, Arithmetic Logic, and Enhanced Lookup methods).

As shown in figure (9), the suggested enhancement of the lookup table algorithm is faster than many of the existing bit-counting algorithms.

5. Conclusion
Since bit-counting has become an important topic, many new algorithms have evolved to meet the demands of bit-counting applications, and many existing ones have been enhanced for time efficiency. One of the most popular time-efficient algorithms is the lookup table algorithm, which has the drawback of large space requirements for large computer words and binary vectors. In this paper, a new algorithm is presented that overcomes the heavy space requirement of the lookup table algorithm while keeping it time efficient. Both versions of the algorithm implement the lookup table technique side by side with the idea of comrade groups, so as to operate on a computer word or binary vector of any size. This implementation makes the management of running time against required space more flexible: it acts as a kind of slider between the table size and the running time, set according to both the nature of the application and the available resources.

The door for enhancing the performance of this algorithm is still open. By making more use of the comrade group property discussed before, the approach could become more useful if applied not only in the procedure code but also in constructing the lookup table itself: changing the construction of the lookup table according to the behavior of the repeating patterns when n is a power of 2, and resolving this with a specific mathematical rule, could raise the algorithm's space efficiency to a higher level.

References
[1] S. Berkovich, G. M. Lapir, and M. Mack, "A Bit-Counting Algorithm Using the Frequency Division Principle," Software: Practice and Experience, Vol. 30, 2000, pp. 1531-1540.
[2] S. Berkovich, E. El-Qawasmeh, G. M. Lapir, M. Mack, and S. Zincke, "Organization of Near Matching in Bit Attribute Matrix Applied to Associative Access Methods in Information Retrieval," Proc. of the 16th IASTED International Conference Applied Informatics, Garmisch-Partenkirchen, Germany, 1998, pp. 62-64.
[3] E. El-Qawasmeh, "Beating the Popcount," International Journal of Information Technology, Vol. 9, No. 1, pp. 1-18.
[4] E. El-Qawasmeh, "Performance Investigation of Bit-Counting Algorithms with a Speedup to Lookup Table," Journal of Research and Practice in Information Technology, Australia, 2001, pp. 215-230.
[5] E. El-Qawasmeh and I. Hemidi, "Performance Investigation of Hamming Distance Bit Vertical Counter Applied to Access Methods in Information Retrieval," Journal of the American Society for Information Science, Vol. 51, 2000, pp. 427-432.
[6] D. Goldberg, K. Deb, and J. Clark, "Genetic Algorithms, Noise, and the Sizing of Populations," Complex Systems, Vol. 6, 1992, pp. 333-362.
[7] R. Gutman, "Exploiting 64-Bit Parallelism," Dr. Dobb's Journal, 2000, pp. 133-134.
[8] E. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice, Prentice Hall, Englewood Cliffs, New Jersey, 1977.
