Chapter 8 Integer Optimizations 179
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
8.6 Efficient Implementation of Population-Count
Function in 32-Bit Mode
Population count is an operation that determines the number of set bits in a bit string. For example,
this can be used to determine the cardinality of a set. The example code in this section shows how to
efficiently implement a population count operation for 32-bit operands. The example is written for the
inline assembler of Microsoft
®
Visual C.
Function popcount implements a branchless computation of the population count. It is based on a
O(log(n)) algorithm that successively groups the bits into groups of 2, 4, 8, 16, and 32, while
maintaining a count of the set bits in each group. The algorithm consists of the following steps:
1. Partition the integer into groups of two bits. Compute the population count for each 2-bit group
and store the result in the 2-bit group. This calls for the following transformation to be performed
for each 2-bit group:
00b -> 00b
01b -> 01b
10b -> 01b
11b -> 10b
If the original value of a 2-bit group is v, then the new value will be v –(v >> 1). In order to handle
all 2-bit groups simultaneously, it is necessary to mask appropriately to prevent spilling from one
bit group to the next lower bit group. Thus:
w = v - ((v >> 1) & 0x55555555)
2. Add the population count of adjacent 2-bit group and store the sum to the 4-bit group resulting
from merging these adjacent 2-bit groups. To do this simultaneously to all groups, mask out the
odd numbered groups, mask out the even numbered groups, and then add the odd numbered
groups to the even numbered groups:
x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
Each 4-bit field now has one of the following values: 0000b, 0001b, 0010b, 0011b, or 0100b.
3. For the first time, the value in each k-bit field is small enough that adding two k-bit fields results
in a value that still fits in the k-bit field. Thus the following computation is performed:
y = (x + (x >> 4)) & 0x0F0F0F0F
The result is four 8-bit fields whose lower half has the desired sum and whose upper half contains
“junk” that has to be masked out. A symbolic form is as follows:
x = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
The WWWW, XXXX, YYYY, and ZZZZ values are the interesting sums with each at most
1000b, or 8 decimal.