This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed; hardware popcount is normally fast if it exists at all).

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.
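As a concrete illustration of that standard-library route, here is a minimal sketch (the function names are mine, but std::popcount and std::bitset::count are the real APIs named above):

```cpp
#include <bit>       // C++20, for std::popcount
#include <bitset>
#include <cstdint>

// C++20: typically compiles to a single hardware popcnt where one exists,
// with a decent software fallback otherwise.
int count_bits_cpp20(std::uint32_t x) {
    return std::popcount(x);
}

// Portable pre-C++20 alternative.
int count_bits_bitset(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}
```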
**Portable algorithms that don't need (or benefit from) any HW support**

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.

If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set bit with a bithack in a loop until the value becomes zero. A sketch of both approaches follows.
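A minimal sketch of the two fallbacks just described (the helper names and the table-filling recurrence are mine, not from the original answer):

```cpp
#include <cstdint>

// Byte-wide lookup table: 256 entries keeps it small enough to stay in cache.
static unsigned char popcount_table[256];

void init_popcount_table() {
    // Recurrence: pop(i) = pop(i >> 1) + (i & 1); table[0] stays 0.
    for (int i = 1; i < 256; ++i)
        popcount_table[i] = popcount_table[i / 2] + (i & 1);
}

// Look up each byte separately, as described above.
int popcount_by_table(std::uint32_t x) {
    return popcount_table[x & 0xFF]
         + popcount_table[(x >> 8) & 0xFF]
         + popcount_table[(x >> 16) & 0xFF]
         + popcount_table[(x >> 24) & 0xFF];
}

// Kernighan's trick: x &= x - 1 clears the lowest set bit, so the loop
// runs once per set bit. Cheap when inputs are mostly zeros.
int popcount_sparse(std::uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        ++count;
    }
    return count;
}
```

For mostly-1 inputs, the same loop on the complement works: 32 - popcount_sparse(~x).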
I believe a very good general purpose algorithm is the following, known as the 'parallel' or 'variable-precision SWAR' algorithm. I have expressed this in a C-like pseudo language; you may need to adjust it to work for a particular language (e.g. using uint32_t in C++, or >>> instead of >> in Java):

```
int numberOfSetBits(uint32_t i)
{
    // Java: use int, and use >>> instead of >>
    // C or C++: use uint32_t
    i = i - ((i >> 1) & 0x55555555);                 // add pairs of bits
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
    i = (i + (i >> 4)) & 0x0F0F0F0F;                 // groups of 8
    return (i * 0x01010101) >> 24;                   // horizontal sum of bytes
}
```

For JavaScript: coerce to integer with |0 for performance; change the first line to i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so it will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs, where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)

GCC 10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds.

**How this SWAR bithack works**

```
i = i - ((i >> 1) & 0x55555555);
```

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). For each 2-bit field holding the value 2a+b, subtracting the high bit a leaves a+b, the number of set bits in that field.

The next step takes the odd and even eight of those 16x 2-bit accumulators and adds them pairwise, producing 8x 4-bit sums. The subtraction optimization isn't possible this time, so it just masks before / after shifting. Using the same 0x33333333 constant both times, rather than 0xCCCCCCCC before shifting, is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4, if all 4 bits of the corresponding input bits were set. 4+4 = 8, which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).
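For comparison, here is the plain mask/shift/add form that each of those steps optimizes. This block is my own addition, not part of the original answer's code, but each line is bit-for-bit equivalent to the corresponding optimized step:

```cpp
#include <cstdint>

int numberOfSetBits_plain(std::uint32_t i) {
    i = (i & 0x55555555) + ((i >> 1) & 0x55555555);  // step 1: 16x 2-bit sums
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // step 2: 8x 4-bit sums (same as optimized)
    i = (i & 0x0F0F0F0F) + ((i >> 4) & 0x0F0F0F0F);  // step 3: 4x 8-bit sums, mask both inputs
    return (i * 0x01010101) >> 24;  // multiply sums all four bytes into the top byte
}
```

The multiply in the last line stands in for two more shift-and-add rounds: multiplying by 0x01010101 adds the four byte-wide accumulators together, and the result lands in the top byte, which the shift by 24 extracts.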
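A quick spot-check of the SWAR routine against a naive one-bit-per-iteration loop (the sample values and reference function are mine; the SWAR body is repeated so the check is self-contained):

```cpp
#include <cassert>
#include <cstdint>

// The SWAR routine from the answer above.
int numberOfSetBits(std::uint32_t i) {
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    i = (i + (i >> 4)) & 0x0F0F0F0F;
    return (i * 0x01010101) >> 24;
}

// Naive reference: test one bit per iteration.
int popcount_naive(std::uint32_t x) {
    int count = 0;
    for (int bit = 0; bit < 32; ++bit)
        count += (x >> bit) & 1;
    return count;
}

int main() {
    const std::uint32_t samples[] = {0u, 1u, 0x55555555u, 0xFFFFFFFFu, 0xDEADBEEFu};
    for (std::uint32_t x : samples)
        assert(numberOfSetBits(x) == popcount_naive(x));
}
```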