fast division by constant integer
Description
Integer division is notoriously slow, but can be replaced by a multiplication-shift sequence which preferrable when dividing many numbers by the same divisor. In short, given a k bit integer d, p = log2_ceil(d) and magic = (2^(k+p) - 1) / d. n / d = (muluh(a, magic) + a) >> p, where muluh is the upper k bits of the 2k-bit product a * magic. Although not necessary, you can decompose even divisors such that d = 2^tz * d_odd where tz is the trailing zeros. It is probably of dubious benefit in most cases, but its pretty cheap to calculate and produces more intuitive magic numbers and shifts.
While this is heavily influenced by https://github.com/ridiculousfish/libdivide and the TensorIntDiv implementation in Eigen (which are both influenced by Hacker's delight), I deviated in a few areas as the sources above were confusing to me.
Most implementations -- and apparently the clang and g++ compilers -- handle overflow in the final step as follows:
b = muluh(a, magic);
// perform (a + b) >> p, but handle overflow in a + b with some clever re-arrangement
t = (((a - b) >> 1) + b ) >> (p - 1)
This works well but requires special handling when the divider is exactly 1, as log2_ceil(1) == 0. My approach always uses a wider type to handle overflow. For example:
uint32_t fast_int_div(uint32_t a, uint32_t magic, int shift) {
uint64_t b = muluh(a, magic);
// no overflow to worry about, shift == 0 or 32 is ok too
uint64_t t = (b + a) >> shift;
return static_cast<uint32_t>(t);
}
Even if a native wider type is not available, we can handle it by constructing wider type from 2 integers. I think this approach is simpler as no special handling is required for d == 1, and we can actually handle any divisor of any magnitude for maximum generality.
All of this is abstracted away with IntDivider<Numerator> which allows users to divide individual scalars or dense Eigen expressions.
int d; // this can be any integer type
IntDivider<int> divider(d); // IntDivider<int> 's constructor is templated to accept any integer type
Vector<int, Dynamic> data;
data[0] = data[0] / divider;
data[1] = divider.divide(data[1]);
I did not automatically implement this anywhere as it is not always preferrable. If the divisor is known at compile time, it is best to let the compiler do the work, preferably with a raw loop. The reason we prefer a raw loop is that Eigen sometimes vectorizes 32-bit integer division by casting to double, which is preferrable if the divisor is not known at compile time. Some benchmarks:
ubuntu clang++ -O3 -mavx2 -DNDEBUG
| uint32_t, size = 1024, d = 7 | ns |
|---|---|
| volatile raw loop | 1138 |
| compile-time raw loop | 72 |
| fast division xpr | 108 |
| int64_t, size = 1024, d = 7 | ns |
|---|---|
| volatile raw loop | 1138 |
| compile-time raw loop | 384 |
| fast division xpr | 644 |
Unsigned division is better across the board, and is closer to the compile time case. I believe my signed division is not ideal as I simply re-use the unsigned version and flip the sign at the end. I think the compiler can also cheat by knowing the sign and magnitude of the divisor, so maybe its not that bad.