fast division by constant integer

Description

Integer division is notoriously slow, but can be replaced by a multiplication-shift sequence which preferrable when dividing many numbers by the same divisor. In short, given a k bit integer d, p = log2_ceil(d) and magic = (2^(k+p) - 1) / d. n / d = (muluh(a, magic) + a) >> p, where muluh is the upper k bits of the 2k-bit product a * magic. Although not necessary, you can decompose even divisors such that d = 2^tz * d_odd where tz is the trailing zeros. It is probably of dubious benefit in most cases, but its pretty cheap to calculate and produces more intuitive magic numbers and shifts.

While this is heavily influenced by https://github.com/ridiculousfish/libdivide and the TensorIntDiv implementation in Eigen (which are both influenced by Hacker's delight), I deviated in a few areas as the sources above were confusing to me.

Most implementations -- and apparently the clang and g++ compilers -- handle overflow in the final step as follows:

b = muluh(a, magic);
// perform (a + b) >> p, but handle overflow in a + b with some clever re-arrangement
t = (((a - b) >> 1) + b ) >> (p - 1)

This works well but requires special handling when the divider is exactly 1, as log2_ceil(1) == 0. My approach always uses a wider type to handle overflow. For example:

uint32_t fast_int_div(uint32_t a, uint32_t magic, int shift) {
  uint64_t b = muluh(a, magic);
// no overflow to worry about, shift == 0 or 32 is ok too
  uint64_t t = (b + a) >> shift;
  return static_cast<uint32_t>(t);
}

Even if a native wider type is not available, we can handle it by constructing wider type from 2 integers. I think this approach is simpler as no special handling is required for d == 1, and we can actually handle any divisor of any magnitude for maximum generality.

All of this is abstracted away with IntDivider<Numerator> which allows users to divide individual scalars or dense Eigen expressions.

int d; // this can be any integer type
IntDivider<int> divider(d); // IntDivider<int> 's constructor is templated to accept any integer type
Vector<int, Dynamic> data;
data[0] = data[0] / divider;
data[1] = divider.divide(data[1]);

I did not automatically implement this anywhere as it is not always preferrable. If the divisor is known at compile time, it is best to let the compiler do the work, preferably with a raw loop. The reason we prefer a raw loop is that Eigen sometimes vectorizes 32-bit integer division by casting to double, which is preferrable if the divisor is not known at compile time. Some benchmarks:

ubuntu clang++ -O3 -mavx2 -DNDEBUG

uint32_t, size = 1024, d = 7 ns
volatile raw loop 1138
compile-time raw loop 72
fast division xpr 108
int64_t, size = 1024, d = 7 ns
volatile raw loop 1138
compile-time raw loop 384
fast division xpr 644

Unsigned division is better across the board, and is closer to the compile time case. I believe my signed division is not ideal as I simply re-use the unsigned version and flip the sign at the end. I think the compiler can also cheat by knowing the sign and magnitude of the divisor, so maybe its not that bad.

Reference issue

Additional information

Edited by Charles Schlosser

Merge request reports

Loading