(Related to this issue)

This branch provides a speed boost to div and mod by a constant when dealing with unsigned 32-bit inputs on x86_64 platforms. Special thanks to Marģers for the suggestion and assembly language sample showing how using the full 64-bit registers provide a significant gain, especially for "bad" divisors such as 7 (they have no suitable reciprocal in 32 bits, only 33 bits, which requires some extra instructions to compensate for).

The speed boost comes from the fact that, constrained to 32-bit, the operations required for a "good" divisor is a high multiplication (take the upper 32 bits of the result) followed by a right shift. When expanded to 64-bit, because of the restricted domain of the input (between 0 and 2^31 - 1), a much larger error is permitted and so it can be configured such that only the high multiplication is required. This also removes the pipeline stall between the multiplication and the shift, allowing more concurrent execution. For example, to divide by 3 using 32-bit only, you have to multiply high by $AAAAAAAB then shift by 1. Expanding to 64-bit, multiplying by $5555555555555556 is all that's required. The savings are even greater with "bad" divisors because, under 32-bit, an addition and rorate-with-carry is required between the multiplication and the shift, and under 64-bit, once again just the high multiplication by a suitable constant suffices.

The updates to the bdiv bench test (and tmoddiv6) adds some tests for 16-bit inputs and changes the 32-bit and 64-bit tests to remove division by 5 (since it's very similar to how division by 10 is handled - the reciprocals are identical and only the shift factor is different) and adds division by 7, a known "bad" divisor when it comes to this algorithm and hence a good candidate for benchmarking.

Criteria

Confirm correct compilation and speed gains under x86_64.

Additional

I attempted to implement this for AArch64 as well, but no speed gains were noted, probably due to how constants are encoded and a 'safety check' to ensure the upper 32 bits of the numerator are zero (a couple of the date/time tests failed on x86_64 without this because a function result was 64-bit, with only the lower 32 bits being used, but the upper 32 bits being non-zero). The x86 peephole optimizer removes the "and $ffffffff,%reg" instructions if they prove to be unnecessary, but the AArch64 peephole optimizer does not yet have such optimisations. This may improve in the future.

Note

The changes to bdiv will cause a merge conflict with the "Fast "x mod constant = 0" optimisation" merge request, as they add different tests, although this is simple enough to rectify, whichever one is merged first.

Edited Sep 22, 2021 by J. Gareth "Kit" Moreton

x86-64: Improved integer division (part 1)

Criteria

Additional

Note

Merge request reports