Generic fpc_mul_qword_checkoverflow (probably also fpc_mul_qword) are really slow
Usually, debug mode that does almost no optimizations and has range checks enabled slows down the execution by the factor of two or three, which is manageable.
Today I noticed that one of my performance-critical paths suddenly slowed down on x86-64 by the factor of 15. This starts to become unpleasant (like, have to wait one minute instead of 4 seconds). I have narrowed this problem down to numerous SizeUint * SizeUint multiplications that produce fpc_mul_qword_checkoverflow calls under range checks.
There are two possible solutions in my mind. Either:
-
Make a specialized version for
x86-64, asx86-32already does (rtl/i386/int64p.inc). Should be easy (not that I can do it myself...), but will be limited tox86-64. Still, at least onx86-64it will be useful all around, becauseqword * qwordmultiplications are native there and thus extremely common. -
Generic implementation of
fpc_mul_qwordandfpc_mul_qword_checkoverflow(rtl/inc/int64.inc) can be improved.
- Provided a half-width (
uint32 * uint32) multiplication is available, resort to it in a well-known way.
function fpc_mul_qword(f1, f2: qword): qword;
type
Half = uint32;
Quart = uint16;
var
hl, p0, p1, p2, p3: Half;
carry: Quart;
begin
hl := Hi(f1) * Lo(f2) + Lo(f1) * Hi(f2);
p0 := Half(Lo(Lo(f1))) * Lo(Lo(f2));
p1 := Half(Lo(Lo(f1))) * Hi(Lo(f2));
p2 := Half(Hi(Lo(f1))) * Lo(Lo(f2));
p3 := Half(Hi(Lo(f1))) * Hi(Lo(f2));
carry := (Half(Hi(p0)) + Lo(p1) + Lo(p2)) shr bitsizeof(Quart);
result :=
Half(p0 + p1 shl bitsizeof(Quart) + p2 shl bitsizeof(Quart)) or
qword(p3 + Hi(p1) + Hi(p2) + carry + hl) shl bitsizeof(Half);
end;
(Okay, this looks scary... maybe I overcomplicated something...)
- Current implementation can use
BsrQWordto improve the common case when one of numbers to be multiplied is small.
function fpc_mul_qword(f1, f2: qword): qword;
var
swapTemp: qword;
b: byte;
begin
result := 0;
// Ensure f1 >= f2 and both nonzero, to use BsrQWord and to minimize iterations count.
if f1 < f2 then begin swapTemp := f1; f1 := f2; f2 := swapTemp; end;
if f2 = 0 then exit;
for b := 0 to BsrQWord(f2) do
begin
if odd(f2) then
result := result + f1;
f1 := f1 shl 1;
f2 := f2 shr 1;
end;
end;
- In addition to doing the same,
fpc_mul_qword_overflowcan resort to uncheckedfpc_mul_qwordwhenBsrQWord(f1) + BsrQWord(f2) <= bitsizeof(QWord) - 2, because this condition guarantees that the multiplication won't overflow and will trigger often.