Generic fpc_mul_qword_checkoverflow (probably also fpc_mul_qword) are really slow
Usually, debug mode that does almost no optimizations and has range checks enabled slows down the execution by the factor of two or three, which is manageable.
Today I noticed that one of my performance-critical paths suddenly slowed down on x86-64
by the factor of 15. This starts to become unpleasant (like, have to wait one minute instead of 4 seconds). I have narrowed this problem down to numerous SizeUint * SizeUint
multiplications that produce fpc_mul_qword_checkoverflow
calls under range checks.
There are two possible solutions in my mind. Either:
-
Make a specialized version for
x86-64
, asx86-32
already does (rtl/i386/int64p.inc). Should be easy (not that I can do it myself...), but will be limited tox86-64
. Still, at least onx86-64
it will be useful all around, becauseqword * qword
multiplications are native there and thus extremely common. -
Generic implementation of
fpc_mul_qword
andfpc_mul_qword_checkoverflow
(rtl/inc/int64.inc) can be improved.
- Provided a half-width (
uint32 * uint32
) multiplication is available, resort to it in a well-known way.
function fpc_mul_qword(f1, f2: qword): qword;
type
Half = uint32;
Quart = uint16;
var
hl, p0, p1, p2, p3: Half;
carry: Quart;
begin
hl := Hi(f1) * Lo(f2) + Lo(f1) * Hi(f2);
p0 := Half(Lo(Lo(f1))) * Lo(Lo(f2));
p1 := Half(Lo(Lo(f1))) * Hi(Lo(f2));
p2 := Half(Hi(Lo(f1))) * Lo(Lo(f2));
p3 := Half(Hi(Lo(f1))) * Hi(Lo(f2));
carry := (Half(Hi(p0)) + Lo(p1) + Lo(p2)) shr bitsizeof(Quart);
result :=
Half(p0 + p1 shl bitsizeof(Quart) + p2 shl bitsizeof(Quart)) or
qword(p3 + Hi(p1) + Hi(p2) + carry + hl) shl bitsizeof(Half);
end;
(Okay, this looks scary... maybe I overcomplicated something...)
- Current implementation can use
BsrQWord
to improve the common case when one of numbers to be multiplied is small.
function fpc_mul_qword(f1, f2: qword): qword;
var
swapTemp: qword;
b: byte;
begin
result := 0;
// Ensure f1 >= f2 and both nonzero, to use BsrQWord and to minimize iterations count.
if f1 < f2 then begin swapTemp := f1; f1 := f2; f2 := swapTemp; end;
if f2 = 0 then exit;
for b := 0 to BsrQWord(f2) do
begin
if odd(f2) then
result := result + f1;
f1 := f1 shl 1;
f2 := f2 shr 1;
end;
end;
- In addition to doing the same,
fpc_mul_qword_overflow
can resort to uncheckedfpc_mul_qword
whenBsrQWord(f1) + BsrQWord(f2) <= bitsizeof(QWord) - 2
, because this condition guarantees that the multiplication won't overflow and will trigger often.