Generic fpc_mul_qword_checkoverflow (probably also fpc_mul_qword) are really slow

Usually, debug mode that does almost no optimizations and has range checks enabled slows down the execution by the factor of two or three, which is manageable.

Today I noticed that one of my performance-critical paths suddenly slowed down on x86-64 by the factor of 15. This starts to become unpleasant (like, have to wait one minute instead of 4 seconds). I have narrowed this problem down to numerous SizeUint * SizeUint multiplications that produce fpc_mul_qword_checkoverflow calls under range checks.

There are two possible solutions in my mind. Either:

Make a specialized version for x86-64, as x86-32 already does (rtl/i386/int64p.inc). Should be easy (not that I can do it myself...), but will be limited to x86-64. Still, at least on x86-64 it will be useful all around, because qword * qword multiplications are native there and thus extremely common.
Generic implementation of fpc_mul_qword and fpc_mul_qword_checkoverflow (rtl/inc/int64.inc) can be improved.

Provided a half-width (uint32 * uint32) multiplication is available, resort to it in a well-known way.

function fpc_mul_qword(f1, f2: qword): qword;
type
	Half = uint32;
	Quart = uint16;
var
	hl, p0, p1, p2, p3: Half;
	carry: Quart;
begin
	hl := Hi(f1) * Lo(f2) + Lo(f1) * Hi(f2);
	p0 := Half(Lo(Lo(f1))) * Lo(Lo(f2));
	p1 := Half(Lo(Lo(f1))) * Hi(Lo(f2));
	p2 := Half(Hi(Lo(f1))) * Lo(Lo(f2));
	p3 := Half(Hi(Lo(f1))) * Hi(Lo(f2));
	carry := (Half(Hi(p0)) + Lo(p1) + Lo(p2)) shr bitsizeof(Quart);

	result :=
		Half(p0 + p1 shl bitsizeof(Quart) + p2 shl bitsizeof(Quart)) or
		qword(p3 + Hi(p1) + Hi(p2) + carry + hl) shl bitsizeof(Half);
end;

(Okay, this looks scary... maybe I overcomplicated something...)

Current implementation can use BsrQWord to improve the common case when one of numbers to be multiplied is small.

function fpc_mul_qword(f1, f2: qword): qword;
var
	swapTemp: qword;
	b: byte;
begin
	result := 0;
	// Ensure f1 >= f2 and both nonzero, to use BsrQWord and to minimize iterations count.
	if f1 < f2 then begin swapTemp := f1; f1 := f2; f2 := swapTemp; end;
	if f2 = 0 then exit;

	for b := 0 to BsrQWord(f2) do
	begin
		if odd(f2) then
			result := result + f1;
		f1 := f1 shl 1;
		f2 := f2 shr 1;
	end;
end;

In addition to doing the same, fpc_mul_qword_overflow can resort to unchecked fpc_mul_qword when BsrQWord(f1) + BsrQWord(f2) <= bitsizeof(QWord) - 2, because this condition guarantees that the multiplication won't overflow and will trigger often.

Edited Nov 24, 2021 by Rika

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information