2-instruction 0-jump 0-cmov BSR/F on x86 based on half-documented reality. (!674) · Merge requests · FPC / FPC / FPC Source

Rika requested to merge runewalsh/source:bs into main May 11, 2024

Free Pascal documents that BSR/F return 255 on zero input, and presently they are compiled into:

{-$optimization nopeephole}
var
	x: uint32;
begin
	x := random(0) + 5;
	x := BsrDWord(x);
	writeln(x);
end.

; x := BsrDWord(x);
mov   ecx, $FF
bsr   ebx, eax
cmovz ebx, ecx

Until recent changes it was even (try nopeephole):

; x := BsrDWord(x);
bsr ebx, eax
jnz .done
mov ebx, $FF
.done:

Now, let’s look at the vendor’s documentation.

AMD:

If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register.

Intel:

If the content of the source operand is 0, the content of the destination operand is undefined.

In this seemingly unfortunate situation, we shall turn to folklore: [1], [2], [3], [4]. They say that despite wording, Intel silently implements the same behavior as AMD, which makes sense.

So, in practice, dest := BSF/R(src) can be compiled into MOV dest, $FF; BSF/R dest, src.

Benchmark: BsfBenchmark.pas.

My speedup: 0.9→0.3 ns/call.

Grains of salt:

None of the compilers on Godbolt do this, maybe for a reason, maybe not...
Folklore is discreet about ancient Intel CPUs, consult your sources.
I have positively no idea what’s going on in the code I changed, please recheck.
~~what happened with pipelines~~

2-instruction 0-jump 0-cmov BSR/F on x86 based on half-documented reality.

Merge request reports