Jump over some LOCK prefixes instead of having LOCK and non-LOCK branches.
First, slightly improve i386.inc:fpc_AnsiStr_Decr_Ref: use the scratch register edx instead of the nonvolatile esi, inline cpudeclocked, do what the title says, and tail-call FPC_FREEMEM. (I’d save one more level of indirection with {$ifndef FPC_PIC} jmp MemoryManager.FreeMem {$endif}, but that would require making MemoryManager accessible from there first...)
Second, jump over the LOCK prefixes in x86_64.inc:inclocked/declocked.
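You may reasonably doubt whether it is okay to jump over the LOCK prefix, that is, into the middle of the prefixed instruction. The short answer is “yes”: the prefix is a separate byte, so a jump that lands just after it executes the same instruction without the prefix. I found at least this example, in which the jump on the first line of the disassembly does exactly that: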
80ade23: 74 01 je 0x80ade26
80ade25: f0 0f c1 16 lock xadd %edx,(%esi)
I also found a mention in Agner Fog’s “Optimizing subroutines in assembly language” that, read very loosely, could be taken as vague advice against such jumps:
In total, this MR saves about 16 bytes of code in each of the 5 functions it touches, plus 35 lines of source.