[x86] Refactoring and extensions of CMOV optimisations (!198) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:cmov-const into main Apr 17, 2022

Summary

This merge request (which has been on and off for a few months now, hence the internal error numbers) rewrites the CMOV optimisations in OptPass2Jcc for x86 platforms and adds some new features, including the ability to optimise MOVs that write immediates to registers in places.

More specifically:

Existing CMOV optimisations in OptPass2Jcc have been refactored, mainly because it was proving difficult to remember what each local variable was set to (they now have more meaningful names, like hp_jump rather than hp4).
As a side-effect of the refactor, alignment hints are now properly removed with the labels during the CMOV optimisations.
OptPass2Jcc can now deal with Jcc/MOV/.lbl combinations where the MOV writes an immediate (constant) to a register by making use of the fairly new GetIntRegisterBetween instruction. This optimisation is not performed under -Os due to it creating additional instructions and hence increasing code size.
As part of the CMOV optimisations, the compiler will now pick up when two opposing CMOVs (have opposite conditions) write to the same register, and will try to ensure the first one is a MOV and is positioned as early as possible in order to improve performance.
As part of the above two points, the compiler will attempt to move the MOV instructions to before the CMP or TEST instruction (if they exist) that appears prior to the Jcc instruction. It will also try to go further if something like SUB or AND etc. appears before that instruction, thus ensuring optimisations performed later by PostPeepholeOptTestOr are not lost.
A new TrySwapMovOp method has been introduced to facilitate the above, and TrySwapMovCmp has been rewritten to use this. At the same time, some bug fixes were made where MOVSS was checked instead of MOVSD when checking to see if it's the 0-operand (i.e. REP MOVSD) or 2-operand version (the SSE instruction).
IsRefSafe now returns true if the reference type is addr_full, although i386 does not currently permit addr_full references in CMOV optimisations because these sometimes get turned into immediates (usually because the reference points to a small global constant) and raise assembler errors when they appear in CMOV instructions. This will be subject to further research.
OptPass1CMP now detects CMP/JE/MOV/.lbl and CMP/JNE/MOV/.lbl combinations where the operands of the CMP and MOV instructions are identical (possibly in reverse order). When found, only the MOV is kept when the conditional jump is JE, and everything is removed (unless the label has other references) in the JNE case, because the end result is that the two operands will always be identical afterwards. The case with JE seems to be popular in property setters.
Even if IsRefSafe returns False, if a reference in a MOV instruction appears exactly as it does in the most recent comparison (the one that sets the current conditional flags), then CMOV is permitted because if the reference is invalid, a segmentation fault will have already occurred.

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

More opportunities for removing conditional branches are found, converting them to CMOV sequences.

Relevant logs and/or screenshots

An example of the CMOV optimisation working successfully with an immediate in aasmcpu.s (under x86_64-win64 -O4) - before:

	...
.Lj400:
	cmpw	$58,92(%rbx)
	jne	.Lj380
	movw	$60,%si
.Lj380:
	...

After:

	...
.Lj400:
	movw	$60,%ax
	cmpw	$58,92(%rbx)
	cmovew	%ax,%si
.Lj380:
	...

A nice example in classes.s where two variables are conditionally set to zero, using the same temporary register - before:

	...
.Lj346:
	...
	cmpq	%rax,%r9
	jne	.Lj352
	xorl	%eax,%eax
	xorl	%ecx,%ecx
.Lj352:
	...

After:

	...
.Lj346:
	...
	xorl	%ebx,%ebx
	cmpq	%rax,%r9
	cmoveq	%rbx,%rax
	cmoveq	%rbx,%rcx
.Lj352:
	...

An example of a CMOV(c)/CMOV(~c) pair being converted to MOV/CMOV(~c) in avl_tree.s... - before:

	...
.Lj682:
	cmpl	%ebx,%edx
	cmovll	%ebx,%eax
	cmovnll	%edx,%eax
	...

After:

	...
.Lj682:
	movl	%ebx,%eax
	cmpl	%ebx,%edx
	cmovnll	%edx,%eax
	...

In the same unit, we get to see the optimisation in OptPass1Cmp in action (intermingled with the CMOV constant optimisation) - before:

.section .text.n_avl_tree$_$tavltreenodememmanager_$__$$_setmaxfreeratio$int64,"ax"
	.balign 16,0x90
.globl	AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64
AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64:
	movq	%rcx,%rax
	testq	%rdx,%rdx
	jnl	.Lj814
	xorl	%edx,%edx
.Lj814:
	cmpq	40(%rax),%rdx
	je	.Lj811
	movq	%rdx,40(%rax)
.Lj811:
	ret

After:

.section .text.n_avl_tree$_$tavltreenodememmanager_$__$$_setmaxfreeratio$int64,"ax"
	.balign 16,0x90
.globl	AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64
AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64:
	movq	%rcx,%rax
	xorl	%ecx,%ecx
	testq	%rdx,%rdx
	cmovlq	%rcx,%rdx
	movq	%rdx,40(%rax)
	ret

It's not perfect, as a stronger optimisation would be to swap %rax and %rcx (%rax isn't used in the result), although this is rather deep - i.e.:

.section .text.n_avl_tree$_$tavltreenodememmanager_$__$$_setmaxfreeratio$int64,"ax"
	.balign 16,0x90
.globl	AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64
AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64:
	xorl	%eax,%eax
	testq	%rdx,%rdx
	cmovlq	%rax,%rdx
	movq	%rdx,40(%rcx)
	ret

To emphasise the point about it being popular in property setters, this is the Pascal code for the above procedure:

procedure TAVLTreeNodeMemManager.SetMaxFreeRatio(NewValue: SizeInt);
begin
  if NewValue<0 then NewValue:=0;
  if NewValue=FMaxFreeRatio then exit;
  FMaxFreeRatio:=NewValue;
end;

A slightly less optimal CMOV(c)/CMOV(~c) conversion in cclasses.s - before:

	...
.Lj1119:
	...
	movq	%rax,%r13
	cmpq	%r14,%rax
	cmovgeq	%r14,%rax
	cmovngeq	%r13,%rax
	...

After:

	...
.Lj1119:
	...
	movq	%rax,%r13
	cmpq	%r14,%rax
	movq	%r14,%rax ; Note that the MOV instruction is not moved to before the CMP instruction because it would change the comparison result.
	cmovngeq	%r13,%rax
	...

This one might need analysis of the Pascal source for inefficiencies or to look more closely at the peephole optimizer, because the code still isn't optimal, mainly because %r13 = %rax at this point (%r13 isn't used afterwards), and the optimal code would actually be:

	...
.Lj1119:
	...
	cmpq	%r14,%rax
	cmovgeq	%r14,%rax ; (strictly speaking, cmovgq %r14,%rax would be best since this removes the extra latency of writing to %rax when it's already equal to %r14)
	...

Additional Notes

Quite a new cases of duplicate code have been noticed where the compiler checks to see if a comparison is testing against zero. Since it is before the post-peephole stage, it searches for cmp 0,%reg, test %reg,%reg and test $-1,%reg. A future refactor will put all these individual checks into a single Boolean function.

Edited Sep 12, 2022 by J. Gareth "Kit" Moreton

[x86] Refactoring and extensions of CMOV optimisations

Summary

System

What is the current bug behavior?

What is the behavior after applying this patch?

Relevant logs and/or screenshots

Additional Notes

Merge request reports