[x86] Inefficient CMOVcc/Jcc pairs
Summary
Special thanks to @runewalsh over at !610 (merged) for this one!
This merge request aims to correct some overzealous CMOV optimisations that may not be obvious until after the fact. If a "branching" block is made where CMOVcc is followed by Jcc with the same condition, and the register set by CMOVcc is deallocated after the jump (i.e. on the fall-through path), then the CMOVcc instruction can be changed into a regular MOV instruction and potentially moved before the comparison instruction for faster execution speed and smaller code size.
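As a schematic illustration (placeholder registers and label, not actual compiler output): if the register written by CMOVcc is dead on the fall-through path, the conditional move only matters when the jump is taken, so it can become an unconditional MOV hoisted above the comparison:
# hypothetical input
cmpl $0,%edx
cmovnel %ecx,%eax # %eax assumed dead on the fall-through path
jne .Lsomewhere
# becomes
movl %ecx,%eax # unconditional; harmless when the jump is not taken
cmpl $0,%edx
jne .Lsomewhere # CMP and Jcc now adjacent and can macro-fuse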
There is also a cleanup of OptPass2MOV and an additional MOV/CMP/MOV optimisation that attempts to rearrange registers so more of the MOVs can be placed before the comparison, thus increasing the chance of macro-fusing CMP with a subsequent Jcc. The conditions for this optimisation are quite strict (they require that another MOV or Jcc instruction follow) in order to best offset the performance loss from the potential pipeline stall introduced between the first MOV and CMP by increasing the chances of macro-fusion between CMP and Jcc.
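A minimal sketch of the idea (hypothetical registers and constant, assuming %edx is not used on the fall-through path): the comparison is redirected to an existing copy of the compared value, so the MOV that clobbers the original register can move above the CMP:
movq %rdx,%rcx # a copy of the compared value already exists
cmpq $100,%rdx
cmovbl %eax,%edx
jb .Ldone
# becomes
movq %rdx,%rcx
movl %eax,%edx # CMOV converted to MOV and hoisted, as above
cmpq $100,%rcx # comparison redirected to the copy in %rcx
jb .Ldone # CMP and Jcc left adjacent for macro-fusion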
Additional Notes
Unfortunately the most common case of this - setting the function result and exiting (e.g. via Exit(0);) - tends not to be optimised, because the result register is frequently allocated across the entire procedure and so is seen to still be in use after the jump, even though it may get overwritten later.
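For illustration, a hypothetical sequence of the kind that currently slips through (register names and label are mine): %eax holds the function result and is marked allocated for the whole routine, so the liveness check sees it as still in use past the jump:
xorl %ecx,%ecx
testb $1,%dl
cmovnel %ecx,%eax # could become a plain MOV hoisted above the TEST...
jne .Lepilogue # ...but %eax still appears live after this jump
movl $1,%eax # even though the fall-through path overwrites the result
.Lepilogue: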
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Some CMOVcc instructions are changed back into MOV instructions to increase performance.
Relevant logs and/or screenshots
In aasmcpu (x86_64-win64, -O4), the most common example with a constant is shown - before:
...
movl $2,%eax
testb $2,32(%rsp)
cmovnel %eax,%edx
jne .Lj572
movzbl 2(%rsi),%edx
.Lj572:
...
After:
...
movl $2,%edx
testb $2,32(%rsp)
jne .Lj572
movzbl 2(%rsi),%edx
.Lj572:
...
The system unit shows a register-only example - before:
.Lj1527:
...
cmpw %ax,%si
cmovgw %ax,%si
testw %r15w,%r15w
cmovlw %ax,%r13w
jl .Lj1537
...
After:
.Lj1527:
...
cmpw %ax,%si
cmovgw %ax,%si
movw %ax,%r13w
testw %r15w,%r15w
jl .Lj1537
...
Also in the system unit, the MOV/CMP/MOV optimisation in OptPass2MOV comes into play - before:
.section .text.n_system$_$str_real$smallint$smallint$double$treal_type$openstring_$$_gen_digits_64$huxtohxuvdoc,"ax"
...
movq %r9,%rcx
xorl %eax,%eax
cmpq $1000000000,%r9
cmovbl %eax,%r9d
cmovbl %eax,%r12d
cmovbl %ecx,%edi
jb .Lj1743
...
After:
.section .text.n_system$_$str_real$smallint$smallint$double$treal_type$openstring_$$_gen_digits_64$huxtohxuvdoc,"ax"
...
movq %r9,%rcx
xorl %r9d,%r9d
xorl %r12d,%r12d
movl %ecx,%edi
cmpq $1000000000,%rcx
jb .Lj1743
...
In the same procedure in the system unit, a register and a constant CMOV get optimised together - before:
.section .text.n_system$_$str_real$smallint$smallint$double$treal_type$openstring_$$_gen_digits_64$huxtohxuvdoc,"ax"
...
movl %ecx,%edi
xorl %ecx,%ecx
cmpq $1000000000,%rdx
cmovbl %ecx,%r9d
cmovbl %edx,%r12d
jb .Lj1743
...
After:
.section .text.n_system$_$str_real$smallint$smallint$double$treal_type$openstring_$$_gen_digits_64$huxtohxuvdoc,"ax"
...
movl %ecx,%edi
xorl %r9d,%r9d
movl %edx,%r12d
cmpq $1000000000,%rdx
jb .Lj1743
...
The refactor of OptPass2MOV also fixed a case of inefficient optimisation in rgobj, where a 64-bit average of two 32-bit values was being computed with unnecessary 64-bit arithmetic - before:
...
.Lj479:
movl %esi,%eax
movl %ebx,%edx
addq %rdx,%rax
shrq $1,%rax
movl %eax,%ebp
...
After:
...
.Lj479:
movl %esi,%eax
addl %ebx,%eax
rcrl $1,%eax
movl %eax,%ebp
...
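For clarity, the rewritten sequence annotated (comments are mine): the carry from the 32-bit ADD holds bit 32 of the full sum, and RCR rotates it back into the top bit, so no precision is lost:
movl %esi,%eax # eax := esi
addl %ebx,%eax # 32-bit add; the carry flag now holds bit 32 of the sum
rcrl $1,%eax # rotate the carry into the MSB: eax := (esi + ebx) >> 1
movl %eax,%ebp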
Future Work
There was a noted example where performance is lost slightly in SysUtils - before:
.section .text.n_sysutils_$$_datetimetofiledate$tdatetime$$int64,"ax"
...
setab %dl
xorl %ecx,%ecx
orb %dl,%al
cmovneq %rcx,%rax
jne .Lj11254
...
After:
.section .text.n_sysutils_$$_datetimetofiledate$tdatetime$$int64,"ax"
...
setab %dl
orb %dl,%al
testb %al,%al
movl $0,%eax
jne .Lj11254
...
Here, %dl is not used after the OR instruction (%eax is the function result, and the jump goes to the function epilogue), and one potential optimisation, once the correct conditions are properly worked out, is to reverse the operands on the OR instruction and change the TEST instruction to use %dl instead, thus becoming:
.section .text.n_sysutils_$$_datetimetofiledate$tdatetime$$int64,"ax"
...
setab %dl
orb %al,%dl
testb %dl,%dl
movl $0,%eax
jne .Lj11254
...
This can then be further optimised to:
.section .text.n_sysutils_$$_datetimetofiledate$tdatetime$$int64,"ax"
...
setab %dl
orb %al,%dl
xorl %eax,%eax
testb %dl,%dl
jne .Lj11254
...
The performance is then truly faster: the MOV 0, now positioned before the TEST, becomes an XOR again and can execute in parallel with the OR instruction due to register renaming, while the TEST and JNE instructions can undergo macro-fusion. (This is still not perfect, as the TEST instruction doesn't get removed by the post-peephole stage, but the cycle count is 3, compared to about 4 both in trunk and with this patch, and the actual machine code is smaller.)