[x86] CMOV optimisation overhaul
Summary
This merge request overhauls the CMOV optimisations present in OptPass2Jcc. Though it started as a straight refactor so the code can more easily be ported to AArch64 in the future, some unexpected improvements appeared along the way.
- CMOV block creation "outlined" into a separate object that's created only when needed.
- Duplicated or similar code greatly reduced to improve maintenance and reduce bugs.
- New combinations 'discovered' and categorised (hand-written sketches of a couple of these appear after this list):
  - (EXISTS) "Simple": Jcc .Lbl1 / MOV / .Lbl1:. Optimised to CMOV(~c).
  - (EXISTS) "Detour": Jcc .Lbl1 / MOV / JMP .Lbl1. Optimised to CMOV(~c) / JMP .Lbl1.
  - (EXISTS) "Double": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1: / MOV / .Lbl2:. Optimised to CMOV(~c) / CMOVcc, with some optimisations if the two CMOVs make a complementary pair (opposite conditions but writing to the same register).
  - "Branching": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1:. If it's not the classic "Double", this can be converted to CMOV(~c) / J(~c) .Lbl2.
  - "Double Branching (Same)": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1: / MOV / JMP .Lbl2. In this case, .Lbl2 appears elsewhere, but there's an extra jump at the end of the second block that goes to the same location. The end result is CMOV(~c) / CMOVcc / JMP .Lbl2, with some optimisations if the two CMOVs make a complementary pair.
  - "Double Branching (Different)": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1: / MOV / JMP .Lbl3. Like the above, but the two jumps go to different places. This is mostly treated the same as the single "Branching" type, although complementary CMOV optimisations take place as well; otherwise, what follows .Lbl1 is largely left untouched.
  - "Double Second Branching": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1: / MOV / JMP .Lbl3 / .Lbl2:. Like "Double Branching (Different)", only .Lbl2 appears after the second block. Most of the same optimisations as "Double" and "Double Branching (Same)" are performed, only the final jump is made conditional. The end result is CMOV(~c) / CMOVcc / Jcc .Lbl3 / .Lbl2:.
- OptPass2CMP and OptPass2TEST, along with a secondary TryCmpCMovOpts routine, have been introduced to optimise complementary CMOV pairs and reposition MOV instructions that are otherwise missed in a few circumstances, usually as a result of a "Branching" block being created and the leftover Jcc making a new CMOV block that, for whatever reason, can't easily be merged to make one of the double block types.
- Because the FLAGS register has to be allocated over much of the CMOV blocks, some mov $0,%reg instructions don't get optimised to xor %reg,%reg even though it's actually safe to do so. As such, PostPeepholeOptMov has been adapted so that, if the FLAGS are in use, it searches ahead to see if the flags get overwritten (e.g. by a CMP instruction) and so can be safely scrambled (the second sketch below illustrates this).

For all the double types, .Lbl1 can only have a single reference.
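To make the taxonomy concrete, here is a hand-written "Simple" block (hypothetical registers and labels, not actual compiler output):
cmpl $0,%ebx
jne .Lbl1
movl %edx,%eax
.Lbl1:
Under the "Simple" rule this collapses to a single CMOV with the inverted condition:
cmpl $0,%ebx
cmovel %edx,%eax
A "Double" block whose two MOVs write to the same register:
cmpl $0,%ebx
jne .Lbl1
movl %edx,%eax
jmp .Lbl2
.Lbl1:
movl %ecx,%eax
.Lbl2:
becomes CMOV(~c) / CMOVcc:
cmpl $0,%ebx
cmovel %edx,%eax
cmovnel %ecx,%eax
Since exactly one of the complementary pair always fires, one possible further simplification (sketched here; the exact form the optimiser uses may differ) is to demote one CMOV to a plain MOV and hoist it above the comparison:
movl %ecx,%eax
cmpl $0,%ebx
cmovel %edx,%eax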
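Similarly, a hand-written illustration of the PostPeepholeOptMov situation (again not compiler output; registers and constants are made up):
cmpl %edx,%ecx
cmovel %esi,%eax
movl $0,%ebx # FLAGS allocated across the CMOV block, so this was previously left alone...
cmpl $100,%edi # ...but the flags are completely rewritten here before the next read
cmovgl %edx,%ebx
Since the CMP overwrites the flags before they are read again, the zeroing MOV can now safely become the shorter, flag-scrambling XOR:
cmpl %edx,%ecx
cmovel %esi,%eax
xorl %ebx,%ebx # scrambles FLAGS, but they are dead at this point
cmpl $100,%edi
cmovgl %edx,%ebx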
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
CMOV block optimisations are now applied more widely, and the compiler source has been refactored so that this particular optimisation is easier to maintain.
Development Notes
- During initial testing, there was a mysterious failure on i386-linux, namely in the test tintfcdecl2.pp (odd, because the peephole optimizer doesn't run for that test); it turned out to be a code generation error. The bug that caused it, an error in the new TryCmpCMovOpts routine, was fixed as a result. It goes to show though that it is very hard to track all the effects of new compiler features, since they can combine with existing optimisations in unexpected ways or otherwise remain disguised until a very particular code sequence unveils them.
Relevant logs and/or screenshots
Well, where to begin. Virtually every source file sees some kind of improvement, both in speed and size, the latter mostly from the stripping of unnecessary alignment fields. For a simple introduction, in the ServiceManager unit under x86_64-win64, -O4 - before:
.section .text.n_servicemanager$_$tservicemanager_$_configservice$qword$tservicedescriptor_$$_stopchar$hiub462steen,"ax"
.balign 16,0x90
.globl SERVICEMANAGER$_$TSERVICEMANAGER_$_CONFIGSERVICE$QWORD$TSERVICEDESCRIPTOR_$$_STOPCHAR$hiuB462sTeEN
SERVICEMANAGER$_$TSERVICEMANAGER_$_CONFIGSERVICE$QWORD$TSERVICEDESCRIPTOR_$$_STOPCHAR$hiuB462sTeEN:
.Lc322:
cmpq $0,(%rdx)
jne .Lj442
xorl %eax,%eax
ret
.p2align 4,,10
.p2align 3
.Lj442:
movq (%rdx),%rdx
testq %rdx,%rdx
jne .Lj444
leaq FPC_EMPTYCHAR(%rip),%rdx
.Lj444:
movq %rdx,%rax
.Lc323:
ret
After:
.section .text.n_servicemanager$_$tservicemanager_$_configservice$qword$tservicedescriptor_$$_stopchar$hiub462steen,"ax"
.balign 16,0x90
.globl SERVICEMANAGER$_$TSERVICEMANAGER_$_CONFIGSERVICE$QWORD$TSERVICEDESCRIPTOR_$$_STOPCHAR$hiuB462sTeEN
SERVICEMANAGER$_$TSERVICEMANAGER_$_CONFIGSERVICE$QWORD$TSERVICEDESCRIPTOR_$$_STOPCHAR$hiuB462sTeEN:
.Lc322:
xorl %ecx,%ecx
cmpq $0,(%rdx)
cmoveq %rcx,%rax
je .Lj443
movq (%rdx),%rdx
testq %rdx,%rdx
jne .Lj444
leaq FPC_EMPTYCHAR(%rip),%rdx
.Lj444:
movq %rdx,%rax
.Lj443:
.Lc323:
ret
This could actually be improved further with the right checks, since jne .Lj444 will always branch (the value at (%rdx) cannot be zero at that point, because je .Lj443 did not branch). Following the flow of logic, the theoretical best assembly language for this would be:
xorl %eax,%eax
cmpq $0,(%rdx)
cmovneq (%rdx),%rax
ret
Though it would get optimised out in this instance, there is potential to support LEA instructions in a limited capacity in CMOV blocks, in a similar way to how constants are handled.
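For illustration, since LEA has no side effects, the .Lj444 fragment above could in principle have the LEA evaluated into a spare register ahead of the comparison, much as constants are handled now (a sketch of hoped-for output, assuming %rcx is free, not something the compiler currently produces):
leaq FPC_EMPTYCHAR(%rip),%rcx # harmless to execute unconditionally
testq %rdx,%rdx
cmoveq %rcx,%rdx # only commit the address if %rdx was zero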
In the weditor unit, showcasing a "double branching (different)" style block - before:
call *232(%rax)
cmpb $0,32(%rsp)
jne .Lj504
xorl %ecx,%ecx
jmp .Lj505
.p2align 4,,10
.p2align 3
.Lj504:
xorl %ecx,%ecx
xorl %r8d,%r8d
jmp .Lj507
After:
call *232(%rax)
xorl %ecx,%ecx
cmpb $0,32(%rsp)
je .Lj505
xorl %ecx,%ecx
xorl %r8d,%r8d
jmp .Lj507
There is an inefficiency in that xorl %ecx,%ecx is executed twice, but this is not due to this optimisation - indeed, in the original Pascal code:
...
if S='' then
CP:=0
else
begin
CP:=0; RX:=0;
...
CP corresponds to %ecx. Nevertheless, new peephole optimisations or even node-level optimisations should be able to pick this up.
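For reference, a pass that tracked %ecx being already zero when the branch is taken could drop the duplicate and produce something like this (a hand-written sketch, not current output):
call *232(%rax)
xorl %ecx,%ecx # now covers both paths
cmpb $0,32(%rsp)
je .Lj505
xorl %r8d,%r8d # the duplicate xorl %ecx,%ecx is gone
jmp .Lj507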
In base64 - before:
...
movq %rcx,%rbx
cmpq $-1,32(%rcx)
je .Lj37
movq 32(%rcx),%r12
jmp .Lj34
.p2align 4,,10
.p2align 3
.Lj37:
...
After:
...
movq %rcx,%rbx
cmpq $-1,32(%rcx)
cmovneq 32(%rcx),%r12
jne .Lj34
...
In the system unit - before:
...
.Lj309:
movzbl (%rcx),%r9d
movzbl (%rdx),%r11d
subq %r11,%r9
movq %r9,%r10
testq %r9,%r9
jnl .Lj313
movq $-1,%rax
ret
.p2align 4,,10
.p2align 3
.Lj313:
testq %r9,%r9
jng .Lj316
movl $1,%eax
ret
.p2align 4,,10
.p2align 3
.Lj316:
...
.Lc211:
ret
After:
...
.Lj309:
movzbl (%rcx),%r9d
movzbl (%rdx),%r11d
subq %r11,%r9
movq %r9,%r10
movq $-1,%r11
testq %r9,%r9
cmovlq %r11,%rax
jl .Lj307
movl $1,%r11d
testq %r9,%r9
cmovgq %r11,%rax
jg .Lj307
...
.Lj307:
.Lc211:
ret
(the second test instruction could also be removed - something for later!)
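Concretely, because MOV, CMOVcc and Jcc all leave the FLAGS untouched, the result of the first testq %r9,%r9 is still valid at the second one, so the block could in principle shrink to (hand-written sketch):
movq $-1,%r11
testq %r9,%r9 # these FLAGS survive the CMOV, Jcc and MOV below
cmovlq %r11,%rax
jl .Lj307
movl $1,%r11d
cmovgq %r11,%rax
jg .Lj307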
Also in the system unit - before:
.seh_proc SYSTEM$_$STR_REAL$SMALLINT$SMALLINT$DOUBLE$TREAL_TYPE$OPENSTRING_$$_GEN_DIGITS_64$hUXTohxUvDOC
...
cmpq $1000000000,%r9
jnb .Lj1771
xorl %r9d,%r9d
xorl %r12d,%r12d
movl %ecx,%edi
jmp .Lj1772
.p2align 4,,10
.p2align 3
.Lj1771:
...
After:
.seh_proc SYSTEM$_$STR_REAL$SMALLINT$SMALLINT$DOUBLE$TREAL_TYPE$OPENSTRING_$$_GEN_DIGITS_64$hUXTohxUvDOC
...
xorl %eax,%eax
cmpq $1000000000,%r9
cmovbl %eax,%r9d
cmovbl %eax,%r12d
cmovbl %ecx,%edi
jb .Lj1772
...
In aasmcpu, before:
...
.Lj385:
cmpw $118,92(%rbx)
jne .Lj391
movw $116,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj391:
cmpw $116,92(%rbx)
jne .Lj394
movw $118,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj394:
cmpw $59,92(%rbx)
jne .Lj397
movw $57,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj397:
cmpw $57,92(%rbx)
jne .Lj400
movw $59,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj400:
cmpw $119,92(%rbx)
jne .Lj403
movw $117,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj403:
cmpw $117,92(%rbx)
jne .Lj406
movw $119,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj406:
cmpw $60,92(%rbx)
jne .Lj409
movw $58,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj409:
...
After:
...
.Lj385:
movw $116,%ax
cmpw $118,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $118,%ax
cmpw $116,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $57,%ax
cmpw $59,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $59,%ax
cmpw $57,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $117,%ax
cmpw $119,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $119,%ax
cmpw $117,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $58,%ax
cmpw $60,92(%rbx)
cmovew %ax,%si
je .Lj389
...