Draft: [x86] Cross-jump and cross-label optimisations (!110) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:mov-jcc-shortcut into main Dec 29, 2021

Summary

This merge request contains a number of peephole optimisations for x86 platforms that work across labels and jumps, mostly by looking for comparisons whose outcome is already known (deterministic CMP instructions and the like). As a result, some conditional branches become unconditional and some jumps are redirected to more efficient destinations. In some cases, entire blocks of code are optimised out, because dead labels and newly unconditional jumps leave dead code up to the next live label.
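To illustrate the last point, here is a minimal Python sketch of how code can become dead up to the next live label. It is not the compiler's implementation: `remove_dead_code` and the string-based instruction model are hypothetical, it makes a single pass, and it treats a label as live if any jump in the input still names it.

```python
def remove_dead_code(lines):
    """Drop unreachable instructions and unreferenced labels (one pass).

    `lines` is a list of strings; labels end with ':' and every other
    entry is 'mnemonic operands'. Hypothetical, simplified model.
    """
    # A label is "live" if some jump instruction still refers to it.
    live = {ln.split()[1] for ln in lines if ln.split()[0].startswith("j")}

    out, reachable = [], True
    for ln in lines:
        if ln.endswith(":"):
            if ln[:-1] in live:
                reachable = True      # a live label makes the code reachable again
                out.append(ln)
            continue                  # unreferenced labels are simply dropped
        if reachable:
            out.append(ln)
            if ln.split()[0] == "jmp":
                reachable = False     # nothing falls through an unconditional jump
    return out


# Hypothetical input: nothing jumps to .L3 any more, so its block is dead.
example = [
    "je .L2",
    "movb $1,%al",
    "jmp .L4",
    ".L2:",
    "xorb %al,%al",
    "jmp .L5",
    ".L3:",
    "testb %al,%al",
    "je .L5",
    ".L4:",
]
cleaned = remove_dead_code(example)
```

In this example the `testb`/`je` pair after `.L3` disappears, because the preceding `jmp` is unconditional and `.L3` has no remaining references.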

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

A large number of optimisations are applied that minimise the number of jumps taken.

Additional notes

The MOV/@lbl/CMP/Jcc optimisation, which inserts a JMP between the MOV and @lbl when the Jcc is certain to be taken if program flow arrives from the MOV, is not performed under -Os.
For the same optimisation, if the Jcc is certain not to be taken, a JMP to a label after the Jcc is NOT inserted, because the added branch would be slower overall than simply falling through the comparison and the conditional jump (which will also be macro-fused).
Some inefficiencies can appear under -O3 because of OptPass2JMP (which received a minor bug fix): some assignments can get doubled up. For example, in optconstprop, before:

.Lj74:
	xorb	%r12b,%r12b
	jmp	.Lj78
.Lj76:

After (this artefact appears twice in the unit):

.Lj74:
	xorb	%r12b,%r12b
	xorb	%r12b,%r12b
	jmp	.Lj86
.Lj76:

This doesn't get optimised out because OptPass2JMP is pass 2, not pass 1. It is a harmless anomaly, but I am thinking up solutions for this. There are also cases of "xor %reg,%reg; je .Lbl" which, while also harmless, are not ideal. This one, however, is corrected by !108 (merged).
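The decision rule for the MOV/@lbl/CMP/Jcc optimisation described in the first two notes can be sketched roughly as follows. This is only an illustration in Python, not the compiler's code: `jump_to_insert`, `CONDITIONS` and the `optimise_for_size` flag are hypothetical names, operands are assumed to be signed immediates, and only equality and signed conditions are modelled.

```python
# AT&T "cmp $imm,%reg" computes reg - imm, so each condition compares the
# register value (a) with the immediate (b). Unsigned conditions are omitted.
CONDITIONS = {
    "e":  lambda a, b: a == b,
    "ne": lambda a, b: a != b,
    "l":  lambda a, b: a < b,
    "le": lambda a, b: a <= b,
    "g":  lambda a, b: a > b,
    "ge": lambda a, b: a >= b,
}


def jump_to_insert(mov_imm, cmp_imm, cond, jcc_target, optimise_for_size=False):
    """Return the JMP to place between MOV and the label, or None.

    Models:  mov $mov_imm,%reg / .Lbl: / cmp $cmp_imm,%reg / j<cond> jcc_target
    """
    if optimise_for_size:
        return None                   # the extra JMP costs bytes (-Os)
    if CONDITIONS[cond](mov_imm, cmp_imm):
        return f"jmp\t{jcc_target}"   # Jcc is certain when arriving from the MOV
    return None                       # Jcc never taken: fall through CMP/Jcc


always_taken = jump_to_insert(1, 1, "e", ".Lj5")
never_taken = jump_to_insert(1, 1, "ne", ".Lj5")
size_build = jump_to_insert(1, 1, "e", ".Lj5", optimise_for_size=True)
```

Returning `None` in the never-taken case mirrors the second note: no jump is added, and execution simply runs through the comparison and the conditional jump.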

Relevant logs and/or screenshots

The aoptobj unit gets a small block removed thanks to noticing the deterministic comparisons - before:

	...
	je	.Lj220
.Lj218:
	movb	$1,%al
	jmp	.Lj221
.Lj220:
	xorb	%al,%al
	jmp	.Lj233
.Lj221:
	testb	%al,%al
	je	.Lj233
	.p2align 4,,10
	.p2align 3
.Lj210:
	...

After:

	...
	je	.Lj220
.Lj218:
	movb	$1,%al
	jmp	.Lj210
.Lj220:
	xorb	%al,%al
	je	.Lj233 # <- This is the anomalous "xor %reg,%reg; je .Lbl" mentioned above.
	.p2align 4,,10
	.p2align 3
.Lj210:
	...

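In cclasses, several optimisations combine: the JMP is redirected past the deterministic TEST/Jcc, and the remaining SETE/TEST/JNE sequence collapses into a single JE - before: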

	jne	.Lj1126
	movb	$1,%bl
	jmp	.Lj1127
	.p2align 4,,10
	.p2align 3
.Lj1126:
	movq	%r12,%r8
	call	SYSTEM_$$_COMPAREBYTE$formal$formal$INT64$$INT64
	testq	%rax,%rax
	seteb	%bl
.Lj1127:
	testb	%bl,%bl
	jne	.Lj1129
	movb	$0,32(%rsp)
	jmp	.Lj1107
	.p2align 4,,10
	.p2align 3
.Lj1129:

After:

	jne	.Lj1126
	movb	$1,%bl
	jmp	.Lj1129
	.p2align 4,,10
	.p2align 3
.Lj1126:
	movq	%r12,%r8
	call	SYSTEM_$$_COMPAREBYTE$formal$formal$INT64$$INT64
	testq	%rax,%rax
	je	.Lj1129
	movb	$0,32(%rsp)
	jmp	.Lj1107
	.p2align 4,,10
	.p2align 3
.Lj1129:
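The jump redirection seen in the aoptobj and cclasses listings above can be modelled with a small Python sketch. It is a hedged illustration rather than the actual peephole code: `redirect_jump` is a hypothetical name, only JE/JNE after `test %reg,%reg` are handled, and the register is assumed to hold a known constant when the JMP is reached.

```python
def redirect_jump(known_value, cond, jcc_target, fallthrough_label=None):
    """New destination for 'jmp .Lbl' when .Lbl begins with
    'test %reg,%reg / j<cond> jcc_target' and %reg holds known_value.

    fallthrough_label is the label directly after the Jcc, if there is one.
    Returns None when no better destination is available.
    """
    if cond not in ("e", "ne"):
        raise ValueError("only JE/JNE are modelled in this sketch")
    zero_flag = (known_value == 0)    # TEST %reg,%reg sets ZF iff %reg is zero
    taken = zero_flag if cond == "e" else not zero_flag
    if taken:
        return jcc_target             # skip the TEST/Jcc and go straight there
    return fallthrough_label          # Jcc not taken: land just after it


# cclasses: movb $1,%bl / jmp .Lj1127, where .Lj1127 is testb %bl,%bl / jne .Lj1129
cclasses_target = redirect_jump(1, "ne", ".Lj1129")
# aoptobj: movb $1,%al / jmp .Lj221, where .Lj221 is testb %al,%al / je .Lj233 / .Lj210:
aoptobj_target = redirect_jump(1, "e", ".Lj233", ".Lj210")
```

Once every jump to the original label has been redirected like this, the label is dead and the TEST/Jcc behind it becomes a candidate for removal or further merging.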

Another example in fgl - before:

	...
.Lj206:
	addl	$1,%r14d
	movq	96(%rsp),%rdx
	cmpq	%rdx,%rdi
	jne	.Lj210
	movb	$1,%r15b
	jmp	.Lj211
	.p2align 4,,10
	.p2align 3
.Lj210:
	movslq	%ebp,%r8
	movq	%rdi,%rcx
	call	SYSTEM_$$_COMPAREBYTE$formal$formal$INT64$$INT64
	testq	%rax,%rax
	seteb	%r15b
.Lj211:
	testb	%r15b,%r15b
	jne	.Lj213
	...

After:

	...
.Lj206:
	addl	$1,%r14d
	movq	96(%rsp),%rdx
	cmpq	%rdx,%rdi
	jne	.Lj210
	movb	$1,%r15b
	jmp	.Lj213
	.p2align 4,,10
	.p2align 3
.Lj210:
	movslq	%ebp,%r8
	movq	%rdi,%rcx
	call	SYSTEM_$$_COMPAREBYTE$formal$formal$INT64$$INT64
	testq	%rax,%rax
	je	.Lj213
	...

In ncal, a block of register and constant assignments gets copied in after a jump is resolved, due to OptPass2JMP - before:

	...
	seteb	-624(%rbp)
	jmp	.Lj17
.Lj15:
	movb	$0,-624(%rbp)
.Lj17:
	cmpb	$0,-624(%rbp)
	je	.Lj19
	...
.Lj19:
	movq	%rbx,%r15
	movq	%rbx,-16(%rbp)
	xorl	%esi,%esi
	xorl	%r12d,%r12d
	jmp	.Lj21
	...

After:

	...
	seteb	-624(%rbp)
	jmp	.Lj17
.Lj15:
	movb	$0,-624(%rbp)
	movq	%rbx,%r15
	movq	%rbx,-16(%rbp)
	xorl	%esi,%esi
	xorl	%r12d,%r12d
	jmp	.Lj21
.Lj17:
	cmpb	$0,-624(%rbp)
	je	.Lj19
	...
.Lj19:
	movq	%rbx,%r15
	movq	%rbx,-16(%rbp)
	xorl	%esi,%esi
	xorl	%r12d,%r12d
	jmp	.Lj21
	...

nmat has quite a few examples of the following - before:

	...
	cmpb	$18,20(%rax)
	je	.Lj487
	testw	$32,42(%rax)
	je	.Lj489
.Lj487:
	movb	$1,%al
	jmp	.Lj490
.Lj489:
	xorb	%al,%al
	jmp	.Lj492
.Lj490:
	testb	%al,%al
	je	.Lj492
	movq	U_$SYMDEF_$$_CUNDEFINEDTYPE(%rip),%rax
	movq	%rax,80(%rbx)
	jmp	.Lj485
	.p2align 4,,10
	.p2align 3
.Lj492:
	movq	136(%rbx),%rcx
	...

After:

	...
	cmpb	$18,20(%rax)
	je	.Lj487
	testw	$32,42(%rax)
	je	.Lj489
.Lj487:
	movq	U_$SYMDEF_$$_CUNDEFINEDTYPE(%rip),%rax
	movq	%rax,80(%rbx)
	jmp	.Lj485
.Lj489:
	xorb	%al,%al
	movq	136(%rbx),%rcx
	...

In ptype, a more unexpected optimisation happens: an ultimately pointless MOV is removed and the jump destination is changed - before:

	...
.Lj189:
	movb	$1,%dil
	movb	$1,%r12b
	jmp	.Lj173
	.p2align 4,,10
	.p2align 3
.Lj186:
	...

After:

	...
.Lj189:
	movb	$1,%dil
	jmp	.Lj170
	.p2align 4,,10
	.p2align 3
.Lj186:
	...

sfpu128 manages to create a SET instruction from what's left! - before:

	...
	cmpq	%rcx,%rdx
	jb	.Lj1575
	seteb	%dl
	cmpq	$-1,%rax
	setbb	%al
	andb	%al,%dl
	je	.Lj1577
.Lj1575:
	movb	$1,%al
	jmp	.Lj1578
.Lj1577:
	xorb	%al,%al
	jmp	.Lj1574
.Lj1578:
	testb	%al,%al
	je	.Lj1574
.Lj1572:
	movb	$1,%al
	jmp	.Lj1579
.Lj1574:
	xorb	%al,%al
.Lj1579:
	movb	%al,%r14b
	...

After:

	...
	cmpq	%rcx,%rdx
	jb	.Lj1572
	seteb	%dl
	cmpq	$-1,%rax
	setbb	%al
	andb	%al,%dl
	setneb	%al
	jmp	.Lj1579
.Lj1572:
	movb	$1,%al
.Lj1579:
	movb	%al,%r14b
	...

Edited Jul 26, 2022 by J. Gareth "Kit" Moreton
