x86: Deeper insight in OptPass2ADD and OptPass2SUB to produce more efficient code (!77) · FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:addsubmov-insight into main Oct 20, 2021

Summary

This merge request improves the OptPass2ADD and OptPass2SUB optimisations in a few ways:

The optimisations under -O3 now search further ahead than the next instruction, permitting longer-distance optimisations.
If the next instruction that uses the destination register is a MOV, the peephole optimizer now calls OptPass2MOV first. Some specialised optimisations, such as those that produce a "cqto", yield much faster and smaller code, and that opportunity is lost if the ADD/SUB optimisation is performed first.
Fixed a bug where AddMov2Lea and SubMov2Lea weren't performed under -Os.

There is a new utility function called GetNextInstructionUsingRegCount. It behaves the same way as GetNextInstructionUsingReg, but returns a Cardinal instead of a Boolean to indicate how far away the next instruction is (it returns 0 whenever GetNextInstructionUsingReg would return False).
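
To make the return-value contract concrete, here is a minimal toy model in Python. It is not the FPC implementation: the function names, the list-of-tuples instruction format and the register-set representation are all assumptions made for illustration, and the model ignores any early-stop conditions the real function may have.

```python
from typing import List, Set, Tuple

# Toy instruction: (text, set of registers it touches).
# Hypothetical representation, not FPC's tai/taicpu classes.
Instruction = Tuple[str, Set[str]]


def next_instruction_using_reg_count(
    instrs: List[Instruction], start: int, reg: str
) -> int:
    """Distance (in instructions) from instrs[start] to the next
    instruction touching `reg`, or 0 if there is none."""
    for distance, (_, regs) in enumerate(instrs[start + 1:], start=1):
        if reg in regs:
            return distance
    return 0


def next_instruction_using_reg(
    instrs: List[Instruction], start: int, reg: str
) -> bool:
    """Boolean variant: True exactly when the counting variant is non-zero."""
    return next_instruction_using_reg_count(instrs, start, reg) != 0


# First three instructions of the "before" loop body from this merge request,
# with sub-registers folded into their 64-bit names for simplicity.
LOOP = [
    ("addl $1,%ecx", {"rcx"}),
    ("movq 8(%rdi),%r11", {"rdi", "r11"}),
    ("movl %ecx,%r8d", {"rcx", "r8"}),
]
```

In this model, a search for `rcx` starting at the `addl` finds the `movl` two instructions later, while a register that never appears yields 0, matching the case where the Boolean variant reports False.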

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Under -Os, additional optimisations are made that shrink the code. At regular optimisation levels, the optimisations help break dependency chains at the cost of increased code size.

Relevant logs and/or screenshots

For example (in Classes under -O4) - before:

        ...
.Lj161:
	addl	$1,%ecx
	movq	8(%rdi),%r11
	movl	%ecx,%r8d
	movq	8(%rsi),%r10
	movl	%ecx,%r9d
	movl	(%r11,%r8,4),%r8d
	andl	%r8d,(%r10,%r9,4)
	cmpl	%edx,%ecx
	jnge	.Lj161
        ...

After:

        ...
.Lj161:
	leal	1(%ecx),%r8d // <-- "mov %ecx,%r8d" moved here and changed to "lea 1(%ecx),%r8d" as it can now be executed in parallel with "addl $1,%ecx" - same with "mov %ecx,%r9d".
	leal	1(%ecx),%r9d
	addl	$1,%ecx
	movq	8(%rdi),%r11
	movq	8(%rsi),%r10
	movl	(%r11,%r8,4),%r8d
	andl	%r8d,(%r10,%r9,4)
	cmpl	%edx,%ecx
	jnge	.Lj161
        ...
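
The register effect of this rewrite can be checked with a small toy simulation. The sketch below is written in Python rather than taken from the compiler; the tuple encoding and the `run` helper are hypothetical, and the model covers only 32-bit register values (no flags, no memory accesses).

```python
MASK = 0xFFFFFFFF  # 32-bit registers wrap around


def run(program, regs):
    """Execute a list of toy instructions on a copy of `regs`."""
    regs = dict(regs)
    for op, *args in program:
        if op == "add":        # ("add", imm, dst): dst += imm
            imm, dst = args
            regs[dst] = (regs[dst] + imm) & MASK
        elif op == "mov":      # ("mov", src, dst): dst = src
            src, dst = args
            regs[dst] = regs[src]
        elif op == "lea":      # ("lea", disp, base, dst): dst = base + disp
            disp, base, dst = args
            regs[dst] = (regs[base] + disp) & MASK
        else:
            raise ValueError(f"unknown op: {op}")
    return regs


# Register-only part of the loop body before the optimisation ...
BEFORE = [("add", 1, "ecx"), ("mov", "ecx", "r8d"), ("mov", "ecx", "r9d")]
# ... and after it: the LEAs no longer depend on the result of the ADD.
AFTER = [("lea", 1, "ecx", "r8d"), ("lea", 1, "ecx", "r9d"), ("add", 1, "ecx")]


def equivalent(start_values):
    """True if both sequences leave identical registers for every start value."""
    return all(
        run(BEFORE, {"ecx": v, "r8d": 0, "r9d": 0})
        == run(AFTER, {"ecx": v, "r8d": 0, "r9d": 0})
        for v in start_values
    )
```

Both sequences leave the same values in %ecx, %r8d and %r9d, including at the 32-bit wrap-around; the difference is that in the second sequence neither LEA has to wait for the ADD to complete.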

Future work

In the example above, both %r8d and %r9d are set to %ecx + 1. If it can be detected before Pass 2 that %r8 = %r9, uses of %r8 can be replaced with %r9. The first LEA can then be dropped, because %r8 gets overwritten anyway, ideally resulting in:

        ...
.Lj161:
	leal	1(%ecx),%r9d
	addl	$1,%ecx
	movq	8(%rdi),%r11
	movq	8(%rsi),%r10
	movl	(%r11,%r9,4),%r8d
	andl	%r8d,(%r10,%r9,4)
	cmpl	%edx,%ecx
	jnge	.Lj161
        ...
