[x86 / Refactor (mostly)] Mov2Nop 8 efficiency check and refactor for maintainability
Summary
This merge request is twofold:
- The first commit adds some code to the Pass 2 loop so the third commit works properly.
- The second commit refactors the `TryMovArith2Lea` optimisation subroutine to be more self-contained and less reliant on external state; e.g. the check to see if the value is in range (and is actually a value) is now performed inside `TryMovArith2Lea`, along with the configuration of the register usage flags in `TmpUsedRegs`.
- The third commit focuses on the `movl %reg1d,%reg2d; movq %reg2q,%reg3q` optimisation (named `MovlMovq2MovlMovl 1`), making it deferred in most cases under `-O3` so it only runs on a second iteration of Pass 2 (it will always run on `-O2` and under because Pass 2 only runs once there). This has been shown to improve code generation when information about the 64-bit assignment has not been lost. Logically this would fit better in a "Pass 3" stage, but the only one that exists, the post-peephole optimisation stage, should not be used for complex optimisations (although if preferred by @FPK2, it can be written this way).
System
- Processor architecture: i386, x86-64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
`OptPass2MOV` is now slightly more maintainable with its nested `TryMovArith2Lea` function. Some code improvements now occur in rare situations at `-O3` and above; `-O2` and under should see no changes in binary output (except for the modified compiler source files).
Relevant logs and/or screenshots
By deferring `MovlMovq2MovlMovl 1`, another optimisation gets performed first that converts an unnecessary `movzbl` into a more efficient `movb` instruction in `bzip2` (`x86_64-win64`, `-O4`). Before:
```
...
.Lj21:
	movzbl	84(%rbx),%r9d
	movzbl	%sil,%edx
	movl	$8,%ecx
	subl	%edx,%ecx
	shrl	%cl,%r9d
	movb	%r9b,%al
	movzbl	%sil,%ecx
	shlb	%cl,84(%rbx)
...
```
After:
```
...
.Lj21:
	movzbl	84(%rbx),%r9d
	movzbl	%sil,%edx
	movl	$8,%ecx
	subl	%edx,%ecx
	shrl	%cl,%r9d
	movb	%r9b,%al
	movb	%sil,%cl
	shlb	%cl,84(%rbx)
...
```
A similar optimisation also appears in `bzip2stream`.