[x86-64] 32-bit to 64-bit zero extension optimisation
Summary
This merge request optimises occasional code sequences where multiple MOVs are used to zero-extend a 32-bit register into a 64-bit register, such as:
movl %esi,%eax
movq %rax,%r13
The peephole optimizer will now change this to:
movl %esi,%eax ; <-- This gets removed if %rax is not used afterwards
movl %esi,%r13d
If the source and target registers are the same, then an andl
instruction is generated instead - for example:
movl %r12d,%eax
movq %rax,%r12
...becomes:
movl %r12d,%eax ; <-- This gets removed if %rax is not used afterwards
andl %r12d,%r12d
System
- Processor architecture: x86-64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Code is made smaller and slightly faster where 32-bit to 64-bit zero extensions are concerned.
Relevant logs and/or screenshots
In fmtbcd (under -O4, x86_64-win64), before:
...
.Lj1222:
movb %r13b,%dil
movl %esi,%eax
movq %rax,%r13
addq $1,%r13
jno .Lj1223
call FPC_OVERFLOW
.Lj1223:
...
After:
...
.Lj1222:
movb %r13b,%dil
movl %esi,%r13d
addq $1,%r13
jno .Lj1223
call FPC_OVERFLOW
.Lj1223:
...
This is an interesting case for future optimisation: if you study the code carefully, you'll note that jno .Lj1223
will always branch, since %r13 will be between 0 and $FFFFFFFF before the addq $1,%r13
instruction and so the increment can never set the overflow flag.
In jdmainct, before:
...
.Lj25:
leal 1(%r8d),%edx
addl $1,%r8d
...
movl %r8d,%edx
movq %rdx,%rsi
movq (%r9,%rdx,8),%rdx
...
After:
...
.Lj25:
leal 1(%r8d),%edx
addl $1,%r8d
...
movl %r8d,%esi
movq (%r9,%rsi,8),%rdx
...
Here, the peephole optimizer is not yet smart enough to know that the upper 32 bits of %r8 are equal to zero, so it can't replace %rsi in the reference with %r8 without the help of !74.
The System unit demonstrates the same-register case - before:
.Lj2266:
movl %r12d,%eax
movq %rax,%r12
After:
.Lj2266:
andl %r12d,%r12d
The Sysutils unit has a string of examples that showcase a longer-range version of the optimisation - before:
.section .text.n_sysutils$_$tmultireadexclusivewritesynchronizer_$__$$_threadidtoindex$longword$$longint,"ax"
...
movl %edx,%eax
movq %rax,%rcx
shrq $12,%rcx
xorl %edx,%ecx
movq %rax,%rdx
shrq $32,%rdx
xorl %edx,%ecx
movq %rax,%rdx
shrq $36,%rdx
...
After:
.section .text.n_sysutils$_$tmultireadexclusivewritesynchronizer_$__$$_threadidtoindex$longword$$longint,"ax"
...
movl %edx,%eax
movl %edx,%ecx
shrq $12,%rcx
xorl %edx,%ecx
andl %edx,%edx
shrq $32,%rdx
xorl %edx,%ecx
andl %edx,%edx
shrq $36,%rdx
...
This is another interesting case for future optimisation, because the SHR instructions with shift counts of 32 or more will always zero the register, since its upper 32 bits are already zero.