[Cross-platform] Label and align stripping rework
Summary
This merge request reworks the label stripping that is done during the peephole optimisation stage, since it caused some inefficiencies elsewhere.
- General stripping is no longer performed during Pass 1 and Pass 2, but is instead performed during the post-peephole optimisation stage after all other major optimisations have been performed.
- The SkipInstr set now also includes ait_align, since alignment hints offer no useful information on their own for the peephole optimiser, and legitimate ones should always precede a live label (which is not a member of SkipInstr).
- The stripping of alignment hints is now smarter: alignments that appear before dead labels are removed, and those before live labels are generally preserved, which improves execution performance. It also handles non-jump labels better.
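The stripping rule in the last bullet can be sketched as a small Python model. This is not the compiler's code: the tuple representation, the function name strip_aligns and the non_jump parameter are all hypothetical, and the handling of non-jump labels (keep the label, drop its alignment) is an assumption inferred from the db example further down.

```python
def strip_aligns(instrs, ref_counts, non_jump=frozenset()):
    """Model of align/label stripping over a list of (kind, arg) tuples.

    kind is "op", "align" or "label". ref_counts maps jump labels to their
    reference counts. non_jump names labels that are always kept
    (assumption: their preceding alignment is not worth keeping).
    """
    out = []
    pending = []  # alignment hints waiting to see what follows them
    for kind, arg in instrs:
        if kind == "align":
            pending.append((kind, arg))
            continue
        if kind == "label":
            if arg in non_jump:
                out.append((kind, arg))      # keep label, drop its aligns
            elif ref_counts.get(arg, 0) > 0:
                out.extend(pending)          # live label: keep its aligns
                out.append((kind, arg))
            # dead jump label: both the label and its aligns are dropped
            pending = []
            continue
        pending = []  # assumption: an align not followed by a label is dropped
        out.append((kind, arg))
    return out


example = [
    ("op", "ret"),
    ("align", 16), ("label", ".Lj1867"),
    ("op", "xorw %ax,%ax"),
    ("align", 16), ("label", ".Lc1473"),
    ("op", "ret"),
]
result = strip_aligns(example, {".Lj1867": 1}, non_jump={".Lc1473"})
```

In this model the align before .Lj1867 survives because the label is referenced, while the one before .Lc1473 is dropped.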
System
- Processor architecture: Cross-platform (but affects x86 more than most)
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Execution speed is overall improved thanks to smarter align preservation or removal.
Additional notes
Some optimisations under x86 (and likely AArch64 later) would benefit from being able to 'resurrect' a dead label for some jump optimisations, but because the actual objects were released almost as soon as their reference counts fell to zero, this was not possible. Additionally, some inefficient alignment hints remained.
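To illustrate why deferring the release matters, here is a minimal Python model of a label table whose entries survive a reference count of zero until an explicit final strip. The class and method names are hypothetical and do not correspond to the compiler's actual types.

```python
class LabelTable:
    """Toy label table: entries are kept at zero references until strip_dead()."""

    def __init__(self):
        self.refs = {}

    def add_ref(self, name):
        # Works for new labels and for "dead" ones that are still in the table.
        self.refs[name] = self.refs.get(name, 0) + 1

    def drop_ref(self, name):
        # The entry is deliberately NOT deleted when the count reaches zero.
        self.refs[name] -= 1

    def is_live(self, name):
        return self.refs.get(name, 0) > 0

    def strip_dead(self):
        """Final pass: remove and return every label with no references left."""
        dead = sorted(n for n, c in self.refs.items() if c == 0)
        for n in dead:
            del self.refs[n]
        return dead


table = LabelTable()
table.add_ref(".Lj1")
table.add_ref(".Lj2")
table.drop_ref(".Lj1")       # .Lj1 is now dead but still present
table.add_ref(".Lj1")        # a later optimisation resurrects it
table.drop_ref(".Lj2")
removed = table.strip_dead() # only labels that are still dead are removed
```

With immediate release, the second add_ref on .Lj1 would have had nothing to attach to; here only .Lj2 is removed by the final strip.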
Relevant logs and/or screenshots
Using x86_64-win64, -O4 as examples, since these contain the most complex cross-jump optimisations and the like...
For a simple and common example, in the classes unit, before:
...
.Lj1539:
movq %rbx,%rcx
call CLASSES$_$TPARSER_$__$$_HANDLENEWLINE
jmp .Lj1533
.Lj1535:
nop
leaq 32(%rsp),%rsp
popq %rbx
.Lc1433:
ret
After, an align between JMP and the next label gets preserved:
...
.Lj1539:
movq %rbx,%rcx
call CLASSES$_$TPARSER_$__$$_HANDLENEWLINE
jmp .Lj1533
.balign 16,0x90
.Lj1535:
nop
leaq 32(%rsp),%rsp
popq %rbx
.Lc1433:
ret
In the db unit, before:
...
movw $1,%ax
ret
.balign 16,0x90
.Lj1870:
movw $2,%ax
ret
.balign 16,0x90
.Lj1867:
xorw %ax,%ax
.balign 16,0x90
.Lc1473:
ret
After, the align that appears before the final RET instruction is removed, thereby saving 14 bytes in this case:
movw $1,%ax
ret
.balign 16,0x90
.Lj1870:
movw $2,%ax
ret
.balign 16,0x90
.Lj1867:
xorw %ax,%ax
.Lc1473:
ret
...
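As a rough guide to how much a removed .balign 16 can save, the padding it emits depends only on the current offset modulo 16. The helper below is a hypothetical sketch, and the offsets are illustrative rather than measured from the db unit; a saving of 14 bytes corresponds to the label falling 2 bytes past a 16-byte boundary.

```python
def align_padding(offset, alignment=16):
    """Bytes of padding needed to move `offset` up to the next multiple of `alignment`."""
    return (-offset) % alignment


# Illustrative offsets only (not taken from the db unit).
paddings = {off: align_padding(off) for off in (0, 2, 15)}
```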
Also in the db unit, before:
...
testq %rax,%rax
jnl .Lj133
movl $-1,%edx
jmp .Lj134
.p2align 4,,10
.p2align 3
.Lj133:
xorl %edx,%edx
testq %rax,%rax
setgb %dl
.Lj134:
testl %edx,%edx
jne .Lj139
...
After:
...
testq %rax,%rax
jnl .Lj133
movl $-1,%edx
jmp .Lj134
.p2align 4,,10
.p2align 3
.Lj133:
xorl %edx,%edx
testq %rax,%rax
setgb %dl
.p2align 4,,10
.p2align 3
.Lj134:
testl %edx,%edx
jne .Lj139
...
This one is possibly more questionable, since .Lj134 only has one reference, which is on a nearby jump. However, there are cases elsewhere where the equivalent label in similar sequences has more than one reference. Nevertheless, future research may find savings here. This sequence corresponds to the "sign" function for input %rax, and the TEST instruction after .Lj134 is deterministic if control flow came from jmp .Lj134 above, because of the movl $-1,%edx before it. Optimising that jump to jmp .Lj139 would reduce the reference count of .Lj134 to zero.
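The determinism claim can be checked with a small Python model of the sequence quoted above. The function name and return shape are hypothetical; the comments map each step to the instruction it stands for.

```python
def sign_sequence(rax):
    """Return (edx, came_from_jmp, jne_taken) for the quoted sequence."""
    if rax >= 0:                      # testq %rax,%rax ; jnl .Lj133
        edx = 1 if rax > 0 else 0     # xorl %edx,%edx ; testq ; setgb %dl
        came_from_jmp = False         # falls through into .Lj134
    else:
        edx = -1                      # movl $-1,%edx
        came_from_jmp = True          # jmp .Lj134
    jne_taken = edx != 0              # testl %edx,%edx ; jne .Lj139
    return edx, came_from_jmp, jne_taken
```

Whenever came_from_jmp is true, edx is -1 and the jne is always taken, which is why retargeting that jump straight to .Lj139 would be valid in this sequence.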
In the regexpr unit, before:
.section .text.n_regexpr_$$_isanylinebreak$ansichar$$boolean,"ax"
.balign 16,0x90
.globl REGEXPR_$$_ISANYLINEBREAK$ANSICHAR$$BOOLEAN
REGEXPR_$$_ISANYLINEBREAK$ANSICHAR$$BOOLEAN:
.Lc8:
cmpb $10,%cl
jb .Lj14
subb $12,%cl
jbe .Lj15
subb $1,%cl
jne .Lj14
.balign 16,0x90
.Lj15:
movb $1,%al
ret
.balign 16,0x90
.Lj14:
xorb %al,%al
.balign 16,0x90
.Lc9:
ret
After, thanks to SkipInstr now including alignment fields, another peephole optimisation is able to optimise this procedure further and properly convert jne .Lj14 into a SET/RET pair (the align before the final RET is also removed), resulting in a smaller and faster function:
.section .text.n_regexpr_$$_isanylinebreak$ansichar$$boolean,"ax"
.balign 16,0x90
.globl REGEXPR_$$_ISANYLINEBREAK$ANSICHAR$$BOOLEAN
REGEXPR_$$_ISANYLINEBREAK$ANSICHAR$$BOOLEAN:
.Lc8:
cmpb $10,%cl
jb .Lj14
subb $12,%cl
jbe .Lj15
subb $1,%cl
seteb %al
ret
.balign 16,0x90
.Lj15:
movb $1,%al
ret
.balign 16,0x90
.Lj14:
xorb %al,%al
.Lc9:
ret
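As a sanity check that the two listings compute the same result, both can be modelled in Python over every 8-bit input. This is only a sketch of the flag behaviour of the instructions shown (unsigned compare, borrow and zero flags); the function names are hypothetical.

```python
def before(cl):
    """Model of the original listing; returns the final value of %al."""
    if cl < 10:                       # cmpb $10,%cl ; jb .Lj14
        return 0
    borrow = cl < 12                  # CF after subb $12,%cl
    cl = (cl - 12) & 0xFF
    if borrow or cl == 0:             # jbe .Lj15 (CF=1 or ZF=1)
        return 1
    cl = (cl - 1) & 0xFF              # subb $1,%cl
    if cl != 0:                       # jne .Lj14
        return 0
    return 1                          # falls through into .Lj15


def after(cl):
    """Model of the optimised listing; returns the final value of %al."""
    if cl < 10:                       # cmpb $10,%cl ; jb .Lj14
        return 0
    borrow = cl < 12
    cl = (cl - 12) & 0xFF
    if borrow or cl == 0:             # jbe .Lj15
        return 1
    cl = (cl - 1) & 0xFF              # subb $1,%cl
    return 1 if cl == 0 else 0        # seteb %al ; ret


same = all(before(c) == after(c) for c in range(256))
accepted = [c for c in range(256) if after(c)]
```

In this model both versions return 1 exactly for the byte values 10 to 13, so the SET/RET form changes the code shape but not the result.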
One notable inefficiency is in the zip unit, before:
...
movl %eax,%esi
testl %eax,%eax
jne .Lj142
...
movl %eax,%esi
testl %eax,%eax
jne .Lj142
...
After:
...
movl %eax,%esi
testl %eax,%eax
jne .Lj130
...
movl %eax,%esi
.Lj130:
testl %esi,%esi
jne .Lj142
...
This, however, is not the fault of the label stripping reform, but an inefficiency in the "Deep MOV" optimisation that is resolved in !517 (merged).
There are other inefficiencies, such as some TEST instructions no longer being converted to BT, but this is seen as an inefficiency or oversight in that particular peephole optimisation rather than a fault in the label stripping overhaul.