[x86 (currently)] New Sliding Window-based assembly-level CSE subsystem
Summary
This merge request introduces a new subsystem that greatly improves x86's peephole optimizer by performing assembly-level common subexpression elimination by way of a sliding window to find initial matches.
Please read attached PDF whitesheet to explain the feature in deeper detail: Sliding_Window_CSE.pdf
System
- Processor architecture: i386, x86_64, possibly i8086. Others will follow at a later date.
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Under -O3 and above, extensive speed and size optimisations are made across the board.
Relevant logs and/or screenshots
Virtually every large source file shows some improvements. Here are a couple of cherry-picked examples under -O4 - dbgdwarf's TDebugInfoDwarf.appendsym_fieldvar_with_name_offset method - before:
...
.Lj682:
leaq (,%r13,8),%rcx
movq 120(%rdi),%rax
cqto
idivq %rcx
imulq %r13,%rax
movq %rax,%r12
addq 56(%rbp),%r12
leaq (,%r13,8),%rcx
movq 120(%rdi),%rax
cqto
idivq %rcx
movq %rdx,%rsi
cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
...
A number of redundant pointer deallocations and an entire integer division instruction are successfully removed - after:
...
.Lj682:
leaq (,%r13,8),%rcx
movq 120(%rdi),%rax
cqto
idivq %rcx
imulq %r13,%rax
movq %rax,%r12
addq 56(%rbp),%r12
movq %rdx,%rsi
cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
...
SysUtils' IsLeapYear function - before:
...
imulw $23593,%cx,%ax
rorw $2,%ax
cmpw $655,%ax
ja .Lj6979
imulw $23593,%cx,%ax
rorw $4,%ax
cmpw $163,%ax
setbeb %al
ret
.Lj6979:
...
A partial sequence match is found and the second arithmetic sequence is replaced with just rorw $2,%ax
- after:
...
imulw $23593,%cx,%ax
rorw $2,%ax
cmpw $655,%ax
ja .Lj6979
rorw $2,%ax
cmpw $163,%ax
setbeb %al
ret
.Lj6979:
...
Additional Notes:
This feature is only available under -O3 by default, but can be enabled individually with -OoASMCSE; note that the peephole optimizer must also be enabled.
Compile time metrics have not yet been measured, but I predict a small penalty under -O3 and above, although it won't be as significant as it could be because the compiler itself receives a lot of optimisations to its generated code.