[x86 (currently)] New Sliding Window-based assembly-level CSE subsystem (!191)

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:sliding-window into main Mar 27, 2022

Summary

This merge request introduces a new subsystem that greatly improves x86's peephole optimizer by performing assembly-level common subexpression elimination (CSE), using a sliding window to find initial matches.

Please read the attached PDF white paper, which explains the feature in more detail: Sliding_Window_CSE.pdf

System

Processor architecture: i386, x86_64, possibly i8086. Others will follow at a later date.

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Under -O3 and above, extensive speed and size optimisations are made across the board.

Relevant logs and/or screenshots

Virtually every large source file shows some improvement. Here are a couple of cherry-picked examples under -O4. dbgdwarf's TDebugInfoDwarf.appendsym_fieldvar_with_name_offset method - before:

      ...
.Lj682:
      leaq    (,%r13,8),%rcx
      movq    120(%rdi),%rax
      cqto
      idivq    %rcx
      imulq    %r13,%rax
      movq    %rax,%r12
      addq    56(%rbp),%r12
      leaq    (,%r13,8),%rcx
      movq    120(%rdi),%rax
      cqto
      idivq    %rcx
      movq    %rdx,%rsi
      cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
      ...

A number of redundant pointer dereferences and an entire integer division instruction are successfully removed - after:

      ...
.Lj682:
      leaq    (,%r13,8),%rcx
      movq    120(%rdi),%rax
      cqto
      idivq    %rcx
      imulq    %r13,%rax
      movq    %rax,%r12
      addq    56(%rbp),%r12
      movq    %rdx,%rsi
      cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
      ...
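To illustrate what "finding initial matches" with a sliding window could look like, here is a minimal Python sketch. It is not the algorithm from this merge request or the white paper: the function name `find_repeated_run`, the `window` and `min_len` parameters, and the plain string comparison of instructions are all assumptions made for illustration. It only locates candidate duplicates; a real optimizer must additionally prove that the registers and memory involved are unchanged between the two copies before removing anything.

```python
from typing import List, Optional, Tuple


def find_repeated_run(
    instrs: List[str], window: int = 16, min_len: int = 2
) -> Optional[Tuple[int, int, int]]:
    """Return (first_start, second_start, length) for the longest run of
    instructions that is repeated verbatim, where the second copy starts
    at most `window` instructions after the first. Hypothetical helper."""
    best = None
    n = len(instrs)
    for i in range(n):
        # Only look ahead as far as the window allows.
        for j in range(i + 1, min(n, i + window + 1)):
            length = 0
            # Extend the match while both runs agree and do not overlap.
            while (
                j + length < n
                and i + length < j
                and instrs[i + length] == instrs[j + length]
            ):
                length += 1
            if length >= min_len and (best is None or length > best[2]):
                best = (i, j, length)
    return best


# The "before" listing from the dbgdwarf example, whitespace normalised.
before = [
    "leaq (,%r13,8),%rcx",
    "movq 120(%rdi),%rax",
    "cqto",
    "idivq %rcx",
    "imulq %r13,%rax",
    "movq %rax,%r12",
    "addq 56(%rbp),%r12",
    "leaq (,%r13,8),%rcx",
    "movq 120(%rdi),%rax",
    "cqto",
    "idivq %rcx",
    "movq %rdx,%rsi",
]

print(find_repeated_run(before))  # (0, 7, 4)
```

The reported match covers the four instructions from `leaq` to `idivq`, which are exactly the ones absent from the "after" listing. Their removal is safe in this case because nothing between the two copies overwrites %rcx or %rdx, so the remainder in %rdx is still valid when `movq %rdx,%rsi` executes.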

SysUtils' IsLeapYear function - before:

      ...
      imulw    $23593,%cx,%ax
      rorw    $2,%ax
      cmpw    $655,%ax
      ja    .Lj6979
      imulw    $23593,%cx,%ax
      rorw    $4,%ax
      cmpw    $163,%ax
      setbeb    %al
      ret
.Lj6979:
      ...

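A partial sequence match is found and the second arithmetic sequence is replaced with just rorw $2,%ax, since %ax still holds the result of the first multiply-and-rotate and a further 2-bit rotation gives the same value as the original 4-bit rotation - after: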

      ...
      imulw    $23593,%cx,%ax
      rorw    $2,%ax
      cmpw    $655,%ax
      ja    .Lj6979
      rorw    $2,%ax
      cmpw    $163,%ax
      setbeb    %al
      ret
.Lj6979:
      ...

Additional Notes:

By default, this feature is only enabled under -O3 and above, but it can be enabled individually with -OoASMCSE; note that the peephole optimizer must also be enabled.

Compile-time metrics have not yet been measured, but I predict a small penalty under -O3 and above. It should be less significant than it otherwise would be, because the compiler's own generated code also benefits from these optimisations.
