Fast "x mod constant = 0" optimisation

These changes aim to increase the speed of "x mod n = 0" operations, where n is a constant, on x86 platforms. Expansion to other platforms is planned for the future. Algorithms from Hacker's Delight, Second Edition are used.

From the additions to the bdiv test, these are the timings on my development laptop:

Trunk:

              Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.757 ns
                Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns
             Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns
               Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 6.461 ns
            Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.931 ns
              Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 6.286 ns
            Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.990 ns
          Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 1.048 ns
              Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.698 ns
                Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns
             Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.757 ns
           Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 6.403 ns
            Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.990 ns
       Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 6.286 ns
  Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.990 ns

New algorithm:

              Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.524 ns
                Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns
             Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns
               Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns
            Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns
              Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns
            Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.582 ns
          Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 0.582 ns
              Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns
                Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.815 ns
             Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns
           Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 0.873 ns
            Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.815 ns
       Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 0.873 ns
  Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.757 ns 

In some cases, the produced assembly language is greatly improved. For example, in the IsLeapYear function in SysUtils:

Trunk:

.section .text.n_sysutils_$$_isleapyear$word$$boolean,"ax"
    .balign 16,0x90.globl    SYSUTILS_$$_ISLEAPYEAR$WORD$$BOOLEAN
SYSUTILS_$$_ISLEAPYEAR$WORD$$BOOLEAN:
    pushl   %ebx
    movzwl  %ax,%ecx
    testl   $3,%ecx
    jne    .Lj6455
    movl    %ecx,%ebx
    movl    $1374389535,%eax
    mull    %ebx
    shrl    $5,%edx
    imull    $100,%edx
    subl    %edx,%ebx
    jne    .Lj6456
    movl    $1374389535,%eax
    mull    %ecx
    shrl    $7,%edx
    imull   $400,%edx
    subl    %edx,%ecx
    jne    .Lj6455
.Lj6456:
    movb    $1,%al
    jmp    .Lj6460
.Lj6455:
    xorb    %al,%al
.Lj6460:
    popl    %ebx
    ret

New algorithm:

.section .text.n_sysutils_$$_isleapyear$word$$boolean,"ax"
    .balign 16,0x90
.globl    SYSUTILS_$$_ISLEAPYEAR$WORD$$BOOLEAN
SYSUTILS_$$_ISLEAPYEAR$WORD$$BOOLEAN:
    movzwl  %ax,%edx
    andl    $3,%edx
    jne    .Lj6455
    imulw   $23593,%ax,%dx
    rorw    $2,%dx
    cmpw    $655,%dx
    ja    .Lj6456
    imulw   $23593,%ax,%ax
    rorw    $4,%ax
    cmpw    $163,%ax
    jnbe    .Lj6455
.Lj6456:
    movb    $1,%al
    ret
.Lj6455:
    xorb    %al,%al
    ret 

Criteria

Apply patch and confirm correct compilation and faster operations under i386 and x86_64 where "x mod n = 0" conditions occur, with n being a constant.

Additional notes

Optimisation is not performed if -Co is specified because this tends to cause internal errors with the multiplication nodes that are generated.

Edited by J. Gareth "Kit" Moreton

Merge request reports

Loading