Fast "x mod constant = 0" optimisation
These changes aim to increase the speed of "x mod n = 0" operations, where n is a constant, on x86 platforms. Expansion to other platforms is planned for the future. Algorithms from Hacker's Delight, Second Edition are used.
From the additions to the bdiv test, these are the timings on my development laptop:
Trunk:
Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.757 ns
Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns
Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns
Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 6.461 ns
Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.931 ns
Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 6.286 ns
Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.990 ns
Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 1.048 ns
Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.698 ns
Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns
Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.757 ns
Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 6.403 ns
Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.990 ns
Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 6.286 ns
Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.990 ns
New algorithm:
Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.524 ns
Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns
Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns
Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns
Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns
Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns
Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.582 ns
Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 0.582 ns
Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns
Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.815 ns
Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns
Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 0.873 ns
Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.815 ns
Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 0.873 ns
Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.757 ns
In some cases, the produced assembly language is greatly improved. For example, in the IsLeapYear function in SysUtils:
Trunk:
.section .text.n_sysutils_$$_isleapyear$word$$boolean,"ax"
.balign 16,0x90.globl SYSUTILS_$$_ISLEAPYEAR$WORD$$BOOLEAN
SYSUTILS_$$_ISLEAPYEAR$WORD$$BOOLEAN:
pushl %ebx
movzwl %ax,%ecx
testl $3,%ecx
jne .Lj6455
movl %ecx,%ebx
movl $1374389535,%eax
mull %ebx
shrl $5,%edx
imull $100,%edx
subl %edx,%ebx
jne .Lj6456
movl $1374389535,%eax
mull %ecx
shrl $7,%edx
imull $400,%edx
subl %edx,%ecx
jne .Lj6455
.Lj6456:
movb $1,%al
jmp .Lj6460
.Lj6455:
xorb %al,%al
.Lj6460:
popl %ebx
ret
New algorithm:
.section .text.n_sysutils_$$_isleapyear$word$$boolean,"ax"
.balign 16,0x90
.globl SYSUTILS_$$_ISLEAPYEAR$WORD$$BOOLEAN
SYSUTILS_$$_ISLEAPYEAR$WORD$$BOOLEAN:
movzwl %ax,%edx
andl $3,%edx
jne .Lj6455
imulw $23593,%ax,%dx
rorw $2,%dx
cmpw $655,%dx
ja .Lj6456
imulw $23593,%ax,%ax
rorw $4,%ax
cmpw $163,%ax
jnbe .Lj6455
.Lj6456:
movb $1,%al
ret
.Lj6455:
xorb %al,%al
ret
Criteria
Apply patch and confirm correct compilation and faster operations under i386 and x86_64 where "x mod n = 0" conditions occur, with n being a constant.
Additional notes
Optimisation is not performed if -Co is specified because this tends to cause internal errors with the multiplication nodes that are generated.