x86: New TEST optimisations
Summary
This merge request applies some optimisations to TEST/JNE pairs. Most notably, it removes superfluous TEST instructions when the flags have not changed and the comparison is identical (similar to the optimisation that removes superfluous CMP instructions). It also merges TEST instructions that each check individual bits of the same operand and branch to the same label - notably, if a sequence such as the following shows up:
testb $2,144(%rax)
jne .Lj1691
testb $32,144(%rax)
jne .Lj1691
This can be compressed by merging the tested bits together with bitwise OR; in this case, producing:
testb $34,144(%rax)
jne .Lj1691
Note that this optimisation is not valid if "je" is used instead of "jne", but there are other condition combinations for which it is valid.
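The asymmetry between "jne" and "je" can be checked exhaustively for byte operands. This is a minimal model of the branch behaviour, not compiler code; it simply evaluates "jump taken" for every possible byte value:

```python
# "jne" is taken when (value & mask) != 0; "je" when (value & mask) == 0.

def jne_pair(value):
    # testb $2 / jne ; testb $32 / jne  (same target label)
    return (value & 2) != 0 or (value & 32) != 0

def jne_merged(value):
    # testb $34 / jne   (34 = 2 | 32)
    return (value & 34) != 0

def je_pair(value):
    # the same masks with "je": jumps if EITHER bit is clear
    return (value & 2) == 0 or (value & 32) == 0

def je_merged(value):
    # merged test with "je": jumps only if BOTH bits are clear
    return (value & 34) == 0

# jne: the merged form branches identically for every byte value
assert all(jne_pair(v) == jne_merged(v) for v in range(256))

# je: the merge changes behaviour (e.g. value 2 has bit 32 clear)
assert any(je_pair(v) != je_merged(v) for v in range(256))
```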
Besides reducing the instruction count, it should provide a marginal speed increase because the number of conditional branches is reduced.
Additionally, "testb $255,###", "testw $65535,###" and "testl $4294967295,###" (all bits set for the given operand size) get their constants changed to -1 to improve optimisation potential.
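The reason -1 works as a uniform replacement is that the all-ones mask for each operand size equals -1 truncated to that width, so normalising the constant makes the masks size-independent. A quick check of the arithmetic:

```python
# All-ones mask for a given operand width, in bits.
def all_ones(bits):
    return (1 << bits) - 1

# -1 truncated to each width reproduces the original constants:
assert -1 & all_ones(8)  == 255          # testb $255,...
assert -1 & all_ones(16) == 65535        # testw $65535,...
assert -1 & all_ones(32) == 4294967295   # testl $4294967295,...
```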
System
- Processor architecture: i386, x86_64, possibly i8086
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
A large number of optimisations are performed in the compiler source and a few in the RTL. Similarly complex projects should also benefit.
Relevant logs and/or screenshots
Some simple examples: aasmcpu in the compiler - before:
.Lj538:
testb $32,32(%rsp)
jne .Lj544
testb $64,32(%rsp)
jne .Lj544
testb $128,32(%rsp)
je .Lj547
After:
.Lj538:
testb $96,32(%rsp)
jne .Lj544
testb $128,32(%rsp)
je .Lj547
In ncal, three TEST/JNE pairs get collapsed into a single one - before:
.Lj1006:
movl 308(%rbx),%eax
testl $64,%eax
jne .Lj969
testl $8,%eax
jne .Lj969
testl $128,%eax
jne .Lj969
After:
.Lj1006:
movl 308(%rbx),%eax
testl $200,%eax
jne .Lj969
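The merged constant here is just the bitwise OR of the three original masks, and (since all three jumps target .Lj969 with "jne") the merged test branches identically. A small sanity check of this specific case:

```python
# The three masks OR together into the single merged constant:
assert 64 | 8 | 128 == 200

# The merged TEST/JNE branches identically; checking all values that
# cover every combination of the relevant low bits is sufficient,
# since higher bits do not interact with these masks.
for v in range(1 << 9):
    separate = any((v & m) != 0 for m in (64, 8, 128))
    merged = (v & 200) != 0
    assert separate == merged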
In the System unit, a few superfluous TESTs get removed - before:
.Lj259:
movzwl (%rcx),%r11d
movzwl (%rdx),%ebx
subq %rbx,%r11
movq %r11,%r10
testq %r11,%r11
je .Lj263
testq %r11,%r11
jnl .Lj265
After:
.Lj259:
movzwl (%rcx),%r11d
movzwl (%rdx),%ebx
subq %rbx,%r11
movq %r11,%r10
testq %r11,%r11
je .Lj263
jnl .Lj265
In unicodedata, a duplicate TEST/JLE pair gets removed - before:
.section .text.n_unicodedata_$$_indexinarraydword$array_of_longword$longword$$int64,"ax"
.balign 16,0x90
.globl UNICODEDATA_$$_INDEXINARRAYDWORD$array_of_LONGWORD$LONGWORD$$INT64
UNICODEDATA_$$_INDEXINARRAYDWORD$array_of_LONGWORD$LONGWORD$$INT64:
movq $-1,%rax
addq $1,%rdx
testl %edx,%edx
jle .Lj364
testl %edx,%edx
jle .Lj364
...
After:
.section .text.n_unicodedata_$$_indexinarraydword$array_of_longword$longword$$int64,"ax"
.balign 16,0x90
.globl UNICODEDATA_$$_INDEXINARRAYDWORD$array_of_LONGWORD$LONGWORD$$INT64
UNICODEDATA_$$_INDEXINARRAYDWORD$array_of_LONGWORD$LONGWORD$$INT64:
movq $-1,%rax
addq $1,%rdx
testl %edx,%edx
jle .Lj364
...
Additional Notes
The TEST instructions are allowed to have a memory reference as the second operand because the registers that compose the reference are guaranteed not to change in between. There might be some side effects if another thread modifies the same memory location, but if anything, side effects are reduced because the number of TEST instructions is reduced; besides, it is on the programmer to properly synchronise such multithreaded access.
Despite the savings in the compiler's assembly output, the x86_64-win64 compiler grows by 512 bytes due to added code that performs jump optimisations early, while the information it needs is already at hand, thus providing a speed boost for the compiler.
Ultimately, this optimisation benefits the compiler itself more than the RTL, probably due to, for example, the large number of sets that the compiler uses as states.