[x86] Bit test peephole optimizations
Summary
This merge request aims to convert TEST, OR and AND instructions that have already been optimised into smaller BT, BTS and BTR instructions, specifically:
-
TEST x,%reg/ref; JNE .Lbl
, where x is a power of 2, becomesBT lb(x),%reg/ref; JC .Lbl
, where lb(x) is the base-2 logarithm of x. Other conditional instructions work so long as the conditions are either E/Z or NE/NZ, which are converted into NC and C respectively. -
OR x,%reg/ref
, where x is a power of 2, becomesBTS lb(x),%reg/ref
, although not if aTEST
instruction immediately follows. -
AND x,%reg/ref
, where ¬x is a power of 2 (¬ = bitwise NOT), beomesBTR lb(¬x),%reg/ref
, although not if aTEST
instruction immediately follows.
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
TEST
, OR
and AND
instructions that utilise constants will now sometimes be converted into BT
, BTS
and BTR
instructions respectively if the resultant machine code is smaller.
Additional notes
- The bit-test instructions have been present since the 30386 processor, but it wasn't until the Intel Pentium II processor that they have been just as fast as the standard logical instructions. Therefore, the optimisations are only performed under -Os or if the optimize CPU target is Pentium II or higher (under x86_64, this is always the case). Unfortunately, they are known to be slower on AMD processors.
- In accordance with making instructions as small as possible, only instructions of size S_L and S_Q are modified (S_W shows no size difference, and may actually increase if the register in use is %eax).
- Optimisations are done at the post peephole stage to ensure the instructions have already been simplified as far as possible and jump chains shortened.
- 64-bit immediates via
mov; and
etc. are not yet tested and optimised. Possible future expansion. - With 32-bit
OR
andAND
instructions, if an immediate can fit into a signed 8-bit integer, its machine code takes fewer bytes as the immediate can be stored in 8 bits rather than 32, so these instances are not optimised.TEST
instructions don't have this luxury.
Relevant logs and/or screenshots
A huge number of units receive changes. In aasmbase
(-O4, x86_64-win64), before:
.section .text.n_aasmbase_$$_create_smartlink_library$$boolean,"ax"
.balign 16,0x90
.globl AASMBASE_$$_CREATE_SMARTLINK_LIBRARY$$BOOLEAN
AASMBASE_$$_CREATE_SMARTLINK_LIBRARY$$BOOLEAN:
.Lc5:
testl $8192,U_$GLOBALS_$$_CURRENT_SETTINGS+72(%rip)
je .Lj12
...
After (instruction is 2 bytes smaller):
.section .text.n_aasmbase_$$_create_smartlink_library$$boolean,"ax"
.balign 16,0x90
.globl AASMBASE_$$_CREATE_SMARTLINK_LIBRARY$$BOOLEAN
AASMBASE_$$_CREATE_SMARTLINK_LIBRARY$$BOOLEAN:
.Lc5:
btl $13,U_$GLOBALS_$$_CURRENT_SETTINGS+72(%rip)
jnc .Lj12
...
In the aasmcpu
unit - before:
.Lj505:
movl (%rbx),%eax
andl $536870880,%eax
orl $8192,%eax
movl %eax,(%rbx)
jmp .Lj438
...
After (instruction is 1 byte smaller in this case rather than 2 since or imm,%eax
has a shorter byte encoding over other registers):
.Lj505:
movl (%rbx),%eax
andl $536870880,%eax
btsl $13,%eax
movl %eax,(%rbx)
jmp .Lj438
...
In the advancedipc
unit - before:
...
.Lj242:
movl $2147483647,%ecx
call SYSTEM_$$_RANDOM$LONGINT$$LONGINT
movl %eax,%ebx
call SYSTEM_$$_GETPROCESSID$$QWORD
xorl %eax,%ebx
andl $2147483647,%ebx
movl %ebx,-20(%rbp)
...
After (instruction is 2 bytes smaller):
...
.Lj242:
movl $2147483647,%ecx
call SYSTEM_$$_RANDOM$LONGINT$$LONGINT
movl %eax,%ebx
call SYSTEM_$$_GETPROCESSID$$QWORD
xorl %eax,%ebx
btrl $31,%ebx
movl %ebx,-20(%rbp)
...
Edited by J. Gareth "Kit" Moreton