[x86] "x and (not y)" now uses ANDN
Summary
This merge request takes advantage of BMI1 (if it's enabled), the instruction set extension that provides ANDN, to convert expressions of the form x and (not y) to use the ANDN instruction at the node level.
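For reference, here is a minimal portable C model of what ANDN computes, written only to illustrate the operand order; the function names are hypothetical and are not part of the compiler. ANDN computes dest = (NOT src1) AND src2, so Pascal's x and (not y) places the complemented operand y in the first source slot. Because AT&T syntax reverses the operand order, `andn %a,%b,%c` in the listings below computes c = (not b) and a.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of the BMI1 ANDN instruction: dest = (NOT src1) AND src2.
   Intel syntax is "andn dest, src1, src2"; AT&T reverses the operands, so
   "andn %src2,%src1,%dest" complements the middle operand. */
static uint32_t andn32(uint32_t src1, uint32_t src2)
{
    return ~src1 & src2;
}

/* Pascal's "x and (not y)": the complemented operand y goes in src1. */
static uint32_t x_and_not_y(uint32_t x, uint32_t y)
{
    return andn32(y, x);
}

/* Compare against the NOT + AND sequence that ANDN replaces. */
static void check_andn_pattern(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x0Fu, 0x80000000u, 0xDEADBEEFu, 0xFFFFFFFFu
    };
    const size_t n = sizeof samples / sizeof samples[0];
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            uint32_t x = samples[i], y = samples[j];
            uint32_t tmp = y;  /* copy of y, since NOT overwrites its operand */
            tmp = ~tmp;        /* not */
            tmp &= x;          /* and */
            assert(x_and_not_y(x, y) == tmp);
        }
    }
}
```

The check loops over a handful of edge-case bit patterns; it is a sanity check of the identity, not an exhaustive proof.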
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
When compiling with -CpCOREAVX2, expressions of the form x and (not y) and similar should compile to shorter instruction sequences in many situations, reducing the cycle count.
Relevant logs and/or screenshots
The align function in the System unit (x86_64-win64 under -O4) - before:
.Lc162:
...
jne .Lj191
notq %rax
andq %r9,%rax
movq %rax,%rcx
ret
.p2align 4,,10
.p2align 3
.Lj191:
...
After:
.Lc162:
...
jne .Lj191
andn %r9,%rax,%rax
movq %rax,%rcx # <-- This can be removed too; I'm working on this in a separate improvement.
ret
.p2align 4,,10
.p2align 3
.Lj191:
...
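A minimal C model of the tail of this listing shows that the single ANDN computes the same value as the NOT/AND pair it replaces. Registers are modelled as plain 64-bit variables and the function names are hypothetical; the sample values (for example rax = 7 as a low-bit mask) are illustrative and not taken from the System unit source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: notq %rax ; andq %r9,%rax */
static uint64_t align_tail_before(uint64_t rax, uint64_t r9)
{
    rax = ~rax;   /* notq %rax     */
    rax &= r9;    /* andq %r9,%rax */
    return rax;
}

/* After: andn %r9,%rax,%rax, i.e. rax = (NOT rax) AND r9 */
static uint64_t align_tail_after(uint64_t rax, uint64_t r9)
{
    return ~rax & r9;
}

/* Both sequences must agree on every sample pair. */
static void check_align_tail(void)
{
    static const uint64_t samples[] = {
        0u, 7u, 0xFFu, 0x1234u, UINT64_C(0x8000000000000000), UINT64_MAX
    };
    const size_t n = sizeof samples / sizeof samples[0];
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            assert(align_tail_before(samples[i], samples[j]) ==
                   align_tail_after(samples[i], samples[j]));
}
```

With rax = 7, both versions clear the low three bits of r9, so 0x1234 becomes 0x1230.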
Sysutils' internalfindfirst routine has a rarer example of const and (not var); this transformation is not performed under -Os - before:
...
call fpc_unicodestr_assign
movl %esi,16(%rdi)
notl %esi
andl $30,%esi
movl %esi,32(%rdi)
...
Here, although the code size grows overall (and an extra temporary register is used), the dependency chain is reduced from 4 instructions to 3, since the first two mov instructions can be executed simultaneously - after:
...
call fpc_unicodestr_assign
movl %esi,16(%rdi)
movl $30,%eax
andn %eax,%esi,%eax
movl %eax,32(%rdi)
...
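The reason for the extra mov can be sketched in C: ANDN has no immediate-operand form, so the constant 30 has to be loaded into a temporary register first, while esi itself is left unmodified. This is an illustrative model with hypothetical names, not code from Sysutils.

```c
#include <assert.h>
#include <stdint.h>

/* Before: notl %esi ; andl $30,%esi  (the result overwrites esi) */
static uint32_t findfirst_before(uint32_t esi)
{
    esi = ~esi;    /* notl %esi     */
    esi &= 30u;    /* andl $30,%esi */
    return esi;
}

/* After: movl $30,%eax ; andn %eax,%esi,%eax
   ANDN cannot take an immediate, so the constant is first placed in a
   temporary register; esi is not written. */
static uint32_t findfirst_after(uint32_t esi)
{
    uint32_t eax = 30u;   /* movl $30,%eax       */
    eax = ~esi & eax;     /* andn %eax,%esi,%eax */
    return eax;
}

/* Both forms agree on a range of inputs plus the all-ones pattern. */
static void check_findfirst(void)
{
    for (uint32_t esi = 0; esi < 64u; esi++)
        assert(findfirst_before(esi) == findfirst_after(esi));
    assert(findfirst_before(0xFFFFFFFFu) == findfirst_after(0xFFFFFFFFu));
}
```

Because `movl $30,%eax` does not depend on esi, it can issue alongside the preceding store, which is where the shorter dependency chain comes from.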
In aasmcpu's insentry routine, there are two places where three instructions are reduced to one - before:
...
.Lj522:
...
movq 56(%rbx,%rdx,8),%rdx
movslq (%rdx),%r14
movq %r14,%rdx
notq %rdx
andq %r13,%rdx
...
After:
...
.Lj522:
...
movq 56(%rbx,%rdx,8),%rdx
movslq (%rdx),%r14
andn %r13,%r14,%rdx
...
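Why three instructions collapse into one can be seen in a small C model: the old code needs a copy of r14 because NOT and AND overwrite their operand, whereas the three-operand ANDN writes a separate destination. The names below are hypothetical and the memory load is modelled as a 32-bit function argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: movslq (%rdx),%r14 ; movq %r14,%rdx ; notq %rdx ; andq %r13,%rdx */
static uint64_t insentry_before(int32_t mem, uint64_t r13)
{
    uint64_t r14 = (uint64_t)(int64_t)mem;  /* movslq: sign-extend the load    */
    uint64_t rdx = r14;                     /* movq %r14,%rdx: copy first      */
    rdx = ~rdx;                             /* notq %rdx                       */
    rdx &= r13;                             /* andq %r13,%rdx                  */
    return rdx;
}

/* After: movslq (%rdx),%r14 ; andn %r13,%r14,%rdx
   ANDN writes its own destination, so no copy of r14 is needed. */
static uint64_t insentry_after(int32_t mem, uint64_t r13)
{
    uint64_t r14 = (uint64_t)(int64_t)mem;  /* movslq: sign-extend the load */
    return ~r14 & r13;                      /* andn %r13,%r14,%rdx          */
}

/* Both sequences agree across sign-extension edge cases. */
static void check_insentry(void)
{
    static const int32_t mems[] = { 0, 1, -1, 42, INT32_MIN, INT32_MAX };
    static const uint64_t masks[] = {
        0u, 0xFFu, UINT64_MAX, UINT64_C(0xF0F0F0F0F0F0F0F0)
    };
    for (size_t i = 0; i < sizeof mems / sizeof mems[0]; i++)
        for (size_t j = 0; j < sizeof masks / sizeof masks[0]; j++)
            assert(insentry_before(mems[i], masks[j]) ==
                   insentry_after(mems[i], masks[j]));
}
```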
In cutils' align routine, the register allocator assigns the registers slightly differently, which removes the movl %edx,%eax instruction at the start - before:
.Lc55:
movl %edx,%eax
testl %edx,%edx
setg %dl
movzbl %dl,%edx
subl %edx,%eax
testl %ecx,%ecx
jnge .Lj56
addl %eax,%ecx
.Lj56:
notl %eax
andl %ecx,%eax
.Lc56:
ret
After:
.Lc55:
testl %edx,%edx
setg %al
movzbl %al,%eax
subl %eax,%edx
testl %ecx,%ecx
jnge .Lj56
addl %edx,%ecx
.Lj56:
andn %ecx,%edx,%eax
.Lc56:
ret
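To confirm that the reshuffled register assignment preserves behaviour, both listings can be transcribed instruction by instruction into C. This is a sketch with hypothetical function names; ecx and edx stand for the incoming register values, and the branch models `testl %ecx,%ecx ; jnge` (taken when ecx is negative, since TEST clears OF).

```c
#include <assert.h>
#include <stdint.h>

static uint32_t align_before(int32_t ecx_in, int32_t edx_in)
{
    uint32_t ecx = (uint32_t)ecx_in, edx = (uint32_t)edx_in;
    uint32_t eax = edx;               /* movl %edx,%eax                 */
    edx = (edx_in > 0) ? 1u : 0u;     /* testl/setg/movzbl into %edx    */
    eax -= edx;                       /* subl %edx,%eax                 */
    if (ecx_in >= 0)                  /* testl %ecx,%ecx ; jnge .Lj56   */
        ecx += eax;                   /* addl %eax,%ecx                 */
    eax = ~eax;                       /* notl %eax                      */
    eax &= ecx;                       /* andl %ecx,%eax                 */
    return eax;
}

static uint32_t align_after(int32_t ecx_in, int32_t edx_in)
{
    uint32_t ecx = (uint32_t)ecx_in, edx = (uint32_t)edx_in;
    uint32_t eax = (edx_in > 0) ? 1u : 0u; /* testl/setg/movzbl into %eax */
    edx -= eax;                            /* subl %eax,%edx              */
    if (ecx_in >= 0)                       /* testl %ecx,%ecx ; jnge      */
        ecx += edx;                        /* addl %edx,%ecx              */
    return ~edx & ecx;                     /* andn %ecx,%edx,%eax         */
}

/* Exhaustive over a small grid of positive, zero and negative inputs. */
static void check_align(void)
{
    for (int32_t ecx = -20; ecx <= 40; ecx++)
        for (int32_t edx = -3; edx <= 16; edx++)
            assert(align_before(ecx, edx) == align_after(ecx, edx));
}
```

For example, with ecx = 13 and edx = 8 both versions return 16, and with edx = 1 the value of ecx passes through unchanged.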
The xmlutils' tdblhasharray routine - before:
...
movl %eax,%r13d
movslq 16(%r14),%rdx
movl $1,%eax
shlx %edx,%eax,%eax
leal -1(%eax),%edx
subl $1,%eax
notl %edx
andl %r13d,%edx
movslq 16(%r14),%rcx
...
After:
...
movl %eax,%r13d
movslq 16(%r14),%rdx
movl $1,%eax
shlx %edx,%eax,%eax
subl $1,%eax
andn %r13d,%eax,%edx
movslq 16(%r14),%rcx
...
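Here the saving is larger: the separate leal that kept a second copy of eax - 1 disappears, because ANDN can complement eax without destroying it. The C model below is illustrative only (hypothetical names; `count` stands for the value loaded by `movslq 16(%r14),%rdx`) and checks that both sequences leave eax = 2^count - 1 and edx = r13d with its low count bits cleared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t eax, edx; } regs32;

static regs32 hash_before(uint32_t r13d, uint32_t count)
{
    regs32 r;
    r.eax = 1u << (count & 31u); /* movl $1,%eax ; shlx %edx,%eax,%eax (count masked to 5 bits) */
    r.edx = r.eax - 1u;          /* leal -1(%eax),%edx */
    r.eax -= 1u;                 /* subl $1,%eax       */
    r.edx = ~r.edx;              /* notl %edx          */
    r.edx &= r13d;               /* andl %r13d,%edx    */
    return r;
}

static regs32 hash_after(uint32_t r13d, uint32_t count)
{
    regs32 r;
    r.eax = 1u << (count & 31u); /* movl $1,%eax ; shlx %edx,%eax,%eax */
    r.eax -= 1u;                 /* subl $1,%eax         */
    r.edx = ~r.eax & r13d;       /* andn %r13d,%eax,%edx */
    return r;
}

/* Both register results must match for every shift count. */
static void check_hash(void)
{
    static const uint32_t values[] = { 0u, 0x12345u, 0xFFFFFFFFu, 0x80000001u };
    for (size_t i = 0; i < sizeof values / sizeof values[0]; i++)
        for (uint32_t count = 0; count < 32u; count++) {
            regs32 a = hash_before(values[i], count);
            regs32 b = hash_after(values[i], count);
            assert(a.eax == b.eax && a.edx == b.edx);
        }
}
```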