[x86] "x and (not y)" now uses ANDN

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:andn into main

Summary

This merge request takes advantage of BMI1 (if it's enabled) to convert expressions of the form x and (not y) into the ANDN instruction at the node level.
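
As a rough illustration of the equivalence being exploited, here is a minimal C sketch (not compiler code; the function names are made up for this example). ANDN inverts one source operand and ANDs it with the other, so a NOT followed by an AND collapses into a single instruction:

```c
#include <stdint.h>

/* The two-instruction form: a separate NOT, then an AND. */
static uint64_t and_not_two_step(uint64_t x, uint64_t y)
{
    uint64_t tmp = ~y;   /* corresponds to notq */
    return tmp & x;      /* corresponds to andq */
}

/* The one-instruction form: what a single ANDN computes,
   i.e. (complement of one source) AND (the other source). */
static uint64_t and_not_one_step(uint64_t x, uint64_t y)
{
    return ~y & x;
}
```

Both functions return the same value for every input; the only difference is how many instructions are needed to express them.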

System

  • Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

When compiling with -CpCOREAVX2, expressions of the form x and (not y), and similar, are now compiled to ANDN in many situations, reducing the cycle count.

Relevant logs and/or screenshots

The align function in the System unit (x86_64-win64 under -O4) - before:

.Lc162:
	...
	jne	.Lj191
	notq	%rax
	andq	%r9,%rax
	movq	%rax,%rcx
	ret
	.p2align 4,,10
	.p2align 3
.Lj191:
	...

After:

.Lc162:
	...
	jne	.Lj191
	andn	%r9,%rax,%rax
	movq	%rax,%rcx # <-- This can be removed too; I'm working on this in a separate improvement.
	ret
	.p2align 4,,10
	.p2align 3
.Lj191:
	...

Sysutils' internalfindfirst routine has a rarer example of const and (not var); this conversion is NOT performed under -Os - before:

	...
	call	fpc_unicodestr_assign
	movl	%esi,16(%rdi)
	notl	%esi
	andl	$30,%esi
	movl	%esi,32(%rdi)
	...

Here, although the overall code size grows (and an extra temporary register is used), the dependency chain is reduced from 4 instructions to 3, since the first two mov instructions can be executed simultaneously - after:

	...
	call	fpc_unicodestr_assign
	movl	%esi,16(%rdi)
	movl	$30,%eax
	andn	%eax,%esi,%eax
	movl	%eax,32(%rdi)
	...

In aasmcpu's insentry routine, there are two examples where three instructions are reduced to one - before:

	...
.Lj522:
	...
	movq	56(%rbx,%rdx,8),%rdx
	movslq	(%rdx),%r14
	movq	%r14,%rdx
	notq	%rdx
	andq	%r13,%rdx
	...

After:

	...
.Lj522:
	...
	movq	56(%rbx,%rdx,8),%rdx
	movslq	(%rdx),%r14
	andn	%r13,%r14,%rdx
	...

In cutils' align routine, the register allocator assigns registers differently, which removes the mov %edx,%eax instruction at the start - before:

.Lc55:
	movl	%edx,%eax
	testl	%edx,%edx
	setg	%dl
	movzbl	%dl,%edx
	subl	%edx,%eax
	testl	%ecx,%ecx
	jnge	.Lj56
	addl	%eax,%ecx
.Lj56:
	notl	%eax
	andl	%ecx,%eax
.Lc56:
	ret

After:

.Lc55:
	testl	%edx,%edx
	setg	%al
	movzbl	%al,%eax
	subl	%eax,%edx
	testl	%ecx,%ecx
	jnge	.Lj56
	addl	%edx,%ecx
.Lj56:
	andn	%ecx,%edx,%eax
.Lc56:
	ret
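
Read as C, the listing above computes roughly the following. This is a transliteration of the assembler, not the routine's Pascal source; it assumes the first parameter arrives in %ecx and the second in %edx (as under the Win64 calling convention), and the names align_sketch, i and a are placeholders:

```c
#include <stdint.h>

/* Hypothetical C rendering of the listing: i is in %ecx, a is in %edx. */
static int32_t align_sketch(int32_t i, int32_t a)
{
    a -= (a > 0);   /* testl/setg/movzbl/subl: a - 1 when a is positive */
    if (i >= 0)     /* testl/jnge: the add is skipped for negative i */
        i += a;
    return i & ~a;  /* the final "and (not ...)", now a single andn */
}
```

With a = 4, for instance, the mask becomes 3 and align_sketch(5, 4) rounds 5 up to 8. The result is only a true alignment when a is a power of two; for a of 0 or 1 the mask is zero and i is returned unchanged.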

In xmlutils' tdblhasharray routine, the lea, not and and instructions are replaced by a single andn - before:

	...
	movl	%eax,%r13d
	movslq	16(%r14),%rdx
	movl	$1,%eax
	shlx	%edx,%eax,%eax
	leal	-1(%eax),%edx
	subl	$1,%eax
	notl	%edx
	andl	%r13d,%edx
	movslq	16(%r14),%rcx
	...

After:

	...
	movl	%eax,%r13d
	movslq	16(%r14),%rdx
	movl	$1,%eax
	shlx	%edx,%eax,%eax
	subl	$1,%eax
	andn	%r13d,%eax,%edx
	movslq	16(%r14),%rcx
	...
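
The pattern in this last listing is a common one: build a mask of n low bits and clear those bits in another value. A minimal C sketch of what the instructions compute, transliterated from the assembler rather than taken from the xmlutils source (the names clear_low_bits, h and n are placeholders, and what the value loaded from 16(%r14) represents is not shown in the listing):

```c
#include <stdint.h>

/* Hypothetical rendering: h stands for %r13d, n for the value from 16(%r14). */
static uint32_t clear_low_bits(uint32_t h, uint32_t n)
{
    /* movl $1 / shlx / subl $1: a mask with the low n bits set.
       A 32-bit shlx only uses the low 5 bits of the count, hence "& 31". */
    uint32_t mask = (1u << (n & 31)) - 1u;

    /* andn %r13d,%eax,%edx: h and (not mask) */
    return h & ~mask;
}
```

Because the mask stays in %eax after the subtraction, ANDN can read it directly and write the result to %edx, so the extra copy made by the lea is no longer needed.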
