[AArch64 / x86 / Bug Fix] 64-bit Min/Max node support (!624) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:min-max-64 into main Mar 17, 2024

Summary

This merge request expands the use of min/max intrinsics:

Four new intrinsic codes added to support 64-bit operands (min_qword, min_int64, max_qword, max_int64)
Bug fix on AArch64 where unsigned min/max constructs used signed comparisons in the resultant assembly.
i386 now supports 32-bit min/max intrinsics as long as CMOV is supported (64-bit min/max nodes are not generated).
x86_64 now supports both 32-bit and 64-bit min/max intrinsics, generating CMOV instructions where possible.
tinlinenode.simplify now checks to see if the result is deterministic for min/max intrinsics, either because both inputs are constants, or because one is at the maximum range (e.g. min(x,maxint) will always return x). This also holds true for the versions that handle special floating-point values, namely plus and minus infinity and QNaN (which always returns the second parameter even if it itself is a QNaN and the first parameter is a real value).

Note that these intrinsics don't actually correspond to min and max functions, but are instead codes for inlinen nodes that are generated when the compiler comes across if (x < a) then x := a; combinations, for example.

System

Operating system: Windows, Linux (Ubuntu (x86) and Raspberry Pi OS (AArch64))
Processor architecture: i386, x86_64, AArch64
Device: Windows 10 Laptop, Raspberry Pi and others

What is the current bug behavior?

Under AArch64, unsigned min/max intrinsics use signed comparisons, resulting in erroneous behaviour for large values.

What is the behavior after applying this patch?

AArch64 behaviour is corrected and now supports optimisation for 64-bit values.
x86 now supports min/max optimisation (as long as CMOV is supported).

Testing notes

test/tminmax.pp was the main source of conformance testing, especially for deterministic inputs.
webtbs/tw2494.pp failed at one point during development on x86_64 because of forgetting to account for the fact that CMP instructions can only handle signed 32-bit immediates.
The compiler itself developed a strange error where jump tables were being replaced with linear lists when compiling case blocks. This is where the bug regarding unsigned min/max using signed comparisons was revealed, because the highest unsigned value possible was being interpreted as -1 (rather than 18,446,744,073,709,551,615) in a signed 'less than' comparison, which was incorrectly returning False.

Relevant logs and/or screenshots

Changes on x86 are minimal because of the peephole optimisations in OptPass2Jcc, although performing these min/max optimisations at node level takes a lot of strain off the peephole optimizer. Fewer labels are generated so the label numbers change in any file where the node optimisation takes place, whether or not it takes the place of the peephole optimisation.

The system unit shows an interesting case - before:

	...
# [42] if count>length(s)-index then
	movzbl	(%rax),%r9d
	subq	%rdx,%r9
	cmpq	%r8,%r9
	jnl	.Lj1319
# [43] count:=length(s)-index;
	movzbl	(%rax),%r9d
	subq	%rdx,%r9
	movq	%r9,%r8
.Lj1319:
	...

After - it is able to shortcut the extra movzbl (%rax),%r9d; subq %rdx,%r9 pair:

	...
# [43] count:=length(s)-index;
	movzbl	(%rax),%r9d
# [42] if count>length(s)-index then
	subq	%rdx,%r9
	cmpq	%r9,%r8
	cmovnlq	%r9,%r8
.Lj1311: (same position as .Lj1319 as before)
	...

In the cldrxml unit - before:

	...
# [416] if (AStartPosition <= 0) then
	testl	%edx,%edx
	jnle	.Lj69
# [417] startPosition := 0
	movl	$0,-4(%rbp)
	jmp	.Lj70
	.p2align 4,,10
	.p2align 3
.Lj69:
# [419] startPosition := AStartPosition;
	movl	%edx,-4(%rbp)
.Lj70:
# [420] i := startPosition;
	movl	-4(%rbp),%r12d
	...

After:

	...
# [416] if (AStartPosition <= 0) then
	xorl	%eax,%eax
	testl	%edx,%edx
	cmovgl	%edx,%eax
	movl	%eax,-4(%rbp)
# [420] i := startPosition;
	movl	%eax,%r12d
	...

In optutils, a vestigial pointer deallocation is removed - before:

	...
.Lj141:
	movq	(%rbx),%rax
	movq	104(%rax),%rax
# [441] Weight:=max(plongint(arg)^,1);
	movl	(%rsi),%esi
	movl	$1,%eax
	testl	%esi,%esi
	cmovlel	%eax,%esi
	...

After:

	...
.Lj141:
# [441] Weight:=max(plongint(arg)^,1);
	movl	(%rsi),%eax
	movl	$1,%esi
	cmpl	$1,%eax
	cmovgl	%eax,%esi
	...

In pdecvar, a pair of instructions are simplified nicely - before:

	...
# [2070] maxalignment:=max(maxalignment,unionsymtable.fieldalignment);
	movsbl	140(%r14),%eax
	movsbl	-688(%rbp),%edx
	cmpl	%eax,%edx
	jnge	.Lj906
	movsbl	-688(%rbp),%edx
	jmp	.Lj907
	.p2align 4,,10
	.p2align 3
.Lj906:
	movl	%eax,%edx
.Lj907:
	movb	%dl,-688(%rbp)
# [2071] maxpadalign:=max(maxpadalign,unionsymtable.padalignment);
	movsbl	141(%r14),%eax
	movsbl	%r15b,%edx
	cmpl	%eax,%edx
	jnge	.Lj909
	movsbl	%r15b,%edx
	jmp	.Lj910
	.p2align 4,,10
	.p2align 3
.Lj909:
	movl	%eax,%edx
.Lj910:
	movb	%dl,%r15b
	...

After:

	...
# [2070] maxalignment:=max(maxalignment,unionsymtable.fieldalignment);
	movsbl	140(%r14),%ecx
	movsbl	-688(%rbp),%eax
	movl	%eax,%edx
	cmpl	%eax,%ecx
	cmovgl	%ecx,%edx
	movb	%dl,-688(%rbp)
# [2071] maxpadalign:=max(maxpadalign,unionsymtable.padalignment);
	movsbl	141(%r14),%ecx
	movsbl	%r15b,%eax
	movl	%eax,%edx
	cmpl	%eax,%ecx
	cmovgl	%ecx,%edx
	movb	%dl,%r15b
	...

Under aarch64-linux, -O4, even with the new CSEL peephole optimisations, many more improvements were noted.

In the cclasses unit, a number of conditions are corrected. For example - before:

	...
// [2592] if Ablocksize<mindynamicblocksize then
	movz	w0,#64
	cmp	w0,w21
	csel	w0,w0,w21,gt
	...
// [2652] if CurrBlockSize>FMaxBlocksize then
	ldp	w1,w0,[x19, #44]
	cmp	w0,w1
	csel	w0,w0,w1,lt
	str	w0,[x19, #44]

After:

	...
// [2592] if Ablocksize<mindynamicblocksize then
	movz	w0,#64
	cmp	w0,w21
	csel	w0,w0,w21,hi
	...
// [2652] if CurrBlockSize>FMaxBlocksize then
	ldp	w1,w0,[x19, #44]
	cmp	w0,w1
	csel	w0,w0,w1,lo
	str	w0,[x19, #44]
	...

In the case of 64-bit min/max coming into play - in the ncgset unit - before:

	...
// [1306] max_dist:=min(max_dist,2048);
	ldr	x0,[sp, #72]
	cmp	x0,#2048
	b.ls	.Lj313
	movz	x0,#2048
.Lj313:
	str	x0,[sp, #72]
	...

After - this can probably be improved slightly with some minor changes to constant generation and some changes to the AArch64 peephole optimizer:

	...
// [1306] max_dist:=min(max_dist,2048);
	ldr	x0,[sp, #72]
	movz	x1,#2048
	cmp	x0,x1
	csel	x0,x0,x1,lo
	str	x0,[sp, #72]
	...

In optcse, a max function gets simplified because the actual parameters evaluate to constants - before:

	...
// [746] div max(sizeof(PtrUInt) div sizeof(ALUUInt),1)
	movz	w1,#1
	movz	w2,#1
	cmp	w1,w2
	csel	w1,w1,w2,gt
	sxtw	x1,w1
	sdiv	x0,x0,x1
	cbnz	x1,.Lj276
	bl	FPC_DIVBYZERO
.Lj276:
// [748] div 4
	asr	x1,x0,#63
	add	x0,x0,x1,lsr #62
	asr	x0,x0,#2
	ubfiz	x0,x0,#0,#32
	mov	w24,w0
	...

After - the entire block gets stripped out because the function evaluates to 1, which is an identity operation for integer division, so that too gets stripped:

	...
// [749] div 4 (line number is different because of the additional inline codes added on line 79)
	asr	x0,x1,#63
	add	x0,x1,x0,lsr #62
	asr	x0,x0,#2
	ubfiz	x0,x0,#0,#32
	mov	w24,w0
	...

The same thing happens a few lines later, albeit on a less grand scale - before:

	...
// [757] max_fpu_regs_assigned:=max(first_mm_imreg div 5,1);
	movz	w1,#6
	movz	w0,#1
	cmp	w1,w0
	csel	w23,w1,w0,gt
	...

After:

	...
// [758] max_fpu_regs_assigned:=max(first_mm_imreg div 5,1);
	movz	w23,#6
	...

IN optbase, a line of Pascal code gets greatly simplified, minimising the number of pointer deallocations - before:

	...
// [151] SetLength(d,max(Length(s1),Length(s2)));
	mov	x0,x1
	cbz	x0,.Lj57
	ldur	x0,[x0, #-8]
	add	x0,x0,#1
.Lj57:
	mov	x1,x21
	cbz	x1,.Lj58
	ldur	x1,[x1, #-8]
	add	x1,x1,#1
.Lj58:
	cmp	x0,x1
	b.lt	.Lj60
	mov	x0,x20
	cbz	x0,.Lj62
	ldur	x0,[x0, #-8]
	add	x0,x0,#1
	b	.Lj62
.Lj60:
	mov	x1,x21
	cbz	x1,.Lj63
	ldur	x1,[x1, #-8]
	add	x1,x1,#1
.Lj63:
	mov	x0,x1
.Lj62:
	str	x0,[sp]
	mov	x3,sp
	adrp	x1,:got:RTTI_$OPTBASE_$$_TDFASET
	ldr	x1,[x1, :got_lo12:RTTI_$OPTBASE_$$_TDFASET]
	mov	x0,x19
	movz	x2,#1
	bl	fpc_dynarray_setlength
	...

After - at .Lj58, 11 instructions and 3 labels get collapsed into a single CSEL instruction:

	...
// [151] SetLength(d,max(Length(s1),Length(s2)));
	mov	x0,x1
	cbz	x0,.Lj57
	ldur	x0,[x0, #-8]
	add	x0,x0,#1
.Lj57:
	mov	x1,x21
	cbz	x1,.Lj58
	ldur	x1,[x1, #-8]
	add	x1,x1,#1
.Lj58:
	cmp	x0,x1
	csel	x0,x0,x1,gt
	str	x0,[sp]
	mov	x3,sp
	adrp	x1,:got:RTTI_$OPTBASE_$$_TDFASET
	ldr	x1,[x1, :got_lo12:RTTI_$OPTBASE_$$_TDFASET]
	mov	x0,x19
	movz	x2,#1
	bl	fpc_dynarray_setlength
	...

In scandir, a different deterministic optimisation takes place - before:

	...
// [904] heapsize:=min(l,heapsize_limit);
	mov	w0,#2147483647
	cmp	w19,w0
	csel	w0,w19,w0,lt
	adrp	x1,:got:U_$GLOBALS_$$_HEAPSIZE
	ldr	x1,[x1, :got_lo12:U_$GLOBALS_$$_HEAPSIZE]
	stur	w0,[x1]
	...

After - because heapsize_limit is equal to 2,147,483,647, the maximum value for a LongInt, the function min(l,heapsize_limit); always returns l:

	...
// [904] heapsize:=min(l,heapsize_limit);
	adrp	x0,:got:U_$GLOBALS_$$_HEAPSIZE
	ldr	x0,[x0, :got_lo12:U_$GLOBALS_$$_HEAPSIZE]
	stur	w19,[x0]
	...

And again a few lines later - before:

	...
// [911] maxheapsize:=min(l,maxheapsize_limit)
	mov	w0,#2147483647
	cmp	w19,w0
	csel	w0,w19,w0,lt
	adrp	x1,:got:U_$GLOBALS_$$_MAXHEAPSIZE
	ldr	x1,[x1, :got_lo12:U_$GLOBALS_$$_MAXHEAPSIZE]
	stur	w0,[x1]
	b	.Lj389
	...

After:

	...
// [911] maxheapsize:=min(l,maxheapsize_limit)
	adrp	x0,:got:U_$GLOBALS_$$_MAXHEAPSIZE
	ldr	x0,[x0, :got_lo12:U_$GLOBALS_$$_MAXHEAPSIZE]
	stur	w19,[x0]
	b	.Lj389
	...

[AArch64 / x86 / Bug Fix] 64-bit Min/Max node support

Summary

System

What is the current bug behavior?

What is the behavior after applying this patch?

Testing notes

Relevant logs and/or screenshots

Merge request reports