[AArch64 / x86 / Bug Fix] 64-bit Min/Max node support
Summary
This merge request expands the use of min/max intrinsics:
- Four new intrinsic codes added to support 64-bit operands (
min_qword
,min_int64
,max_qword
,max_int64
) - Bug fix on AArch64 where unsigned min/max constructs used signed comparisons in the resultant assembly.
- i386 now supports 32-bit min/max intrinsics as long as
CMOV
is supported (64-bit min/max nodes are not generated). - x86_64 now supports both 32-bit and 64-bit min/max intrinsics, generating
CMOV
instructions where possible. -
tinlinenode.simplify
now checks to see if the result is deterministic for min/max intrinsics, either because both inputs are constants, or because one is at the maximum range (e.g.min(x,maxint)
will always return x). This also holds true for the versions that handle special floating-point values, namely plus and minus infinity and QNaN (which always returns the second parameter even if it itself is a QNaN and the first parameter is a real value).
Note that these intrinsics don't actually correspond to min
and max
functions, but are instead codes for inlinen
nodes that are generated when the compiler comes across if (x < a) then x := a;
combinations, for example.
System
- Operating system: Windows, Linux (Ubuntu (x86) and Raspberry Pi OS (AArch64))
- Processor architecture: i386, x86_64, AArch64
- Device: Windows 10 Laptop, Raspberry Pi and others
What is the current bug behavior?
Under AArch64, unsigned min/max intrinsics use signed comparisons, resulting in erroneous behaviour for large values.
What is the behavior after applying this patch?
- AArch64 behaviour is corrected and now supports optimisation for 64-bit values.
- x86 now supports min/max optimisation (as long as
CMOV
is supported).
Testing notes
-
test/tminmax.pp
was the main source of conformance testing, especially for deterministic inputs. -
webtbs/tw2494.pp
failed at one point during development on x86_64 because of forgetting to account for the fact thatCMP
instructions can only handle signed 32-bit immediates. - The compiler itself developed a strange error where jump tables were being replaced with linear lists when compiling case blocks. This is where the bug regarding unsigned min/max using signed comparisons was revealed, because the highest unsigned value possible was being interpreted as -1 (rather than 18,446,744,073,709,551,615) in a signed 'less than' comparison, which was incorrectly returning False.
Relevant logs and/or screenshots
Changes on x86 are minimal because of the peephole optimisations in OptPass2Jcc
, although performing these min/max optimisations at node level takes a lot of strain off the peephole optimizer. Fewer labels are generated so the label numbers change in any file where the node optimisation takes place, whether or not it takes the place of the peephole optimisation.
The system
unit shows an interesting case - before:
...
# [42] if count>length(s)-index then
movzbl (%rax),%r9d
subq %rdx,%r9
cmpq %r8,%r9
jnl .Lj1319
# [43] count:=length(s)-index;
movzbl (%rax),%r9d
subq %rdx,%r9
movq %r9,%r8
.Lj1319:
...
After - it is able to shortcut the extra movzbl (%rax),%r9d; subq %rdx,%r9
pair:
...
# [43] count:=length(s)-index;
movzbl (%rax),%r9d
# [42] if count>length(s)-index then
subq %rdx,%r9
cmpq %r9,%r8
cmovnlq %r9,%r8
.Lj1311: (same position as .Lj1319 as before)
...
In the cldrxml
unit - before:
...
# [416] if (AStartPosition <= 0) then
testl %edx,%edx
jnle .Lj69
# [417] startPosition := 0
movl $0,-4(%rbp)
jmp .Lj70
.p2align 4,,10
.p2align 3
.Lj69:
# [419] startPosition := AStartPosition;
movl %edx,-4(%rbp)
.Lj70:
# [420] i := startPosition;
movl -4(%rbp),%r12d
...
After:
...
# [416] if (AStartPosition <= 0) then
xorl %eax,%eax
testl %edx,%edx
cmovgl %edx,%eax
movl %eax,-4(%rbp)
# [420] i := startPosition;
movl %eax,%r12d
...
In optutils
, a vestigial pointer deallocation is removed - before:
...
.Lj141:
movq (%rbx),%rax
movq 104(%rax),%rax
# [441] Weight:=max(plongint(arg)^,1);
movl (%rsi),%esi
movl $1,%eax
testl %esi,%esi
cmovlel %eax,%esi
...
After:
...
.Lj141:
# [441] Weight:=max(plongint(arg)^,1);
movl (%rsi),%eax
movl $1,%esi
cmpl $1,%eax
cmovgl %eax,%esi
...
In pdecvar
, a pair of instructions are simplified nicely - before:
...
# [2070] maxalignment:=max(maxalignment,unionsymtable.fieldalignment);
movsbl 140(%r14),%eax
movsbl -688(%rbp),%edx
cmpl %eax,%edx
jnge .Lj906
movsbl -688(%rbp),%edx
jmp .Lj907
.p2align 4,,10
.p2align 3
.Lj906:
movl %eax,%edx
.Lj907:
movb %dl,-688(%rbp)
# [2071] maxpadalign:=max(maxpadalign,unionsymtable.padalignment);
movsbl 141(%r14),%eax
movsbl %r15b,%edx
cmpl %eax,%edx
jnge .Lj909
movsbl %r15b,%edx
jmp .Lj910
.p2align 4,,10
.p2align 3
.Lj909:
movl %eax,%edx
.Lj910:
movb %dl,%r15b
...
After:
...
# [2070] maxalignment:=max(maxalignment,unionsymtable.fieldalignment);
movsbl 140(%r14),%ecx
movsbl -688(%rbp),%eax
movl %eax,%edx
cmpl %eax,%ecx
cmovgl %ecx,%edx
movb %dl,-688(%rbp)
# [2071] maxpadalign:=max(maxpadalign,unionsymtable.padalignment);
movsbl 141(%r14),%ecx
movsbl %r15b,%eax
movl %eax,%edx
cmpl %eax,%ecx
cmovgl %ecx,%edx
movb %dl,%r15b
...
Under aarch64-linux, -O4, even with the new CSEL
peephole optimisations, many more improvements were noted.
In the cclasses
unit, a number of conditions are corrected. For example - before:
...
// [2592] if Ablocksize<mindynamicblocksize then
movz w0,#64
cmp w0,w21
csel w0,w0,w21,gt
...
// [2652] if CurrBlockSize>FMaxBlocksize then
ldp w1,w0,[x19, #44]
cmp w0,w1
csel w0,w0,w1,lt
str w0,[x19, #44]
After:
...
// [2592] if Ablocksize<mindynamicblocksize then
movz w0,#64
cmp w0,w21
csel w0,w0,w21,hi
...
// [2652] if CurrBlockSize>FMaxBlocksize then
ldp w1,w0,[x19, #44]
cmp w0,w1
csel w0,w0,w1,lo
str w0,[x19, #44]
...
In the case of 64-bit min/max coming into play - in the ncgset
unit - before:
...
// [1306] max_dist:=min(max_dist,2048);
ldr x0,[sp, #72]
cmp x0,#2048
b.ls .Lj313
movz x0,#2048
.Lj313:
str x0,[sp, #72]
...
After - this can probably be improved slightly with some minor changes to constant generation and some changes to the AArch64 peephole optimizer:
...
// [1306] max_dist:=min(max_dist,2048);
ldr x0,[sp, #72]
movz x1,#2048
cmp x0,x1
csel x0,x0,x1,lo
str x0,[sp, #72]
...
In optcse
, a max
function gets simplified because the actual parameters evaluate to constants - before:
...
// [746] div max(sizeof(PtrUInt) div sizeof(ALUUInt),1)
movz w1,#1
movz w2,#1
cmp w1,w2
csel w1,w1,w2,gt
sxtw x1,w1
sdiv x0,x0,x1
cbnz x1,.Lj276
bl FPC_DIVBYZERO
.Lj276:
// [748] div 4
asr x1,x0,#63
add x0,x0,x1,lsr #62
asr x0,x0,#2
ubfiz x0,x0,#0,#32
mov w24,w0
...
After - the entire block gets stripped out because the function evaluates to 1, which is an identity operation for integer division, so that too gets stripped:
...
// [749] div 4 (line number is different because of the additional inline codes added on line 79)
asr x0,x1,#63
add x0,x1,x0,lsr #62
asr x0,x0,#2
ubfiz x0,x0,#0,#32
mov w24,w0
...
The same thing happens a few lines later, albeit on a less grand scale - before:
...
// [757] max_fpu_regs_assigned:=max(first_mm_imreg div 5,1);
movz w1,#6
movz w0,#1
cmp w1,w0
csel w23,w1,w0,gt
...
After:
...
// [758] max_fpu_regs_assigned:=max(first_mm_imreg div 5,1);
movz w23,#6
...
IN optbase
, a line of Pascal code gets greatly simplified, minimising the number of pointer deallocations - before:
...
// [151] SetLength(d,max(Length(s1),Length(s2)));
mov x0,x1
cbz x0,.Lj57
ldur x0,[x0, #-8]
add x0,x0,#1
.Lj57:
mov x1,x21
cbz x1,.Lj58
ldur x1,[x1, #-8]
add x1,x1,#1
.Lj58:
cmp x0,x1
b.lt .Lj60
mov x0,x20
cbz x0,.Lj62
ldur x0,[x0, #-8]
add x0,x0,#1
b .Lj62
.Lj60:
mov x1,x21
cbz x1,.Lj63
ldur x1,[x1, #-8]
add x1,x1,#1
.Lj63:
mov x0,x1
.Lj62:
str x0,[sp]
mov x3,sp
adrp x1,:got:RTTI_$OPTBASE_$$_TDFASET
ldr x1,[x1, :got_lo12:RTTI_$OPTBASE_$$_TDFASET]
mov x0,x19
movz x2,#1
bl fpc_dynarray_setlength
...
After - at .Lj58
, 11 instructions and 3 labels get collapsed into a single CSEL
instruction:
...
// [151] SetLength(d,max(Length(s1),Length(s2)));
mov x0,x1
cbz x0,.Lj57
ldur x0,[x0, #-8]
add x0,x0,#1
.Lj57:
mov x1,x21
cbz x1,.Lj58
ldur x1,[x1, #-8]
add x1,x1,#1
.Lj58:
cmp x0,x1
csel x0,x0,x1,gt
str x0,[sp]
mov x3,sp
adrp x1,:got:RTTI_$OPTBASE_$$_TDFASET
ldr x1,[x1, :got_lo12:RTTI_$OPTBASE_$$_TDFASET]
mov x0,x19
movz x2,#1
bl fpc_dynarray_setlength
...
In scandir
, a different deterministic optimisation takes place - before:
...
// [904] heapsize:=min(l,heapsize_limit);
mov w0,#2147483647
cmp w19,w0
csel w0,w19,w0,lt
adrp x1,:got:U_$GLOBALS_$$_HEAPSIZE
ldr x1,[x1, :got_lo12:U_$GLOBALS_$$_HEAPSIZE]
stur w0,[x1]
...
After - because heapsize_limit
is equal to 2,147,483,647, the maximum value for a LongInt, the function min(l,heapsize_limit);
always returns l
:
...
// [904] heapsize:=min(l,heapsize_limit);
adrp x0,:got:U_$GLOBALS_$$_HEAPSIZE
ldr x0,[x0, :got_lo12:U_$GLOBALS_$$_HEAPSIZE]
stur w19,[x0]
...
And again a few lines later - before:
...
// [911] maxheapsize:=min(l,maxheapsize_limit)
mov w0,#2147483647
cmp w19,w0
csel w0,w19,w0,lt
adrp x1,:got:U_$GLOBALS_$$_MAXHEAPSIZE
ldr x1,[x1, :got_lo12:U_$GLOBALS_$$_MAXHEAPSIZE]
stur w0,[x1]
b .Lj389
...
After:
...
// [911] maxheapsize:=min(l,maxheapsize_limit)
adrp x0,:got:U_$GLOBALS_$$_MAXHEAPSIZE
ldr x0,[x0, :got_lo12:U_$GLOBALS_$$_MAXHEAPSIZE]
stur w19,[x0]
b .Lj389
...