Draft: [x86] Cross-jump and cross-label optimisations
Summary
This merge request contains a number of peephole optimisations for x86 platforms that work across labels and jumps, mostly by looking for CMP instructions (and the like) whose outcome is already determined by the preceding code. The result is that some conditional branches are made unconditional and some jumps are redirected straight to their final destination. In some cases, entire blocks of code are optimised out, because a dead label combined with a newly unconditional jump leaves dead code up to the next live label.
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
A large number of optimisations that reduce the number of jumps taken.
Additional notes
- The MOV/@lbl/CMP/Jcc optimisation inserts a JMP between the MOV and @lbl when Jcc will definitely jump if the program flow arrives from the MOV. It is not performed under -Os.
- Regarding the above optimisation, when Jcc will definitely NOT jump, no jump to a label after Jcc is inserted, because this would add a branch and be slower overall than simply falling through the comparison and the conditional jump (which will also be macro-fused).
- Some inefficiencies can appear under -O3 due to OptPass2JMP (which received a minor bug fix), which can cause some assignments to be duplicated. For example, in optconstprop, before:
.Lj74:
xorb %r12b,%r12b
jmp .Lj78
.Lj76:
After (this artefact appears twice in the unit):
.Lj74:
xorb %r12b,%r12b
xorb %r12b,%r12b
jmp .Lj86
.Lj76:
This doesn't get optimised out because OptPass2JMP runs in pass 2, not pass 1. It is a harmless anomaly, but I am working on solutions for it. There are also cases of "xor %reg,%reg; je .Lbl" which, while also harmless, are not ideal. These, however, are corrected by !108 (merged).
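The decision described in the MOV/@lbl/CMP/Jcc notes above can be modelled roughly as follows. This is only an illustrative Python sketch, not the compiler's code: the function names, the `optimise_for_size` flag standing in for -Os, and the returned strings are all invented here. It also assumes that the MOV loads an immediate, that the CMP compares the same register against an immediate, and that the inserted JMP targets the destination of the Jcc.

```python
# Illustrative model only; none of these names exist in the compiler.

def _signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def jcc_taken(reg_value, cmp_value, cond, bits=32):
    """Would `cmp $cmp_value,%reg; jCOND` jump when %reg holds reg_value?"""
    mask = (1 << bits) - 1
    ua, ub = reg_value & mask, cmp_value & mask      # unsigned view
    sa, sb = _signed(reg_value, bits), _signed(cmp_value, bits)  # signed view
    outcomes = {
        "e": ua == ub, "ne": ua != ub,
        "b": ua < ub, "ae": ua >= ub, "a": ua > ub, "be": ua <= ub,
        "l": sa < sb, "ge": sa >= sb, "g": sa > sb, "le": sa <= sb,
    }
    return outcomes[cond]


def plan_mov_lbl_cmp_jcc(mov_value, cmp_value, cond,
                         optimise_for_size=False, bits=32):
    """Decide what to do with `mov $imm,%reg; @lbl: cmp $imm,%reg; jCOND`."""
    if optimise_for_size:
        return "leave"        # the optimisation is skipped under -Os
    if jcc_taken(mov_value, cmp_value, cond, bits):
        return "insert_jmp"   # Jcc always jumps when arriving from the MOV
    return "leave"            # Jcc never jumps: just fall through


print(plan_mov_lbl_cmp_jcc(0, 0, "e"))   # insert_jmp
print(plan_mov_lbl_cmp_jcc(1, 0, "e"))   # leave
```

Only the "definitely jumps" case changes the code; the "definitely does not jump" case is left alone for the reason given in the second note.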
Relevant logs and/or screenshots
The aoptobj unit gets a small block removed thanks to noticing the deterministic comparisons - before:
...
je .Lj220
.Lj218:
movb $1,%al
jmp .Lj221
.Lj220:
xorb %al,%al
jmp .Lj233
.Lj221:
testb %al,%al
je .Lj233
.p2align 4,,10
.p2align 3
.Lj210:
...
After:
...
je .Lj220
.Lj218:
movb $1,%al
jmp .Lj210
.Lj220:
xorb %al,%al
je .Lj233 ; <- This is the anomalous "xor %reg,%reg; je .Lbl" I was talking about.
.p2align 4,,10
.p2align 3
.Lj210:
...
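The redirection of "jmp .Lj221" to "jmp .Lj210" in the listing above can be reasoned about with a small model. The Python sketch below is an assumption-laden illustration rather than the compiler's implementation: `thread_jump` is a hypothetical name, and it only handles a "test %reg,%reg" followed by JE or JNE where the register is known to hold a constant.

```python
# Illustrative model only; thread_jump is a made-up helper.

def thread_jump(known_value, cond, taken_label, fallthrough_label):
    """Pick the real destination of a JMP that lands on
    `test %reg,%reg; jCOND taken_label` when %reg holds known_value.

    Returns None when the condition is not one this sketch understands.
    """
    if cond == "e":
        taken = known_value == 0      # ZF is set only for a zero register
    elif cond == "ne":
        taken = known_value != 0
    else:
        return None
    return taken_label if taken else fallthrough_label


# aoptobj: movb $1,%al; jmp .Lj221 / .Lj221: testb %al,%al; je .Lj233 / .Lj210:
print(thread_jump(1, "e", ".Lj233", ".Lj210"))   # .Lj210
```

With %al known to be 1 the JE is never taken, so the jump can go directly to the fall-through label, which also leaves .Lj221 dead.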
In cclasses, several of these optimisations combine - before:
jne .Lj1126
movb $1,%bl
jmp .Lj1127
.p2align 4,,10
.p2align 3
.Lj1126:
movq %r12,%r8
call SYSTEM_$$_COMPAREBYTE$formal$formal$INT64$$INT64
testq %rax,%rax
seteb %bl
.Lj1127:
testb %bl,%bl
jne .Lj1129
movb $0,32(%rsp)
jmp .Lj1107
.p2align 4,,10
.p2align 3
.Lj1129:
After:
jne .Lj1126
movb $1,%bl
jmp .Lj1129
.p2align 4,,10
.p2align 3
.Lj1126:
movq %r12,%r8
call SYSTEM_$$_COMPAREBYTE$formal$formal$INT64$$INT64
testq %rax,%rax
je .Lj1129
movb $0,32(%rsp)
jmp .Lj1107
.p2align 4,,10
.p2align 3
.Lj1129:
Another example in fgl - before:
...
.Lj206:
addl $1,%r14d
movq 96(%rsp),%rdx
cmpq %rdx,%rdi
jne .Lj210
movb $1,%r15b
jmp .Lj211
.p2align 4,,10
.p2align 3
.Lj210:
movslq %ebp,%r8
movq %rdi,%rcx
call SYSTEM_$$_COMPAREBYTE$formal$formal$INT64$$INT64
testq %rax,%rax
seteb %r15b
.Lj211:
testb %r15b,%r15b
jne .Lj213
...
After:
...
.Lj206:
addl $1,%r14d
movq 96(%rsp),%rdx
cmpq %rdx,%rdi
jne .Lj210
movb $1,%r15b
jmp .Lj213
.p2align 4,,10
.p2align 3
.Lj210:
movslq %ebp,%r8
movq %rdi,%rcx
call SYSTEM_$$_COMPAREBYTE$formal$formal$INT64$$INT64
testq %rax,%rax
je .Lj213
...
In ncal, a bunch of assignments from the jump destination get pulled in due to OptPass2JMP - before:
...
seteb -624(%rbp)
jmp .Lj17
.Lj15:
movb $0,-624(%rbp)
.Lj17:
cmpb $0,-624(%rbp)
je .Lj19
...
.Lj19:
movq %rbx,%r15
movq %rbx,-16(%rbp)
xorl %esi,%esi
xorl %r12d,%r12d
jmp .Lj21
...
After:
...
seteb -624(%rbp)
jmp .Lj17
.Lj15:
movb $0,-624(%rbp)
movq %rbx,%r15
movq %rbx,-16(%rbp)
xorl %esi,%esi
xorl %r12d,%r12d
jmp .Lj21
.Lj17:
cmpb $0,-624(%rbp)
je .Lj19
...
.Lj19:
movq %rbx,%r15
movq %rbx,-16(%rbp)
xorl %esi,%esi
xorl %r12d,%r12d
jmp .Lj21
...
nmat has quite a few examples of the following - before:
...
cmpb $18,20(%rax)
je .Lj487
testw $32,42(%rax)
je .Lj489
.Lj487:
movb $1,%al
jmp .Lj490
.Lj489:
xorb %al,%al
jmp .Lj492
.Lj490:
testb %al,%al
je .Lj492
movq U_$SYMDEF_$$_CUNDEFINEDTYPE(%rip),%rax
movq %rax,80(%rbx)
jmp .Lj485
.p2align 4,,10
.p2align 3
.Lj492:
movq 136(%rbx),%rcx
...
After:
...
cmpb $18,20(%rax)
je .Lj487
testw $32,42(%rax)
je .Lj489
.Lj487:
movq U_$SYMDEF_$$_CUNDEFINEDTYPE(%rip),%rax
movq %rax,80(%rbx)
jmp .Lj485
.Lj489:
xorb %al,%al
movq 136(%rbx),%rcx
...
In ptype, a more unexpected optimisation happens: an ultimately pointless MOV is removed and the jump destination changes - before:
...
.Lj189:
movb $1,%dil
movb $1,%r12b
jmp .Lj173
.p2align 4,,10
.p2align 3
.Lj186:
...
After:
...
.Lj189:
movb $1,%dil
jmp .Lj170
.p2align 4,,10
.p2align 3
.Lj186:
...
sfpu128 manages to create a SET instruction from what's left! - before:
...
cmpq %rcx,%rdx
jb .Lj1575
seteb %dl
cmpq $-1,%rax
setbb %al
andb %al,%dl
je .Lj1577
.Lj1575:
movb $1,%al
jmp .Lj1578
.Lj1577:
xorb %al,%al
jmp .Lj1574
.Lj1578:
testb %al,%al
je .Lj1574
.Lj1572:
movb $1,%al
jmp .Lj1579
.Lj1574:
xorb %al,%al
.Lj1579:
movb %al,%r14b
...
After:
...
cmpq %rcx,%rdx
jb .Lj1572
seteb %dl
cmpq $-1,%rax
setbb %al
andb %al,%dl
setneb %al
jmp .Lj1579
.Lj1572:
movb $1,%al
.Lj1579:
movb %al,%r14b
...
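Once the intermediate jumps are resolved, what remains in the sfpu128 listing is a branch that leaves 1 in %al on one path and 0 on the other, which is what a SETcc instruction computes. The Python sketch below illustrates that equivalence under stated assumptions: `setcc_condition` and the `INVERSE` table are invented for this example and say nothing about how the compiler actually performs the rewrite.

```python
# Illustrative model only; these names do not come from the compiler.

# Each x86 condition code paired with its logical inverse.
INVERSE = {
    "e": "ne", "ne": "e",
    "b": "ae", "ae": "b", "a": "be", "be": "a",
    "l": "ge", "ge": "l", "g": "le", "le": "g",
    "s": "ns", "ns": "s", "o": "no", "no": "o",
}


def setcc_condition(jcc_cond, fallthrough_value, taken_value):
    """Return the SETcc condition equivalent to
    'jCOND: register = taken_value, otherwise register = fallthrough_value',
    or None when the pattern does not reduce to a single SETcc."""
    if jcc_cond not in INVERSE:
        return None
    if (fallthrough_value, taken_value) == (1, 0):
        return INVERSE[jcc_cond]   # 1 exactly when the jump is NOT taken
    if (fallthrough_value, taken_value) == (0, 1):
        return jcc_cond            # 1 exactly when the jump IS taken
    return None


# sfpu128: je leads to %al = 0, falling through leads to %al = 1
print(setcc_condition("e", 1, 0))   # ne
```

For the listing above this gives "ne", matching the "setneb %al" in the optimised output.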