[x86] STC/CLC and jump optimisations
Summary
This merge request aims to remove or simplify code that contains the STC and CLC instructions invariably generated by "in" nodes under x86. It primarily focuses on STC/CLC and their interactions with conditional jumps and SETcc instructions, redirecting, removing or converting them to MOV instructions if they're found to be determinate (that is, the state of the carry flag is known at that point).
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Code that uses sets and the "in" keyword can expect to see some performance gains.
Relevant logs and/or screenshots
A large number of files in the compiler, RTL and packages receive small improvements to both speed and size. In the ascii85 unit (x86_64-win64, -O4) - before:
...
.Lj182:
...
cmpl $20,%eax
stc
je .Lj187
clc
.Lj187:
jnc .Lj189
.Lj188:
cmpl $6,44(%rsi)
jne .Lj184
.p2align 4,,10
.p2align 3
.Lj189:
...
After - 4 instructions get collapsed into 1 and the label .Lj187 becomes dead:
...
.Lj182:
...
cmpl $20,%eax
jne .Lj189
.Lj188:
cmpl $6,44(%rsi)
jne .Lj184
.p2align 4,,10
.p2align 3
.Lj189:
...
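To see why the collapse above is safe, here is a minimal Python model of both sequences. It is only a sketch for reasoning about the flags, not compiler code; the function names are made up for illustration. The carry flag produced by cmpl is ignored because STC and CLC overwrite it before it is read.

```python
def ascii85_before(eax: int) -> str:
    """Model of the original sequence; returns the label control reaches."""
    zf = (eax == 20)   # cmpl $20,%eax sets ZF when eax = 20
    cf = True          # stc
    if not zf:         # je .Lj187 skips the clc when ZF is set
        cf = False     # clc
    # .Lj187:
    return ".Lj189" if not cf else ".Lj188"   # jnc .Lj189


def ascii85_after(eax: int) -> str:
    """Model of the optimised sequence."""
    zf = (eax == 20)   # cmpl $20,%eax
    return ".Lj189" if not zf else ".Lj188"   # jne .Lj189


# Both versions must agree for every input value tried.
assert all(ascii85_before(v) == ascii85_after(v) for v in range(256))
```

In the original, CF ends up set exactly when ZF was set, so the final jnc is just jne in disguise.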
Similarly, in the System unit - before:
...
.Lj6736:
addq $1,%rax
.Lj6737:
movzbl (%rax),%edx
subl $9,%edx
cmpl $2,%edx
jb .Lj6736
cmpl $23,%edx
stc
je .Lj6739
clc
.Lj6739:
jc .Lj6736
addl $1,%ecx
...
After:
...
.Lj6736:
addq $1,%rax
.Lj6737:
movzbl (%rax),%edx
subl $9,%edx
cmpl $2,%edx
jb .Lj6736
cmpl $23,%edx
je .Lj6736
addl $1,%ecx
...
Later in the unit, 2 for the price of 1 - before:
...
.Lj6750:
movzbl (%rax),%edx
testl %edx,%edx
stc
je .Lj6752
subl $9,%edx
cmpl $2,%edx
jb .Lj6734
cmpl $23,%edx
stc
je .Lj6752
clc
.Lj6752:
jnc .Lj6749
.Lj6734:
...
After:
...
.Lj6750:
movzbl (%rax),%edx
testl %edx,%edx
je .Lj6734
subl $9,%edx
cmpl $2,%edx
jb .Lj6734
cmpl $23,%edx
jne .Lj6749
.Lj6734:
...
Also in the System unit, a SETcc instruction is found to be determinate and this is combined with other optimisations to reduce pipeline stalls - before:
...
.Lj9916:
movzbl %r8b,%edx
testl %edx,%edx
stc
je .Lj9918
subl $2,%edx
cmpl $2,%edx
.Lj9918:
setcb %r10b
jmp .Lj9912
...
After:
...
.Lj9916:
movzbl %r8b,%edx
movb $1,%r10b
testl %edx,%edx
je .Lj9912
subl $2,%edx
cmpl $2,%edx
setcb %r10b
jmp .Lj9912
...
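The SETcc case can be checked the same way. The sketch below models the value that ends up in %r10b for both listings; the function names and the 32-bit mask constant are assumptions made for the illustration, and only the instructions shown above are modelled.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit wrap-around for subl


def setcc_before(r8b: int) -> int:
    """Value of %r10b on reaching .Lj9912 in the original code."""
    edx = r8b & 0xFF                # movzbl %r8b,%edx
    cf = True                       # stc (executed after testl)
    if edx != 0:                    # je .Lj9918 not taken
        edx = (edx - 2) & MASK32    # subl $2,%edx
        cf = edx < 2                # cmpl $2,%edx: CF = unsigned "below"
    # .Lj9918:
    return int(cf)                  # setc %r10b


def setcc_after(r8b: int) -> int:
    """Value of %r10b on reaching .Lj9912 in the optimised code."""
    edx = r8b & 0xFF                # movzbl %r8b,%edx
    r10b = 1                        # movb $1,%r10b
    if edx == 0:                    # je .Lj9912 jumps past the setc
        return r10b
    edx = (edx - 2) & MASK32        # subl $2,%edx
    r10b = int(edx < 2)             # setc %r10b
    return r10b


assert all(setcc_before(v) == setcc_after(v) for v in range(256))
```

On the path where the jump is taken, the carry flag is known to be set, so SETC would always write 1; loading that constant up front lets the jump go straight to .Lj9912.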
In the assemble unit of the compiler, a large conditional block gets lots of small optimisations - before:
...
.Lj329:
testb $128,U_$GLOBALS_$$_CURRENT_SETTINGS+66(%rip)
jne .Lj330
movzbl U_$SYSTEMS_$$_TARGET_INFO(%rip),%eax
cmpl $5,%eax
stc
je .Lj333
subl $37,%eax
cmpl $2,%eax
jb .Lj332
cmpl $3,%eax
stc
je .Lj333
cmpl $31,%eax
stc
je .Lj333
cmpl $70,%eax
stc
je .Lj333
clc
.Lj333:
jnc .Lj330
.Lj332:
...
After:
...
.Lj329:
testb $128,U_$GLOBALS_$$_CURRENT_SETTINGS+66(%rip)
jne .Lj330
movzbl U_$SYSTEMS_$$_TARGET_INFO(%rip),%eax
cmpl $5,%eax
je .Lj332
subl $37,%eax
cmpl $2,%eax
jb .Lj332
cmpl $3,%eax
je .Lj332
cmpl $31,%eax
je .Lj332
cmpl $70,%eax
jne .Lj330
.Lj332:
...
Additional notes
Some optimisations were missed because related Jcc instructions weren't simplified until pass 2. To account for this, a new value, aoc_DoPass2JccOpts, was added to TOptsToCheck. When it is set by the relevant pass 2 peephole optimisations under -O3 (since it generally requires an additional run of pass 2 to have any effect), it instructs the compiler to perform the expensive DoJumpOptimizations at the start of OptPass2Jcc, as well as running the new STC/CLC optimisations again.
One minor drawback is that the label numbers tend to change throughout the file, because the peephole optimiser creates new labels in order to optimise J(c); ...; .Lbl; J(~c) sets (it creates a new label after J(~c) and redirects J(c) to point there).
For the STC/CLC optimisations, in StrUtils - before:
...
cmpl $2,%eax
jb .Lj1732
cmpl $4,%eax
je .Lj1725
cmpl $23,%eax
je .Lj1725
clc
.Lj1732:
jc .Lj1725
.Lj1730:
movl 48(%rsp),%eax
...
After - .Lj1730 becomes a dead label and the JC instruction gets removed because it will never branch thanks to CLC (and CLC itself gets removed because the flags are deallocated):
...
cmpl $2,%eax
jb .Lj1725
cmpl $4,%eax
je .Lj1725
cmpl $23,%eax
je .Lj1725
.Lj1730:
movl 48(%rsp),%eax
...
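The reasoning that makes a carry-based jump "determinate" can be written down as a small lookup. This is a hedged sketch only: jump_outcome is a hypothetical helper, not a routine in the compiler, and it only covers the conditions that test the carry flag alone.

```python
from typing import Optional

CARRY_SET = {"c", "b", "nae"}     # conditions that branch when CF = 1
CARRY_CLEAR = {"nc", "ae", "nb"}  # conditions that branch when CF = 0


def jump_outcome(cond: str, known_cf: Optional[bool]) -> str:
    """Classify a conditional jump given what is known about CF.

    known_cf is True after STC, False after CLC, None if unknown.
    """
    if known_cf is None:
        return "unknown"
    if cond in CARRY_SET:
        return "always" if known_cf else "never"
    if cond in CARRY_CLEAR:
        return "never" if known_cf else "always"
    return "unknown"  # condition depends on other flags


print(jump_outcome("c", False))   # CLC followed by JC, as in StrUtils
```

A "never" result corresponds to the JC removed above. An "always" result means the jump behaves like an unconditional one, so anything that branched to it can go straight to its destination.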
Regarding DoJumpOptimizations, in SysUtils - before:
.section .text.n_sysutils_$$_fexpand$rawbytestring$rawbytestring$$rawbytestring,"ax"
...
.Lj894:
cmpq $1,%rax
jng .Lj896
movq -16(%rbp),%rax
movzbl (%rax),%eax
subl $65,%eax
cmpl $26,%eax
jb .Lj898 ; <-- remember that the conditions "b" and "c" check the same flags.
subl $32,%eax
cmpl $26,%eax
.Lj898:
jnc .Lj896
movq -16(%rbp),%rax
...
After:
.section .text.n_sysutils_$$_fexpand$rawbytestring$rawbytestring$$rawbytestring,"ax"
...
.Lj894:
cmpq $1,%rax
jng .Lj896
movq -16(%rbp),%rax
movzbl (%rax),%eax
subl $65,%eax
cmpl $26,%eax
jb .Lj1076
subl $32,%eax
cmpl $26,%eax
jnc .Lj896
.Lj1076:
movq -16(%rbp),%rax
...
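The J(c); ...; .Lbl; J(~c) rewrite shown in this last example can be sketched on a toy instruction list. Everything here is an assumption for illustration: the tuple representation, the function name and the condition tables are invented and do not mirror the compiler's own data structures. Only conditional jumps are considered as label references.

```python
# Map condition aliases onto one canonical name, then to its inverse.
CANON = {"b": "b", "c": "b", "nae": "b", "ae": "ae", "nc": "ae", "nb": "ae",
         "e": "e", "z": "e", "ne": "ne", "nz": "ne"}
INVERSE = {"b": "ae", "ae": "b", "e": "ne", "ne": "e"}


def redirect_inverse_jumps(code, new_label):
    """Redirect J(c) past a '.Lbl: J(~c)' pair to a new label after it."""
    out = list(code)
    for i in range(len(out) - 1):
        if out[i][0] != "label" or out[i + 1][0] != "jcc":
            continue
        cond = CANON.get(out[i + 1][1])
        if cond is None:
            continue
        label, inv = out[i][1], INVERSE[cond]
        hits = [k for k, ins in enumerate(out)
                if ins[0] == "jcc" and ins[2] == label
                and CANON.get(ins[1]) == inv]
        if not hits:
            continue
        for k in hits:                    # J(c) now targets the new label
            out[k] = ("jcc", out[k][1], new_label)
        out.insert(i + 2, ("label", new_label))
        if not any(ins[0] == "jcc" and ins[2] == label for ins in out):
            del out[i]                    # old label is dead, drop it
        return out
    return out


before = [("op", "cmpl $26,%eax"), ("jcc", "b", ".Lj898"),
          ("op", "subl $32,%eax"), ("op", "cmpl $26,%eax"),
          ("label", ".Lj898"), ("jcc", "nc", ".Lj896"),
          ("op", "movq -16(%rbp),%rax")]
after = redirect_inverse_jumps(before, ".Lj1076")
```

The rewrite is valid because a jump does not change the flags: if jb was taken, the jnc at its destination can never be taken, so control always falls through to the instruction after it.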