[x86] STC/CLC and jump optimisations

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:stc-jmp into main

Summary

This merge request removes or simplifies code containing the STC and CLC instructions that "in" nodes invariably generate under x86. It focuses on how STC/CLC interact with conditional jumps and SETcc instructions: when the outcome of such an instruction is found to be determinate, it is redirected, removed or converted to a MOV instruction.

System

  • Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Code that uses sets and the "in" keyword can expect to see some performance gains.

Relevant logs and/or screenshots

A large number of files in the compiler, RTL and packages receive small improvements to both speed and size. In the ascii85 unit (x86_64-win64, -O4) - before:

	...
.Lj182:
	...
	cmpl	$20,%eax
	stc
	je	.Lj187
	clc
.Lj187:
	jnc	.Lj189
.Lj188:
	cmpl	$6,44(%rsi)
	jne	.Lj184
	.p2align 4,,10
	.p2align 3
.Lj189:
	...

After - 4 instructions get collapsed into 1 and the label .Lj187 becomes dead:

	...
.Lj182:
	...
	cmpl	$20,%eax
	jne	.Lj189
.Lj188:
	cmpl	$6,44(%rsi)
	jne	.Lj184
	.p2align 4,,10
	.p2align 3
.Lj189:
	...
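
The collapse above can be checked exhaustively, since only the zero flag left by the CMP decides the outcome. The following is a minimal Python sketch and not compiler code; the function names and the string return values are made up for illustration. It tracks the carry flag through the original four instructions and compares the branch target with that of the single JNE.

```python
def before_ascii85(zf: bool) -> str:
    """Model of: stc; je .Lj187; clc; .Lj187: jnc .Lj189"""
    cf = True                 # stc sets the carry flag
    if not zf:                # je .Lj187 not taken, so clc executes
        cf = False            # clc clears the carry flag
    # .Lj187:
    return ".Lj189" if not cf else ".Lj188"   # jnc .Lj189, else fall through


def after_ascii85(zf: bool) -> str:
    """Model of: jne .Lj189"""
    return ".Lj189" if not zf else ".Lj188"


def ascii85_sequences_agree() -> bool:
    # Only two states matter: CMP found the operands equal (ZF set) or not.
    return all(before_ascii85(zf) == after_ascii85(zf) for zf in (False, True))
```

Both versions branch to .Lj189 exactly when the comparison was not equal, which is why the carry flag never needs to be materialised.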

Similarly in the System unit - before:

	...
.Lj6736:
	addq	$1,%rax
.Lj6737:
	movzbl	(%rax),%edx
	subl	$9,%edx
	cmpl	$2,%edx
	jb	.Lj6736
	cmpl	$23,%edx
	stc
	je	.Lj6739
	clc
.Lj6739:
	jc	.Lj6736
	addl	$1,%ecx
	...

After:

	...
.Lj6736:
	addq	$1,%rax
.Lj6737:
	movzbl	(%rax),%edx
	subl	$9,%edx
	cmpl	$2,%edx
	jb	.Lj6736
	cmpl	$23,%edx
	je	.Lj6736
	addl	$1,%ecx
	...

Later in the unit, 2 for the price of 1 - before:

	...
.Lj6750:
	movzbl	(%rax),%edx
	testl	%edx,%edx
	stc
	je	.Lj6752
	subl	$9,%edx
	cmpl	$2,%edx
	jb	.Lj6734
	cmpl	$23,%edx
	stc
	je	.Lj6752
	clc
.Lj6752:
	jnc	.Lj6749
.Lj6734:
	...

After:

	...
.Lj6750:
	movzbl	(%rax),%edx
	testl	%edx,%edx
	je	.Lj6734
	subl	$9,%edx
	cmpl	$2,%edx
	jb	.Lj6734
	cmpl	$23,%edx
	jne	.Lj6749
.Lj6734:
	...

Also in the System unit, a SETcc instruction is found to be determinate and this is combined with other optimisations to reduce pipeline stalls - before:

	...
.Lj9916:
	movzbl	%r8b,%edx
	testl	%edx,%edx
	stc
	je	.Lj9918
	subl	$2,%edx
	cmpl	$2,%edx
.Lj9918:
	setcb	%r10b
	jmp	.Lj9912
	...

After:

	...
.Lj9916:
	movzbl	%r8b,%edx
	movb	$1,%r10b
	testl	%edx,%edx
	je	.Lj9912
	subl	$2,%edx
	cmpl	$2,%edx
	setcb	%r10b
	jmp	.Lj9912
	...
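
To see why hoisting a MOV of 1 ahead of the TEST is safe, the two listings can be modelled over every possible byte in %r8b. This is a rough Python sketch under the assumption that only the value finally left in %r10b matters; the function names are hypothetical and 32-bit wraparound is emulated with a mask.

```python
MASK32 = 0xFFFFFFFF


def setcc_before(r8b: int) -> int:
    edx = r8b & 0xFF                  # movzbl %r8b,%edx
    if edx == 0:                      # testl %edx,%edx; stc; je .Lj9918
        cf = True                     # carry was forced on by stc
    else:
        edx = (edx - 2) & MASK32      # subl $2,%edx
        cf = edx < 2                  # cmpl $2,%edx: CF = unsigned edx < 2
    return int(cf)                    # .Lj9918: setc into %r10b


def setcc_after(r8b: int) -> int:
    edx = r8b & 0xFF                  # movzbl %r8b,%edx
    r10b = 1                          # movb $1,%r10b (MOV leaves flags alone)
    if edx == 0:                      # testl %edx,%edx; je .Lj9912
        return r10b
    edx = (edx - 2) & MASK32          # subl $2,%edx
    return int(edx < 2)               # cmpl $2,%edx; setc into %r10b


def setcc_sequences_agree() -> bool:
    return all(setcc_before(b) == setcc_after(b) for b in range(256))
```

On the path where the JE is taken, the carry flag is known to be set, so the SETcc result there is always 1; that is the determinate case that becomes a MOV.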

In the assemble unit of the compiler, a large conditional block gets lots of small optimisations - before:

	...
.Lj329:
	testb	$128,U_$GLOBALS_$$_CURRENT_SETTINGS+66(%rip)
	jne	.Lj330
	movzbl	U_$SYSTEMS_$$_TARGET_INFO(%rip),%eax
	cmpl	$5,%eax
	stc
	je	.Lj333
	subl	$37,%eax
	cmpl	$2,%eax
	jb	.Lj332
	cmpl	$3,%eax
	stc
	je	.Lj333
	cmpl	$31,%eax
	stc
	je	.Lj333
	cmpl	$70,%eax
	stc
	je	.Lj333
	clc
.Lj333:
	jnc	.Lj330
.Lj332:
	...

After:

	...
.Lj329:
	testb	$128,U_$GLOBALS_$$_CURRENT_SETTINGS+66(%rip)
	jne	.Lj330
	movzbl	U_$SYSTEMS_$$_TARGET_INFO(%rip),%eax
	cmpl	$5,%eax
	je	.Lj332
	subl	$37,%eax
	cmpl	$2,%eax
	jb	.Lj332
	cmpl	$3,%eax
	je	.Lj332
	cmpl	$31,%eax
	je	.Lj332
	cmpl	$70,%eax
	jne	.Lj330
.Lj332:
	...
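
The same reasoning extends to the longer chain. Below is an illustrative Python model of the part of both listings that follows the MOVZBL (the TESTB/JNE preamble is left out); the function names are assumptions, and the member values are read off the constants in the assembly rather than taken from the Pascal source.

```python
MASK32 = 0xFFFFFFFF


def assemble_before(target: int) -> str:
    eax = target & 0xFF                   # movzbl ...,%eax
    if eax == 5:                          # cmpl $5; stc; je .Lj333
        cf = True
    else:
        eax = (eax - 37) & MASK32         # subl $37,%eax
        if eax < 2:                       # cmpl $2; jb .Lj332
            return ".Lj332"
        # each match runs stc; je .Lj333, a total miss reaches clc
        cf = eax in (3, 31, 70)
    return ".Lj330" if not cf else ".Lj332"   # .Lj333: jnc .Lj330


def assemble_after(target: int) -> str:
    eax = target & 0xFF                   # movzbl ...,%eax
    if eax == 5:                          # cmpl $5; je .Lj332
        return ".Lj332"
    eax = (eax - 37) & MASK32             # subl $37,%eax
    if eax < 2 or eax in (3, 31):         # jb .Lj332; je .Lj332; je .Lj332
        return ".Lj332"
    return ".Lj330" if eax != 70 else ".Lj332"   # jne .Lj330


def assemble_sequences_agree() -> bool:
    return all(assemble_before(t) == assemble_after(t) for t in range(256))


def assemble_members() -> list:
    # Byte values that reach .Lj332, i.e. the implied set members.
    return [t for t in range(256) if assemble_after(t) == ".Lj332"]
```

Every STC/JE pair turns into a direct JE to .Lj332, and the final STC/JE/CLC/JNC group inverts into a single JNE, so the membership result is unchanged.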

Additional notes

Some optimisations were missed because related Jcc instructions weren't simplified until pass 2. To account for this, a new value, aoc_DoPass2JccOpts, was added to TOptsToCheck. The relevant pass 2 peephole optimisations set it under -O3 only, since it generally requires an additional run of pass 2 to have any effect. When set, it instructs the compiler to perform the expensive DoJumpOptimizations at the start of OptPass2Jcc, and of course to run the new STC/CLC optimisations again.

One minor drawback is that label numbers tend to change throughout the file, because the peephole optimizer creates new labels in order to optimise J(c); ...; .Lbl; J(~c) sets: it creates a new label after J(~c) and redirects J(c) to point there.

For the STC/CLC optimisations, in StrUtils - before:

	...
	cmpl	$2,%eax
	jb	.Lj1732
	cmpl	$4,%eax
	je	.Lj1725
	cmpl	$23,%eax
	je	.Lj1725
	clc
.Lj1732:
	jc	.Lj1725
.Lj1730:
	movl	48(%rsp),%eax
	...

After - the JB is redirected straight to .Lj1725, so .Lj1732 becomes a dead label, and the JC instruction gets removed because it will never branch thanks to CLC (and CLC itself gets removed because the flags are deallocated):

	...
	cmpl	$2,%eax
	jb	.Lj1725
	cmpl	$4,%eax
	je	.Lj1725
	cmpl	$23,%eax
	je	.Lj1725
.Lj1730:
	movl	48(%rsp),%eax
	...

Regarding DoJumpOptimizations, in SysUtils - before:

.section .text.n_sysutils_$$_fexpand$rawbytestring$rawbytestring$$rawbytestring,"ax"
	...
.Lj894:
	cmpq	$1,%rax
	jng	.Lj896
	movq	-16(%rbp),%rax
	movzbl	(%rax),%eax
	subl	$65,%eax
	cmpl	$26,%eax
	jb	.Lj898 # <-- remember that the conditions "b" and "c" check the same flags.
	subl	$32,%eax
	cmpl	$26,%eax
.Lj898:
	jnc	.Lj896
	movq	-16(%rbp),%rax
	...

After:

.section .text.n_sysutils_$$_fexpand$rawbytestring$rawbytestring$$rawbytestring,"ax"
	...
.Lj894:
	cmpq	$1,%rax
	jng	.Lj896
	movq	-16(%rbp),%rax
	movzbl	(%rax),%eax
	subl	$65,%eax
	cmpl	$26,%eax
	jb	.Lj1076
	subl	$32,%eax
	cmpl	$26,%eax
	jnc	.Lj896
.Lj1076:
	movq	-16(%rbp),%rax
	...
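
The redirect to the new label .Lj1076 can be sanity-checked in the same way. This is a hedged Python sketch of the instructions after the MOVZBL only (the CMPQ/JNG preamble is omitted); the function names and the "fallthrough" marker are illustrative inventions.

```python
MASK32 = 0xFFFFFFFF


def fexpand_before(byte: int) -> str:
    eax = ((byte & 0xFF) - 65) & MASK32   # subl $65,%eax
    cf = eax < 26                         # cmpl $26,%eax
    if not cf:                            # jb .Lj898 not taken
        eax = (eax - 32) & MASK32         # subl $32,%eax
        cf = eax < 26                     # cmpl $26,%eax
    # .Lj898: jnc .Lj896 (reached with CF set when jb was taken)
    return ".Lj896" if not cf else "fallthrough"


def fexpand_after(byte: int) -> str:
    eax = ((byte & 0xFF) - 65) & MASK32   # subl $65,%eax
    if eax < 26:                          # cmpl $26; jb .Lj1076
        return "fallthrough"              # .Lj1076 sits right after the jnc
    eax = (eax - 32) & MASK32             # subl $32,%eax
    return ".Lj896" if not eax < 26 else "fallthrough"   # jnc .Lj896


def fexpand_sequences_agree() -> bool:
    return all(fexpand_before(b) == fexpand_after(b) for b in range(256))
```

Because JB and JC test the same flag, a JB that lands on a JNC can never take the second jump, so it may skip it and target the instruction after it. In this model both versions fall through exactly for the bytes of ASCII letters.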
