[x86] Merging small constants written to the stack into larger ones (!97) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:const-merges into main Dec 14, 2021

Summary

This merge request looks for adjacent, sequential writes of constants to the stack (accepts the stack pointer or the procedure's frame pointer, if different) and attempts to merge them to reduce instruction count and memory latency. Specifically, it will merge 2 adjacent bytes into a word, 2 adjacent words into a longword, and, for x86_64 only, 2 adjacent longwords into a quadword if the constant can be encoded as a 32-bit signed integer (which locks the second value to just 0 or -1).

Note that these merges are only performed if the first address is aligned to the new size. So, for example, two adjacent words will be merged if the addresses are 64(%rsp) and 66(%rsp), but not 62(%rsp) and 64(%rsp), because the latter pair is not suitably aligned for a longword.

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Sequential writing of small constants to the stack are merged into larger constants.

Notes:

Programs that make heavy use of variable-length parameter lists and sets, such as the compiler itself, greatly benefit from this optimisation. Under -O4, the size of i386-win32's compiler shrinks by 2,048 bytes, and x86_64-win64's compiler shrinks by 1,536 bytes (it also shrinks by this much under -O2).
After merging two byte-writes, the peephole optimizer will look ahead to see if two more bytes follow, since then these can be merged into a single longword. This common occurrance potentially reduces the number of times that OptPass1MOV has to be entered and, more importantly, reduce the number of iterations of pass 1 required (otherwise, the two pairs of bytes would be separately merged into words, and then the two words would be merged on the next pass, since the first word will now be behind the current instruction).
A similar thing is attempted for merging two word-writes... the peephole optimizer looks ahead for 2 adjacent writes. On x86_64, the peephole optimizer will then try to merge the two adjacent longwords (only certain combinations of constant are permissible); on i386, this merge won't be attempted on account of direct quadwords not being supported, but the look-ahead still happens since there's no real loss in doing so and it helps reduce the number of calls to OptPass1MOV.
A word-sized MOV such as "movw $0,64(%rsp)" takes 6 bytes to encode when the offset is between -128 and 127, while the longword-sized equivalent takes 7 bytes, the sign-extended quadword MOV takes 8 bytes and the byte-sized write is only 4 bytes. For offsets outside of the signed byte range, add 3 bytes to the instruction size. As a result, merging 2 byte-writes into a word saves 3 bytes, 2 word-writes into a longword saves 5 bytes and 2 longword-writes into a quadword saves 6 bytes. When combined, merging 4 byte-writes into a single longword-write saves 9 bytes and removes 3 instructions.

Relevant logs and/or screenshots

The biggest savings are in the compiler itself, especially with calls such as MatchInstruction with a long list of instructions to check against. In cgcpu, for example - before:

	je	.Lj8
	cmpl	$17104903,260(%rdx)
	jne	.Lj8
	movw	$16,48(%rsp)
	movq	$14,40(%rsp)
	movl	$0,56(%rsp)
	movw	$0,64(%rsp)
	movw	$2,66(%rsp)
	movw	$1,68(%rsp)
	movw	$8,70(%rsp)
	movw	$9,72(%rsp)
	movw	$10,74(%rsp)
	movw	$11,76(%rsp)
	movw	$3,78(%rsp)
	movw	$4,80(%rsp)
	movw	$5,82(%rsp)
	movw	$12,84(%rsp)
	movw	$13,86(%rsp)
	movw	$14,88(%rsp)
	movw	$15,90(%rsp)
	movw	$6,92(%rsp)

After:

	je	.Lj8
	cmpl	$17104903,260(%rdx)
	jne	.Lj8
	movw	$16,48(%rsp)
	movq	$14,40(%rsp)
	movl	$0,56(%rsp)
	movl	$131072,64(%rsp)
	movl	$524289,68(%rsp)
	movl	$655369,72(%rsp)
	movl	$196619,76(%rsp)
	movl	$327684,80(%rsp)
	movl	$851980,84(%rsp)
	movl	$983054,88(%rsp)
	movw	$6,92(%rsp)

RTL examples aren't as common, but for SysUtlis (x86_64-win64) - before:

.Lj7031:
	...
	movl	$0,-44(%rbp)
	movl	$0,-40(%rbp)
	movl	$0,-36(%rbp)

After:

.Lj7031:
	...
	movl	$0,-44(%rbp)
	movq	$0,-40(%rbp)

This one is a good example because the first two movl instructions are rejected on account of -44 not being on an 8-byte boundary, but -40 is, and since the second constant is zero and the first one less than 2^31, this means the merged 64-bit constant can be represented as a sign-extended 32-bit integer (the only format that you can write directly to a memory location under x86_64). Either way, two 7-byte MOV instructions are merged into a single 8-byte MOV instruction to make a saving of 1 instruction and 6 bytes.

Edited Dec 14, 2021 by J. Gareth "Kit" Moreton

[x86] Merging small constants written to the stack into larger ones

Summary

System

What is the current bug behavior?

What is the behavior after applying this patch?

Notes:

Relevant logs and/or screenshots

Merge request reports