J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:typeconv-strip into main Jun 02, 2022

Summary

This merge request aims to reduce the number of type conversion nodes where the source and destination types are identical or near-identical. This is mostly to help with constant propagation and other optimisation techniques that tend to get stuck with type conversions, which is also hampering the development and effectiveness of pure functions.

System

Processor architecture: All

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Constant propagation and other optimisations should be more effective as a result.

Relevant logs and/or screenshots

These relate to x86_64-win64 under -O4.

A lot of units get minor improvements here and there, especially where constant propagation and common subexpression elimination is concerned, and especially in regards to recalling values from the stack.

ascii85 unit - Before:

.Lj364:
	...
	movq	-24(%rbp),%rdx
	movl	$1024,%eax
	movq	-24(%rbp),%rdx
	movq	24(%rdx),%rdx
	subl	16(%rdx),%eax
	...

After:

.Lj364:
	...
	movq	-24(%rbp),%rdx
	movq	24(%rdx),%rdx
	movl	$1024,%eax
	subl	16(%rdx),%eax
	...

The System unit especially gets a notable one:

	...
.Lj494:
	movl	$1,%eax
	cmpb	$127,(%rcx)
	jna	.Lj496
	movb	(%rcx),%r9b
	notb	%r9b
	movzbl	%r9b,%r9d
	bsrq	%r9,%r9
	jne	.Lj497
	movl	$255,%r9d
.Lj497:
	cmpb	$1,%r9b
	setbeb	%r11b
	cmpb	$6,%r9b
	setaeb	%r10b
	orb	%r10b,%r11b
	je	.Lj499
	negq	%rax
	ret
	.p2align 4,,10
	.p2align 3
.Lj499:
	...

In the patch, the negq %rax instruction becomes movq $-1,%rax because constant propagation is able to identify that the function result in %rax is equal to 1 when it reaches the negation instruction, as seen in the corresponding Pascal code:

    result:=1;
    { multiple byte utf-8 code point? }
    if p[0]>#127 then
      begin
        { bsr searches for the leftmost 1 bit. We are interested in the
          leftmost 0 bit, so first invert the value
        }
        firstzerobit:=bsrbyte(not(byte(p[0])));
        { if there is no zero bit or the first zero bit is the rightmost bit
          (bit 0), this is an invalid UTF-8 byte ($ff cannot appear in an
          UTF-8-encoded string, and in the worst case bit 1 has to be zero)
          Additionally, 5-byte UTF-8 sequences don't exist either, so bit 1
          cannot be the first zero-bit either. And bits 6 and 7 can't be 0
          either in the first byte.
        }
        if (firstzerobit<=1) or (firstzerobit>=6)  then
          begin
            result:=-result;
            exit;
          end;
	...

This is also apparent in the node dump for the instruction - before:

<blockn resultdef="$void" pos="1150,11" flags="nf_pass1_done" complexity="255">
   <statementn pos="1151,28">
      <inlinen resultdef="$void" pos="1151,13" flags="nf_pass1_done,nf_internal" complexity="255" inlinenumber="in_neg_assign_x">
         <typeconvn resultdef="Int64" pos="1151,22" flags="nf_pass1_done,nf_absolute" complexity="0" convtype="tc_equal">
            <loadn resultdef="Int64" pos="1151,28" flags="nf_pass1_done,nf_modify" complexity="0">
               <symbol>result</symbol>
            </loadn>
         </typeconvn>
      </inlinen>
   </statementn>
   <statementn pos="1152,17">
      <exitn resultdef="$void" pos="1152,13" flags="nf_pass1_done" complexity="2">
      </exitn>
   </statementn>
</blockn>

After:

<blockn resultdef="$void" pos="1150,11" flags="nf_pass1_done" complexity="3">
   <statementn resultdef="$void" pos="1151,28" flags="nf_pass1_done" complexity="3">
      <assignn resultdef="$void" pos="1151,13" flags="nf_pass1_done" complexity="1">
         <loadn resultdef="Int64" pos="1151,19" flags="nf_pass1_done,nf_write,nf_absolute" complexity="0">
            <symbol>result</symbol>
         </loadn>
         <ordconstn resultdef="Int64" pos="1151,21" flags="nf_pass1_done" complexity="0" rangecheck="FALSE">
            <value>-1</value>
         </ordconstn>
      </assignn>
   </statementn>
   <statementn resultdef="$void" pos="1152,17" flags="nf_pass1_done" complexity="2">
      <exitn resultdef="$void" pos="1152,13" flags="nf_pass1_done" complexity="2">
      </exitn>
   </statementn>
</blockn>

The Classes unit has a few random tidbits. Before:

.Lj5169:
	cmpq	$0,U_$CLASSES_$$_NEEDRESOLVING(%rip)
	je	.Lj5171
	movq	U_$CLASSES_$$_NEEDRESOLVING(%rip),%rax
	movq	16(%rax),%rax
	movq	%rax,-16(%rbp)
	jmp	.Lj5177
	.p2align 4,,10
	.p2align 3
.Lj5176:
	movq	-16(%rbp),%rax
	movq	8(%rax),%rax
	movq	%rax,-16(%rbp)
.Lj5177:
	cmpq	$0,-16(%rbp)
	je	.Lj5171
	movq	-16(%rbp),%rax
	movq	16(%rax),%rax
	cmpq	-8(%rbp),%rax
	jne	.Lj5176
	...

This one is a net zero because a particular peephole optimization isn't performed, since a register is reused - after:

.Lj5169:
	cmpq	$0,U_$CLASSES_$$_NEEDRESOLVING(%rip)
	je	.Lj5171
	movq	U_$CLASSES_$$_NEEDRESOLVING(%rip),%rax
	movq	16(%rax),%rax
	movq	%rax,-16(%rbp)
	jmp	.Lj5177
	.p2align 4,,10
	.p2align 3
.Lj5176:
	movq	-16(%rbp),%rax
	movq	8(%rax),%rax
	movq	%rax,-16(%rbp)
.Lj5177:
	movq	-16(%rbp),%rax
	testq	%rax,%rax ; (Not optimised because %rax is reused)
	je	.Lj5171
	; Instruction removed
	movq	16(%rax),%rax
	cmpq	-8(%rbp),%rax
	jne	.Lj5176
	...

movq -16(%rbp),%rax; testq %rax,%rax doesn't get optimised into cmpq $0,-16(%rbp) because of not needing to read from -16(%rbp) multiple times. However, there is potential for future peephole optimisation because .Lj5177 only has two entry points: the instruction before it, and a single unconditional jump. Furthermore, the instruction immediately before the jump and the label are identical, namely movq %rax,-16(%rbp), so it's possible to remove both instructions and insert a copy of it immediately after the label and perform more peephole optimisations know that it's known that -16(%rbp) and %rax are identical.

Additional Notes:

One initial sticking point with removing typeconv nodes is that it got confused with generic types which resulted in a bad node tree. To help get around this, a new is_undefined utility function was introduced that checked to see if a def was of type undefineddef or pointed to such.
A call to simplify in ttypeconvnode.pass_typecheck was removed since this caused problems if the child node was of type addrn. It was largely superfluous anyway and gets called again anyway as part of the regular first pass of the node tree.
An addition was made to the XML node dump feature in that type conversion nodes now output the totypedef field if it's different to resultdef (it's omitted if they're the same, which is the most common situation).

Edited Jul 23, 2022 by J. Gareth "Kit" Moreton

Draft: [Cross-platform] Typeconv node stripping