Supposedly faster i386 int() and frac(). (!588)

Rika requested to merge runewalsh/source:if into main Feb 13, 2024

Implements int and frac for i386 by bit twiddling (zeroing the mantissa bits after the binary point) instead of changing the FPU control word.
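A minimal C sketch of the int() idea. It assumes IEEE 754 binary64 `double` rather than the 80-bit x87 extended format this MR actually handles (that format has a 15-bit exponent and a 64-bit mantissa with an explicit integer bit, so the constants differ). The name `int_bits` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Truncate x toward zero by clearing the mantissa bits that lie after the
   binary point. Sketch for IEEE 754 binary64, not the x87 80-bit format. */
static double int_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    int exp = (int)((bits >> 52) & 0x7FF) - 1023; /* unbiased exponent */
    if (exp < 0) {
        /* abs(x) < 1 (including subnormals): keep only the sign, giving +0 or -0. */
        bits &= UINT64_C(0x8000000000000000);
    } else if (exp < 52) {
        /* The low (52 - exp) mantissa bits are the fractional part: clear them. */
        bits &= ~((UINT64_C(1) << (52 - exp)) - 1);
    }
    /* exp >= 52: already an integer, or infinity/NaN; return unchanged. */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

No rounding mode is touched, which is the point of the approach: the result depends only on the exponent and a mask.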

Not as good as it sounds: my frac(x) has the same speed as before in the common case of 1 ≤ abs(x) ≤ (probably) something around High(int64). (fisttp could make it simpler and faster, but it turns out to be an SSE3 instruction. 😑)
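One way frac(x) splits into cases, sketched in C on binary64 `double` with hypothetical helper names. This is an assumption about the general shape, not a transcription of the MR's assembly: for abs(x) < 1 the result is x itself, for values that are already integral (or infinite/NaN) x - x yields +0 or NaN, and only the middle range needs the subtraction x - int(x), which is exact.

```c
#include <stdint.h>
#include <string.h>

/* Unbiased binary exponent of x (zero and subnormals report -1023). */
static int unbiased_exponent(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (int)((bits >> 52) & 0x7FF) - 1023;
}

/* Truncation valid only for 0 <= exp < 52: clear the fractional mantissa bits. */
static double trunc_mid(double x, int exp)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= ~((UINT64_C(1) << (52 - exp)) - 1);
    memcpy(&x, &bits, sizeof x);
    return x;
}

static double frac_bits(double x)
{
    int exp = unbiased_exponent(x);
    if (exp < 0)
        return x;                 /* abs(x) < 1: fast path, x is its own fraction */
    if (exp >= 52)
        return x - x;             /* integral: +0; infinity or NaN: NaN */
    return x - trunc_mid(x, exp); /* middle range: exact subtraction */
}
```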

Still, int(x) seems to be improved in every case, and frac(x) with abs(x) < 1 (which is faster) can sometimes be just as common: imagine repeatedly doing fx += step; ix += int(fx); fx := frac(fx) with step = 0.001.
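A small C sketch of that loop which counts how often frac() would receive an argument with abs(x) < 1. `count_fast_path` is a hypothetical name, and casting to `long` stands in for int()/frac() here only because fx stays small and non-negative in this example.

```c
/* Runs fx += step; ix += int(fx); fx := frac(fx) and counts the iterations
   in which frac() would take the abs(x) < 1 path. */
static long count_fast_path(double step, long iterations, long *ix_out)
{
    double fx = 0.0;
    long ix = 0, fast = 0;
    for (long i = 0; i < iterations; i++) {
        fx += step;
        if (fx < 1.0 && fx > -1.0)
            fast++;              /* frac(fx) would simply return fx */
        long whole = (long)fx;   /* int(fx): C casts truncate toward zero */
        ix += whole;
        fx -= (double)whole;     /* frac(fx) */
    }
    *ix_out = ix;
    return fast;
}
```

With step = 0.001, only the roughly one-in-a-thousand iterations where fx reaches 1 leave the abs(x) < 1 path.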

Benchmark, plus tests against the existing implementation on some internal edge cases: IntFracBenchmark.pas. Note that noise2(scale = 1) is another case that unintentionally hits the fast path of frac(x).
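The same kind of edge-case check can be sketched in C for the binary64 `int_bits` idea, using a cast to `long long` as a reference truncation for abs(x) < 2^63. The case list and names are illustrative assumptions; they are not taken from IntFracBenchmark.pas.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit-twiddling truncation on binary64 (same idea as the MR, different format). */
static double int_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    int exp = (int)((bits >> 52) & 0x7FF) - 1023;
    if (exp < 0)
        bits &= UINT64_C(0x8000000000000000);        /* abs(x) < 1 -> +-0 */
    else if (exp < 52)
        bits &= ~((UINT64_C(1) << (52 - exp)) - 1);  /* drop fraction bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Reference: below 2^63 in magnitude a cast to long long truncates exactly;
   larger finite values are already integers. */
static double int_ref(double x)
{
    if (x > -9223372036854775808.0 && x < 9223372036854775808.0)
        return (double)(long long)x;
    return x;
}

/* Returns how many edge cases disagree (-0.0 == 0.0 counts as agreement). */
static size_t count_mismatches(void)
{
    static const double cases[] = {
        0.0, 0.5, -0.5, 0.999999999999, 1.0, -1.0, 1.5, -1.5,
        4503599627370495.5,      /* 2^52 - 0.5: last values with a fraction bit */
        4503599627370496.0,      /* 2^52 */
        9007199254740994.0,      /* 2^53 + 2 */
        1e20, -1e20,
        4.9406564584124654e-324, /* smallest subnormal */
        1.7976931348623157e308   /* largest finite double */
    };
    size_t bad = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++)
        if (int_bits(cases[i]) != int_ref(cases[i]))
            bad++;
    return bad;
}
```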

My results:

Time per call          before (func1)   after (func2)

int(cos(i))               12 ns            4.1 ns
int(100 * cos(i))         12 ns            7.8 ns
int(1e20)                 22 ns            7.0 ns (*)
int(infinity)             86 ns            6.9 ns (*)

frac(cos(i))              13 ns            6.9 ns
frac(100 * cos(i))        13 ns             12 ns
frac(1e20)                21 ns            4.1 ns (*)
frac(infinity)           169 ns            4.8 ns (*)

noise2(scale = 10)       138 ns            128 ns
noise2(scale = 1)        141 ns            105 ns

__________________
(*) Cool numbers, but impractical cases, so ignore them ^^.

Not sure about the calling conventions, though, i.e. whether the input tword always resides at [esp + 4], haha...

Additionally, this makes some cosmetic changes to the x64 versions. (16-bit pextrw is SSE2 and zeroes the rest of the destination.)
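A C illustration of that pextrw property via the SSE2 intrinsic `_mm_extract_epi16`, assuming (not confirmed by the MR) that the use case is reading the sign/exponent word of a double. The helper names are hypothetical, and a portable `memcpy` fallback is included for non-SSE2 builds.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Top 16 bits of a double: sign, 11 exponent bits, 4 mantissa bits. */
static unsigned top_word(double x)
{
#if defined(__SSE2__)
    __m128i v = _mm_castpd_si128(_mm_set_sd(x));
    /* pextrw: word 3 of the low qword; the result is zero-extended,
       so no extra masking is needed for the upper bits. */
    return (unsigned)_mm_extract_epi16(v, 3);
#else
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (unsigned)(bits >> 48);
#endif
}

/* Biased exponent extracted from the top word. */
static int biased_exponent(double x)
{
    return (int)((top_word(x) >> 4) & 0x7FF);
}
```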

Edited Feb 15, 2024 by Rika
