frac function is slow on AMD in Linux (but fast on Intel or in Windows) (#39275)


<h3><details><summary>Original Reporter info from Mantis: <small>Artlav</small></summary><small> - **Reporter name:** Artyom </small></details></h3>

## Description:

The `frac` function is about 20 times slower on Linux on AMD CPUs (tested on Ryzen 2600, Ryzen 3600 and Threadripper 3975WX) than on Windows on the same CPUs.<br/>
On Intel CPUs it is close to equally fast on every OS.<br/>
This only happens on x86_64; when compiled for i386 there is no difference in performance.

Digging into the RTL: on Windows, `frac` uses the `fpc_frac_real` that sits outside the `FPC_HAS_TYPE_EXTENDED` ifdef (in rtl/x86_64/math.inc), which is double-precision SSE code similar to `frac_sse` in my example below.<br/>
On Linux it uses the one inside that ifdef, which is extended-precision x87 `fistpq` code.

So it comes down to this: Windows does not support the extended type and therefore gets a double SSE `frac` implementation, while Linux does support the extended type and therefore uses the extended x87 `frac` implementation.<br/>
As far as I can find, AMD's implementation of the old 80-bit FPU operations is much, much slower than Intel's.

Given that `frac` is a fairly basic function and AMD CPUs are rapidly gaining popularity, this is a rather critical issue.
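For readers who do not want to decode the assembly, here is a minimal plain-Pascal sketch of the logic the double-only path follows: return 0 once the magnitude reaches 2^52 (a double has no fractional bits left there), otherwise subtract the truncated value. The name `frac_double` is hypothetical and is not part of the RTL. Whether this avoids x87 code depends on how the compiler lowers `trunc` on the target, so treat it as an illustration of the idea, not a drop-in replacement. Unlike the assembly version, it does not special-case NaN or infinity.

```pascal
program frac_sketch;

{ Hypothetical helper, not part of the RTL: fractional part of a double
  computed as d - trunc(d). }
function frac_double(const d: double): double;
begin
  { 2^52 = 4503599627370496: from here on a double holds no fractional bits }
  if abs(d) >= 4503599627370496.0 then
    frac_double := 0.0
  else
    frac_double := d - trunc(d);
end;

begin
  writeln(frac_double(2.75):0:2);                { prints 0.75 }
  writeln(frac_double(-2.75):0:2);               { prints -0.75 }
  writeln(frac_double(9007199254740992.0):0:2);  { 2^53, prints 0.00 }
end.
```

Like the system `frac`, the result keeps the sign of the argument.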
## Steps to reproduce:

Run this code

``` pascal
//############################################################################//
{$ifdef mswindows}{$apptype console}{$endif}
program frac_tst;
//############################################################################//
function frac_sse(const d:double):double;assembler;nostackframe;
asm
  movq      %xmm0, %rax     { raw bits of d }
  shr       $48, %rax       { keep the top 16 bits: sign + exponent }
  and       $0x7ff0, %ax    { isolate the biased exponent }
  cmp       $0x4330, %ax    { exponent >= 52: no fractional bits left }
  jge       .L0
  cvttsd2si %xmm0, %rax     { truncate to a 64-bit integer }
  cvtsi2sd  %rax, %xmm4
  subsd     %xmm4, %xmm0    { d - trunc(d) }
  ret
.L0:
  xorpd     %xmm0, %xmm0    { return 0 }
end;
//############################################################################//
procedure main;
var
  x:double;
  i:longint;  { longint, since integer is only 16 bits in the default FPC mode }
begin
  write('System: ');
  x:=0;
  for i:=0 to 9999999 do x:=x+frac(i/10);
  writeln(x:3:3);

  write('Custom: ');
  x:=0;
  for i:=0 to 9999999 do x:=x+frac_sse(i/10);
  writeln(x:3:3);
end;
//############################################################################//
begin
  main;
  {$ifdef mswindows}readln;{$endif}
end.
//############################################################################//
```

Tweak the iteration count until the run time is noticeable, then compare the time between platforms and between the system and custom frac.

On Intel or on Windows both are fast.<br/>
On AMD under Linux, the system one is 10-20 times slower.

## Mantis conversion info:

- **Mantis ID:** 39275
- **Version:** 3.3.1
- **Monitored by:** » @Alexey-T1 (CudaText man)
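To put numbers on the comparison from the steps to reproduce instead of judging it by eye, the loop can be wrapped in a simple timer. This is a minimal sketch, assuming `GetTickCount64` from `SysUtils` is precise enough for the purpose (it has millisecond resolution at best). The program name and the constant `N` are made up for illustration. It times only the system `frac`; the same pattern can be put around a call to `frac_sse`.

```pascal
program frac_timing;

uses
  SysUtils;

const
  N = 10000000;  { iteration count, raise it if the run is too short to measure }

var
  x: double;
  i: longint;
  t0, elapsed: QWord;
begin
  x := 0;
  t0 := GetTickCount64;
  for i := 0 to N - 1 do
    x := x + frac(i / 10);
  elapsed := GetTickCount64 - t0;

  { Print the sum as well, so the loop result is actually used }
  writeln('System frac: ', elapsed, ' ms, sum = ', x:3:3);
end.
```

Running the same binary source on each platform and comparing the reported milliseconds gives a direct measure of the slowdown.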
