frac function is slow on AMD in Linux (but fast on Intel or in Windows)
<h3><details><summary>Original Reporter info from Mantis: <small>Artlav</small></summary><small>
- **Reporter name:** Artyom
</small></details></h3>
## Description:
frac function is about 20 times slower on Linux on AMD CPUs (tested on Ryzen 2600, Ryzen 3600 and Threadripper 3975WX) than on Windows on the same CPUs.<br/>
On Intel CPUs it's close to equally fast on every OS.<br/>
This only happens on x86_64; when compiled for i386 there is no difference in performance.
Digging into the RTL: on Windows, frac uses the fpc_frac_real that is outside the FPC_HAS_TYPE_EXTENDED ifdef (in rtl/x86_64/math.inc), which is double-precision SSE code similar to the frac_sse in my example below.<br/>
On Linux it uses the one inside that ifdef, which is extended-precision x87 code built around fistpq.
So it comes down to Windows not supporting the extended type and thus getting a double SSE frac implementation, while Linux does support the extended type and thus uses the extended x87 frac implementation.<br/>
And as far as I can find, AMD's implementation of the old 80-bit FPU operations is MUCH, MUCH slower than Intel's.
Given that frac is a fairly basic function and AMD CPUs are rapidly gaining popularity, this is a rather critical issue.
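
To check which of the two RTL paths a given target would select, a minimal sketch like the following can be compiled on each platform. It only reports whether FPC_HAS_TYPE_EXTENDED is defined and how large the extended type is; the program name is made up for illustration, and I would expect (but have not verified on every target) that Win64 reports extended as the same size as double.

``` pascal
program check_extended; // hypothetical helper, not part of the RTL
begin
{$ifdef FPC_HAS_TYPE_EXTENDED}
 writeln('FPC_HAS_TYPE_EXTENDED is defined');
{$else}
 writeln('FPC_HAS_TYPE_EXTENDED is not defined');
{$endif}
 // If extended is just an alias for double, both sizes are equal
 writeln('sizeof(extended) = ',sizeof(extended));
 writeln('sizeof(double)   = ',sizeof(double));
end.
```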
## Steps to reproduce:
Run this code:
``` pascal
//############################################################################//
{$ifdef mswindows}{$apptype console}{$endif}
program frac_tst;
//############################################################################//
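function frac_sse(const d:double):double;assembler;nostackframe;
asm
movq %xmm0, %rax          { raw bits of d }
shr $48, %rax             { keep the top 16 bits: sign, exponent, top of mantissa }
and $0x7ff0,%ax           { mask down to the 11 exponent bits }
cmp $0x4330,%ax           { exponent field >= 0x433 means |d| >= 2^52: no fractional part }
jge .L0
cvttsd2si %xmm0, %rax     { truncate toward zero to a 64-bit integer }
cvtsi2sd %rax, %xmm4      { convert the integer part back to double }
subsd %xmm4, %xmm0        { d - trunc(d) }
ret
.L0:
xorpd %xmm0, %xmm0        { result is 0 }
end;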
//############################################################################//
procedure main;
var x:double;
i:longint; // integer is only 16-bit in the default FPC mode, too small for this loop
begin
write('System: ');
x:=0;
for i:=0 to 9999999 do x:=x+frac(i/10);
writeln(x:3:3);
write('Custom: ');
x:=0;
for i:=0 to 9999999 do x:=x+frac_sse(i/10);
writeln(x:3:3);
end;
//############################################################################//
begin
main;
{$ifdef mswindows}readln;{$endif}
end.
//############################################################################//
```
Tweak the iteration count until the run time is noticeable, then compare the time taken by the system frac and the custom frac on each platform.<br/>
On Intel or on Windows, both are fast.<br/>
On AMD under Linux, the system one is 10-20 times slower.
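
To put numbers on the difference instead of judging it by eye, the loop can be wrapped in a timer. The following is a minimal sketch using GetTickCount64 from SysUtils; frac_trunc is a hypothetical pure-Pascal comparison I made up for illustration (it is not in the RTL), and whether the compiler lowers trunc on a double to SSE rather than x87 is an assumption that may depend on the FPC version. The frac_sse function from the code above can be timed the same way.

``` pascal
program frac_timing; // hypothetical name
uses sysutils;

{ Hypothetical pure-Pascal comparison, not part of the RTL.
  Values with |d| >= 2^52 have no fractional part. NaN is not handled. }
function frac_trunc(const d:double):double;
begin
 if abs(d)>=4503599627370496.0 then frac_trunc:=0
 else frac_trunc:=d-trunc(d);
end;

var x:double;
    i:longint;
    t0:qword;
begin
 // Time the RTL frac
 t0:=GetTickCount64;
 x:=0;
 for i:=0 to 9999999 do x:=x+frac(i/10);
 writeln('System: ',x:3:3,' in ',GetTickCount64-t0,' ms');

 // Time the comparison function over the same inputs
 t0:=GetTickCount64;
 x:=0;
 for i:=0 to 9999999 do x:=x+frac_trunc(i/10);
 writeln('Pascal: ',x:3:3,' in ',GetTickCount64-t0,' ms');
end.
```

GetTickCount64 only has millisecond resolution, so the iteration count should be large enough for each loop to run for at least a few hundred milliseconds.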
## Mantis conversion info:
- **Mantis ID:** 39275
- **Version:** 3.3.1
- **Monitored by:** » @Alexey-T1 (CudaText man)