More VS EIGEN_STRONG_INLINE and VS code performance
Submitted by neu..@..eck.de
Assigned to Nobody
Link to original bugzilla bug (#1680)
Version: 3.3 (current stable)
Operating system: Windows
Description
Short story: With every new VS version, I test whether my simulation compiled with VS runs faster than, or at least as fast as, the build compiled with clang-cl. I tested this time because I wanted to use some SVML features, but it seems I have to wait until clang-cl supports the same intrinsics to get a speed advantage.
Benchmark:
clang-cl: around 140-150 ns/step
vs2019 (without adding force inline to eigen3): around 450 ns/step
vs2019 (with adding force inline to eigen3): around 400 ns/step
vs2019 (forcing everything in the hot loop inline, as clang-cl does): around 700 ns/step
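For readers who want to try a comparison of this kind, here is a minimal sketch of a force-inlined function timed in ns/step. Everything in it is an assumption for illustration: `SKETCH_FORCE_INLINE`, `sketch_step` and `ns_per_step` are hypothetical names, the macro is not Eigen's own, and the report does not show which Eigen functions were actually changed or how the timings were taken.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical macro (not Eigen's): request forced inlining where the
// compiler offers a keyword or attribute for it, else fall back to inline.
#if defined(_MSC_VER)            // MSVC and clang-cl both define _MSC_VER
#  define SKETCH_FORCE_INLINE __forceinline
#elif defined(__GNUC__)          // GCC and Clang
#  define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
#  define SKETCH_FORCE_INLINE inline
#endif

// Hypothetical stand-in for one simulation step.
SKETCH_FORCE_INLINE double sketch_step(double x, double dt) {
    return x + dt * (1.0 - x);
}

// Calls `step` `steps` times and returns the average wall time per call
// in nanoseconds, measured with a monotonic clock.
template <typename Step>
double ns_per_step(Step&& step, std::size_t steps) {
    assert(steps > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) {
        step();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(steps);
}
```

A call such as `ns_per_step([&] { x = sketch_step(x, 0.5); }, 1000000)` would give a number comparable across compilers, provided `x` is used afterwards so the optimizer cannot drop the loop.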
What can be seen from the assembly is the following:
The vs2019 assembly is roughly 3 times as long as the clang-cl assembly.
The vs2019 assembly has 934 instructions containing the word "mov", of which at least 307 are unaligned moves (vmovupd).
clang-cl, in comparison, generates only 184 mov-like instructions, of which only 5 are vmovupd.
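Counts like these can be reproduced from an assembly listing (for example the files written by /Fa). The following is a rough sketch, not the reporter's method: `MovCounts` and `count_movs` are hypothetical names, and it assumes an MSVC-style listing where the mnemonic is the first token on a line and `;` starts a comment. Prefixed instructions such as `rep movsb` would be missed.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type: how many mov-like mnemonics were seen,
// and how many of those were exactly vmovupd.
struct MovCounts {
    std::size_t mov_like = 0;
    std::size_t vmovupd = 0;
};

// Scans an assembly listing line by line and counts mnemonics that
// contain "mov". Assumes ';' starts a comment and that the mnemonic is
// the first whitespace-separated token on the line.
inline MovCounts count_movs(std::istream& listing) {
    MovCounts counts;
    std::string line;
    while (std::getline(listing, line)) {
        const std::string::size_type comment = line.find(';');
        if (comment != std::string::npos) {
            line.erase(comment);  // drop the comment part
        }
        std::istringstream tokens(line);
        std::string mnemonic;
        if (!(tokens >> mnemonic)) {
            continue;  // blank or comment-only line
        }
        if (mnemonic.find("mov") == std::string::npos) {
            continue;  // not a mov-like instruction
        }
        ++counts.mov_like;
        if (mnemonic == "vmovupd") {
            ++counts.vmovupd;
        }
    }
    return counts;
}
```

Running it over the listings from both compilers, via a `std::ifstream`, gives the two numbers side by side.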
The questions thus are:
Why does VS generate so many mov instructions with the eigen3 library?
Why are there so many unaligned moves even though eigen3 aligns the data?
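To separate "the data is not aligned" from "the compiler chose an unaligned move anyway", the addresses can be checked at run time. This is a minimal sketch, assuming 32-byte alignment is the relevant boundary for 256-bit AVX registers. `is_aligned` is a hypothetical helper, not part of Eigen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if the address `p` is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

If a check such as `is_aligned(buffer, 32)` holds for the pointers used in the hot loop, the vmovupd instructions are a code generation choice rather than a sign of misaligned data.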
Compiler Flags:
/permissive- /MP /we"4289" /GS- /TP /W4 /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /Zc:inline /fp:fast /D "WIN32" /D "_WINDOWS" /D "NDEBUG" /D "_SILENCE_CXX17_RESULT_OF_DEPRECATION_WARNING" /D "USE_BOOST_RANDOM" /D "USE_PCG_RANDOM" /D "CMAKE_INTDIR="RelWithDebInfo"" /D "_MBCS" /fp:except- /errorReport:prompt /WX- /Zc:forScope /GR /arch:AVX /Gd /Oy /Oi /MD /std:c++17 /Fa"RelWithDebInfo/" /EHsc /Ot /diagnostics:classic /w44640 /w14242 /w14254 /w14263 /w14265 /w14287 /w14296 /w14311 /w14545 /w14546 /w14547 /w14549 /w14555 /w14619 /w14640 /w14826 /w14905 /w14906 /w14928 /bigobj /Ob3
Side Note:
My code also ran into the problem from #1365, but I just decided to switch to clang-cl ;)
I also posted the bug to VS Feedback, because it could be an optimizer bug.