Fix RowMajor performance for triangular/dense assignment

Summary

  • Fixes #3031
  • Rewrite the dynamic triangular_assignment_loop to iterate in storage order (outer/inner loops follow the layout) instead of always using outer=col, inner=row
  • Memory access is now contiguous for both ColMajor and RowMajor storage, which removes a 5.5x-137x RowMajor slowdown (see benchmarks below) while keeping ColMajor at parity
  • Use constexpr row()/col() helpers that constant-fold at compile time, so they add no runtime overhead
  • Keep simple scalar loops so GCC recognizes memcpy/memset idioms and Clang auto-vectorizes
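
A minimal sketch of the storage-order idea described above, not the actual Eigen code. The names Layout, upper_to_dense and matches_upper are hypothetical, the row()/col() helpers here only illustrate the index mapping, and the sketch handles only a square "copy upper triangle, zero the rest" case on a flat buffer. The outer loop walks one contiguous row (RowMajor) or column (ColMajor) at a time, and each inner segment is a plain scalar loop of the kind a compiler can turn into memset/memcpy or vectorize.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout tag; Eigen's real code uses its own storage-order flags.
enum class Layout { ColMajor, RowMajor };

// Illustrative helpers: map (outer, inner) loop indices to (row, col).
// The layout is a template parameter, so these fold to a plain index.
template <Layout L>
constexpr std::size_t row(std::size_t outer, std::size_t inner) {
  return L == Layout::RowMajor ? outer : inner;
}
template <Layout L>
constexpr std::size_t col(std::size_t outer, std::size_t inner) {
  return L == Layout::RowMajor ? inner : outer;
}

// Copy the upper triangle (row <= col) of an n x n matrix into dst and zero
// the strictly lower part, always walking memory in storage order.
template <Layout L>
void upper_to_dense(const std::vector<float>& src, std::vector<float>& dst,
                    std::size_t n) {
  assert(src.size() == n * n && dst.size() == n * n);
  for (std::size_t outer = 0; outer < n; ++outer) {
    const float* s = src.data() + outer * n;  // one contiguous row or column
    float* d = dst.data() + outer * n;
    // RowMajor: outer is the row, kept entries are inner >= outer.
    // ColMajor: outer is the column, kept entries are inner <= outer.
    const std::size_t keep_begin = (L == Layout::RowMajor) ? outer : 0;
    const std::size_t keep_end = (L == Layout::RowMajor) ? n : outer + 1;
    for (std::size_t i = 0; i < keep_begin; ++i) d[i] = 0.0f;        // zero fill
    for (std::size_t i = keep_begin; i < keep_end; ++i) d[i] = s[i]; // copy
    for (std::size_t i = keep_end; i < n; ++i) d[i] = 0.0f;          // zero fill
  }
}

// Element-wise reference check that uses row()/col() directly.
template <Layout L>
bool matches_upper(const std::vector<float>& src, const std::vector<float>& dst,
                   std::size_t n) {
  for (std::size_t outer = 0; outer < n; ++outer) {
    for (std::size_t inner = 0; inner < n; ++inner) {
      const bool kept = row<L>(outer, inner) <= col<L>(outer, inner);
      const float expected = kept ? src[outer * n + inner] : 0.0f;
      if (dst[outer * n + inner] != expected) return false;
    }
  }
  return true;
}
```

Calling upper_to_dense<Layout::RowMajor>(src, dst, n) and upper_to_dense<Layout::ColMajor>(src, dst, n) touches memory in the same linear order; only the bounds of the kept segment differ, which is the property the change relies on.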

Benchmark Results (float, GCC/Clang x Haswell/Westmere)

RowMajor Triangular2Dense (GCC Haswell):

Size    OLD            NEW          Speedup
64      1,091 ns       199 ns       5.5x
256     206,778 ns     2,423 ns     85x
1024    7,830,936 ns   57,028 ns    137x

ColMajor: near parity across all configs (speedup 0.93x-1.36x, median ~1.00x).

Test plan

  • All triangular_1..5 tests pass
  • All selfadjoint_1..5 tests pass
  • All product_trsolve_1..5 tests pass
  • Benchmarked across GCC/Clang x Haswell/Westmere (4 configs, 3 repetitions)
  • Verified assembly: GCC uses memcpy/memset, Clang auto-vectorizes with vmovups
  • clang-format clean

Generated with Claude Code

Edited by Rasmus Munk Larsen
