Fix RowMajor performance for triangular/dense assignment
Summary
- Fixes #3031 (closed)
- Rewrite the dynamic `triangular_assignment_loop` to iterate in storage order (outer/inner matching layout) instead of always iterating outer=col, inner=row
- This gives contiguous memory access for both ColMajor and RowMajor storage, fixing a 5-137x RowMajor performance deficit while maintaining ColMajor parity
- Use compile-time `constexpr` `row()`/`col()` helpers that constant-fold to zero overhead
- Keep simple scalar loops so GCC recognizes memcpy/memset idioms and Clang auto-vectorizes
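
The storage-order idea above can be sketched in a few lines. This is a standalone illustration, not the code in this change: `Layout`, `row()`, `col()` and `upper_to_dense()` are hypothetical names, the matrix is assumed square and densely packed, and only the "copy upper triangle, zero the rest" case is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the storage-order flag (not Eigen's own type).
enum class Layout { ColMajor, RowMajor };

// Map (outer, inner) loop indices to (row, col). The outer index walks
// columns for ColMajor and rows for RowMajor, so the inner index always
// walks contiguous memory. Both helpers fold to a plain index at compile time.
template <Layout L>
constexpr std::size_t row(std::size_t outer, std::size_t inner) {
  return L == Layout::RowMajor ? outer : inner;
}
template <Layout L>
constexpr std::size_t col(std::size_t outer, std::size_t inner) {
  return L == Layout::RowMajor ? inner : outer;
}

static_assert(row<Layout::ColMajor>(2, 5) == 5 && col<Layout::ColMajor>(2, 5) == 2,
              "ColMajor: outer is the column");
static_assert(row<Layout::RowMajor>(2, 5) == 2 && col<Layout::RowMajor>(2, 5) == 5,
              "RowMajor: outer is the row");

// Copy the upper triangle (diagonal included) of an n x n matrix from src
// to dst and zero the strictly lower part, iterating in storage order.
template <Layout L>
void upper_to_dense(const std::vector<float>& src, std::vector<float>& dst,
                    std::size_t n) {
  assert(src.size() == n * n && dst.size() == n * n);
  for (std::size_t outer = 0; outer < n; ++outer) {
    const float* s = src.data() + outer * n;
    float* d = dst.data() + outer * n;
    // Inner range [copy_begin, copy_end) holds the entries with row <= col.
    // ColMajor (outer = column): rows 0..outer.
    // RowMajor (outer = row):    columns outer..n-1.
    const std::size_t copy_begin = (L == Layout::RowMajor) ? outer : 0;
    const std::size_t copy_end = (L == Layout::RowMajor) ? n : outer + 1;
    // Three plain scalar loops over contiguous memory: zero, copy, zero.
    for (std::size_t inner = 0; inner < copy_begin; ++inner) d[inner] = 0.0f;
    for (std::size_t inner = copy_begin; inner < copy_end; ++inner) d[inner] = s[inner];
    for (std::size_t inner = copy_end; inner < n; ++inner) d[inner] = 0.0f;
  }
}
```

Calling `upper_to_dense<Layout::RowMajor>(src, dst, n)` or the `ColMajor` variant touches each outer slice once with unit stride. Keeping the copy and zero ranges as separate simple loops is what gives the compiler a chance to turn them into memcpy/memset calls or vectorized stores.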
Benchmark Results (float, GCC/Clang x Haswell/Westmere)
RowMajor Triangular2Dense (GCC Haswell):
| Size | OLD | NEW | Speedup |
|---|---|---|---|
| 64 | 1,091 ns | 199 ns | 5.5x |
| 256 | 206,778 ns | 2,423 ns | 85x |
| 1024 | 7,830,936 ns | 57,028 ns | 137x |
ColMajor - near-parity across all configs (0.93x-1.36x, median ~1.00x).
Test plan
- All `triangular_1..5` tests pass
- All `selfadjoint_1..5` tests pass
- All `product_trsolve_1..5` tests pass
- Benchmarked across GCC/Clang x Haswell/Westmere (4 configs, 3 repetitions)
- Verified assembly: GCC uses memcpy/memset, Clang auto-vectorizes with vmovups
- clang-format clean
Generated with Claude Code
Edited by Rasmus Munk Larsen