Implement a generic vectorized version of Smith's algorithm for complex division.

Reference issue

Closes #2277 (closed)

What does this implement/fix?

The existing implementation of complex division is susceptible to under- and overflow. This change implements Smith's algorithm, which rescales the denominator to avoid such problems. The new implementation is about ~2x slower than the naive implementation, but makes Eigen pass the LAPACK linear solver test suite.

Additional information

Benchmark numbers for SSE:

name          old cpu/op  new cpu/op   delta
BM_Div/1      18.9ns ± 0%   18.9ns ± 0%    -0.02%  
BM_Div/8      5.98ns ± 0%  11.76ns ± 1%   +96.63%  
BM_Div/64     40.4ns ± 0%   90.3ns ± 1%  +123.32% 
BM_Div/512     325ns ± 0%    728ns ± 1%  +123.73%  
BM_Div/4k     2.74µs ± 0%   5.82µs ± 1%  +112.92% 
BM_Div/32k    22.9µs ± 6%   46.5µs ± 1%  +103.40%  
BM_Div/256k    263µs ± 2%    385µs ± 1%   +46.49% 
BM_Div/1M     1.26ms ±12%   1.75ms ± 9%   +38.05%
Edited by Rasmus Munk Larsen

Merge request reports

Loading