Use native _Float16 for AVX512FP16 and update vectorization.

This allows us to do faster native scalar operations. Also updated half/quarter packets to use the native type if available.

Benchmark improvement:

Comparing ./2910_without_float16 to ./2910_with_float16
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
BM_CalcMat<float>/10000/768/500                      -0.0041         -0.0040      58276392      58039442      58273420      58039582
BM_CalcMat<_Float16>/10000/768/500                   +0.0073         +0.0073     642506339     647214446     642481384     647188303
BM_CalcMat<Eigen::half>/10000/768/500                -0.3170         -0.3170      92511115      63182101      92506771      63179258
BM_CalcVec<float>/10000/768/500                      +0.0022         +0.0022       5198157       5209469       5197913       5209334
BM_CalcVec<_Float16>/10000/768/500                   +0.0025         +0.0026      10133324      10159111      10132641      10158507
BM_CalcVec<Eigen::half>/10000/768/500                -0.7760         -0.7760      45337937      10156952      45336532      10156389
OVERALL_GEOMEAN                                      -0.2677         -0.2677             0             0             0             0

Benchmark source: 2910.cc

Fixes #2910 (closed).

Merge request reports

Loading