Add support for AVX512-FP16 for vectorizing half precision math

What does this implement/fix?

This merge request takes advantage of the AVX512-FP16 instruction set to vectorize half-precision floating-point operations. It implements Packet32h and replaces many packet operations for the pre-existing Packet16h and Packet8h. The pre-existing AVX implementations typecast to float and back, so avoiding these intermediate conversions could improve performance significantly.

Additional information

I've run all tests against this, and it passes all of them consistently except for packetmath_13, which checks half-precision packet math. The specific failures are in the fused math operations, particularly pmsub. From my testing, this appears to come down to the lower precision of the AVX512-FP16 fused intrinsics compared to the reference calculations. I'm not sure what the best way to handle this is; feedback would be appreciated.

On my experimental setup, I measured a consistent 8-9x improvement in the bench_gemm benchmark when switching from AVX512F alone to AVX512-FP16, with OpenMP disabled. The gain held across matrix sizes from 256x256 to 4096x4096.

Specifically, the speedup was about 8.4x for 256x256, 8.3x for 512x512, 8.8x for 1024x1024, 8.7x for 2048x2048, and 9.4x for 4096x4096.

Edited by Rasmus Munk Larsen
