Add fp16 support
This MR adds FP16 support for both the CUDA and OpenCL backends, and also adds the host-side coot::fp16 type (which is configurable). On CUDA, FP16 support includes matrix multiplications but not decompositions; on OpenCL, clBLAS does not support half-precision multiplication, so half-precision multiplication will only become available there once CLBlast support is added.
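For context, here is a minimal sketch of what half-precision multiplication looks like from the user side. The Mat<coot::fp16> instantiation and the fill::randu initialization are illustrative assumptions, not necessarily the exact API or supported fills in this MR:

```cpp
#include <bandicoot>

int main()
  {
  // Illustrative sketch: a half-precision matrix product on the GPU.
  // Whether fill::randu is supported for fp16 elements is an assumption here.
  coot::Mat<coot::fp16> A(256, 256, coot::fill::randu);
  coot::Mat<coot::fp16> B(256, 256, coot::fill::randu);

  // FP16 matrix multiplication: available on the CUDA backend now,
  // and on OpenCL once CLBlast support lands.
  coot::Mat<coot::fp16> C = A * B;

  return 0;
  }
```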
When C++23 is enabled, std::float16_t is used on the host side to represent the FP16 type. Otherwise, with the CUDA backend, we can use __half (provided by CUDA). For OpenCL there is no native half-precision host type: cl_half is just a typedef of uint16_t, and we can't use it for host-side work. So, when using OpenCL without C++23, the fp16_shim class is used instead, which converts the FP16 value to an internally-held float.
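Roughly, the host-side type selection works like the sketch below. The macro names and the fp16_shim body here are simplified illustrations of the idea described above, not the exact implementation:

```cpp
#if defined(__STDCPP_FLOAT16_T__)
  // C++23: <stdfloat> provides a real half-precision type.
  typedef std::float16_t fp16;
#elif defined(COOT_USE_CUDA)
  // CUDA backend: __half comes from <cuda_fp16.h>.
  typedef __half fp16;
#else
  // OpenCL without C++23: cl_half is only a uint16_t typedef, so wrap the
  // value in a shim that presents it to host code via an internally-held float.
  class fp16_shim
    {
    public:
    fp16_shim(const float in_val) : val(in_val) { }
    operator float() const { return val; }

    private:
    float val;  // the FP16 value, converted to and held as a float
    };

  typedef fp16_shim fp16;
#endif
```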
Significant refactoring of the tests was necessary, because many of them do not work with the limited range of the FP16 type.
Still TODO is documentation, but I want to see this build successfully first.