`fill()` failure on CUDA backend

Reported by @zoq; the following test program fails on the CUDA backend:

#include <bandicoot>

using namespace coot;

int main()
{
  Mat<float> x(5, 100000);
  x.fill(float(3));
}

This produces a CUDA_ERROR_INVALID_VALUE error from cuLaunchKernel() in cuda::fill(). Almost certainly the kernel launch dimensions (grid and block sizes) are wrong. I think this will be an easy fix; I just want to write the issue down so it doesn't get lost.
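For reference, here is a minimal host-side sketch of how launch dimensions for a 2D kernel could be clamped to device limits. It is not the actual bandicoot code: `DeviceLimits`, `LaunchDims`, `dims_valid()` and `compute_launch_dims()` are hypothetical names, and the limit values are only typical figures for recent NVIDIA GPUs. Real code would query them with cuDeviceGetAttribute(). It shows one way a 5 x 100000 matrix could trigger this error, assuming the matrix size were used directly as a block or grid dimension.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical limits (assumed typical values, not taken from bandicoot);
// real code should query them with cuDeviceGetAttribute().
struct DeviceLimits
{
  std::size_t max_threads_per_block = 1024;
  std::size_t max_block_x = 1024;
  std::size_t max_block_y = 1024;
  std::size_t max_grid_x = 2147483647;
  std::size_t max_grid_y = 65535;
};

// Hypothetical container for a 2D launch configuration.
struct LaunchDims
{
  std::size_t grid_x;
  std::size_t grid_y;
  std::size_t block_x;
  std::size_t block_y;
};

// Returns true if every dimension is nonzero and within the given limits.
inline bool dims_valid(const LaunchDims& d, const DeviceLimits& lim)
{
  return d.grid_x >= 1 && d.grid_y >= 1
      && d.block_x >= 1 && d.block_y >= 1
      && d.block_x <= lim.max_block_x
      && d.block_y <= lim.max_block_y
      && d.block_x * d.block_y <= lim.max_threads_per_block
      && d.grid_x <= lim.max_grid_x
      && d.grid_y <= lim.max_grid_y;
}

// Picks block and grid sizes for an n_rows x n_cols matrix.
// Rows map to x and columns to y (an assumption for this sketch).
// If a grid dimension is clamped, the kernel must loop over the
// remaining elements itself (a grid-stride loop).
inline LaunchDims compute_launch_dims(const std::size_t n_rows,
                                      const std::size_t n_cols,
                                      const DeviceLimits& lim = DeviceLimits())
{
  assert(n_rows > 0 && n_cols > 0);

  LaunchDims d;
  d.block_x = std::min({n_rows, lim.max_block_x, lim.max_threads_per_block});
  d.block_y = std::min({n_cols, lim.max_block_y,
                        lim.max_threads_per_block / d.block_x});

  // Ceiling division, so that partially filled blocks are still launched.
  d.grid_x = std::min((n_rows + d.block_x - 1) / d.block_x, lim.max_grid_x);
  d.grid_y = std::min((n_cols + d.block_y - 1) / d.block_y, lim.max_grid_y);
  return d;
}
```

Under these assumed limits, using the matrix size directly as block dimensions (5 x 100000) fails the check, while compute_launch_dims(5, 100000) stays within bounds.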
