Solve Small Linear Systems per GPU thread

Describe the feature you would like to be implemented.

Enable the matrix decompositions for GPUs.

I am trying to solve small linear systems per GPU thread, but PartialPivLU (as an example) does not support GPUs: it cannot be called from CUDA kernels or OpenACC regions. For systems up to 4x4, I use the inverse() method, which is decorated with the __host__ __device__ qualifiers (a template specialization based on cofactors), but above 4x4 there is, as far as I know, no alternative for GPUs within Eigen.

Why are the matrix decompositions not decorated with the __host__ __device__ qualifiers? I understand that solving large linear systems per GPU thread is not viable, but it could still be very useful to have this option for small systems.

Would such a feature be useful for other users? Why?

There are many interesting problems that involve small linear systems (e.g. up to 10x10) which it would be very useful to solve in parallel, one per GPU thread. Enabling the matrix decompositions for GPUs would therefore be valuable to many users across disciplines.

Any hints on how to implement the requested feature?

Decorate the existing decomposition classes (e.g. PartialPivLU) with the __host__ __device__ qualifiers? For fixed-size matrices the decompositions should not need heap allocation, so this may be mostly a matter of annotation.
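As a rough sketch of what a device-callable, PartialPivLU-style solver for small fixed sizes could look like (illustrative code, not Eigen's implementation; solveLU and DEVICE_FN are made-up names): only stack storage, plain loops, and no exceptions, so the whole function could be marked __host__ __device__ and run once per GPU thread.

```cpp
#include <cmath>

// Sketch only: DEVICE_FN becomes the CUDA qualifiers under nvcc and
// nothing in plain C++, so the function compiles for host and device.
#if defined(__CUDACC__)
#define DEVICE_FN __host__ __device__
#else
#define DEVICE_FN
#endif

// In-place LU decomposition with partial (row) pivoting, then solve.
// On entry x holds the right-hand side b; on exit it holds the solution.
// Returns false if a zero pivot is encountered (singular matrix).
template <int N>
DEVICE_FN bool solveLU(double A[N][N], double x[N]) {
    int perm[N];                       // row permutation from pivoting
    for (int i = 0; i < N; ++i) perm[i] = i;
    for (int k = 0; k < N; ++k) {
        // Partial pivoting: pick the largest |A[i][k]| in column k.
        int p = k;
        for (int i = k + 1; i < N; ++i)
            if (std::fabs(A[i][k]) > std::fabs(A[p][k])) p = i;
        if (A[p][k] == 0.0) return false;
        if (p != k) {
            for (int j = 0; j < N; ++j) {
                double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t;
            }
            int t = perm[k]; perm[k] = perm[p]; perm[p] = t;
        }
        // Eliminate below the pivot, storing the multipliers in L's place.
        for (int i = k + 1; i < N; ++i) {
            A[i][k] /= A[k][k];
            for (int j = k + 1; j < N; ++j)
                A[i][j] -= A[i][k] * A[k][j];
        }
    }
    // Permute the right-hand side, then forward- and back-substitute.
    double y[N];
    for (int i = 0; i < N; ++i) y[i] = x[perm[i]];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < i; ++j) y[i] -= A[i][j] * y[j];
    for (int i = N - 1; i >= 0; --i) {
        for (int j = i + 1; j < N; ++j) y[i] -= A[i][j] * y[j];
        y[i] /= A[i][i];
    }
    for (int i = 0; i < N; ++i) x[i] = y[i];
    return true;
}
```

The same loops already exist inside Eigen's decompositions; the open question is whether decorating them (and avoiding the dynamic-memory and assertion paths for fixed-size matrices) is enough to make them device-safe.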

Additional resources

Edited by Christos Kotsalos