investigate OpenCL + MPI - Redmine #1804
Domain decomposition (DD) + OpenCL is broken (only MPI with a single device per physical node works); how to fix it needs investigation.
(from redmine: issue id 1804, created on 2015-08-12 by pszilard, closed on 2016-05-10)
- Revision a512a937 by Szilárd Páll on 2016-03-31T00:05:24Z:
Fix multiple tMPI ranks per OpenCL device

The OpenCL context and program objects were stored in the gpu_info struct, which was assumed to be constant per compute host and was therefore shared across the tMPI ranks: gpu_info was initialized once and a single pointer to the data was used by all ranks. As a result, the OpenCL context and program objects of different ranks sharing a single device got overwritten/corrupted by one another.

Notes:
- MPI still segfaults in clCreateContext() with multiple ranks per node, both with and without GPU sharing, so no changes on that front.
- The AMD OpenCL runtime overhead with all hardware threads in use is quite significant; as a short-term solution we should consider avoiding HT by launching fewer threads (and/or warning the user).

Refs #1804
Change-Id: I7c6c53a3e6a049ce727ae65ddf0978f436c04579
- Revision 8a8904ad by Szilárd Páll on 2016-04-27T15:46:23Z:
Fix multiple MPI ranks per node with OpenCL

As in the thread-MPI case, the source of the issue was the hardware detection code broadcasting the outcome of GPU detection within a node. The OpenCL platform and device IDs are internal OpenCL entities that differ across processes even when the platform and device(s) are shared, so the broadcast caused corruption at context creation on every rank other than the first rank in the node (which did the detection). This change disables the GPU detection data broadcast for OpenCL with MPI.

Fixes #1804
Change-Id: I90defdcb3515796c46ba89efb0ed1e3c8b1b35f9