investigate OpenCL + MPI - Redmine #1804
The DD + OpenCL combination is broken (only MPI with a single device per physical node works) and needs investigation into how to fix it.
(from redmine: issue id 1804, created on 2015-08-12 by pszilard, closed on 2016-05-10)
- Changesets:
- Revision a512a937 by Szilárd Páll on 2016-03-31T00:05:24Z:
Fix multiple tMPI ranks per OpenCL device
The OpenCL context and program objects were stored in the gpu_info
struct, which was assumed to be constant per compute host and therefore
shared across the tMPI ranks. Hence, gpu_info was initialized once,
and a single pointer to this data was used by all ranks.
As a result, the OpenCL context and program objects of different ranks
sharing a single device got overwritten/corrupted by one another.
Notes:
- MPI still segfaults in clCreateContext() with multiple ranks per node
both with and without GPU sharing, so no changes on that front.
- The AMD OpenCL runtime overhead with all hardware threads in use is quite
significant; as a short-term solution we should consider avoiding
HT by launching fewer threads (and/or warning the user).
Refs #1804
Change-Id: I7c6c53a3e6a049ce727ae65ddf0978f436c04579
- Revision 8a8904ad by Szilárd Páll on 2016-04-27T15:46:23Z:
Fix multiple MPI ranks per node with OpenCL
Similarly to the thread-MPI case, the source of the issue was
the hardware detection broadcasting the outcome of GPU detection
within a node. The OpenCL platform and device IDs, internal OpenCL
entities, differ across MPI processes even if both the platform and the
device(s) are shared. This caused corruption at context creation on all
ranks other than the first rank in the node (which did the detection).
This change disables the GPU data broadcasting for OpenCL with MPI.
Fixes #1804
Change-Id: I90defdcb3515796c46ba89efb0ed1e3c8b1b35f9
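The root cause behind the first (tMPI) fix can be illustrated without any real OpenCL: when per-rank handles such as cl_context live in a struct shared by every rank on a host, each rank's creation step clobbers the handle stored by the previous rank. The following is a minimal hypothetical sketch; handle_t, rank_creates_context_shared, and the other names are made-up stand-ins, not GROMACS or OpenCL API.

```c
/* Hypothetical sketch of the tMPI bug: per-rank handles stored in a
 * struct shared by all ranks on a host overwrite one another.
 * handle_t stands in for cl_context; no real OpenCL calls are made. */

typedef struct { int owner_rank; } handle_t;

/* Buggy scheme: one shared slot (the old gpu_info) for all ranks. */
static handle_t *shared_context;

/* Models a rank storing its clCreateContext() result in shared state:
 * whatever another rank stored before is silently clobbered. */
static void rank_creates_context_shared(handle_t *h)
{
    shared_context = h;
}

static handle_t *get_shared_context(void)
{
    return shared_context;
}

/* Fixed scheme: each rank keeps its own handle storage. */
typedef struct { handle_t *context; } per_rank_state_t;

static void rank_creates_context_private(per_rank_state_t *st, handle_t *h)
{
    st->context = h;
}
```

With the shared slot, the last rank to "create" wins and every other rank is left using a handle it does not own; with per-rank storage, each rank's handle survives, which is what moving the context/program objects out of gpu_info achieved.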
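The second (MPI) fix turns on the fact that OpenCL platform and device IDs are process-local opaque handles: values obtained on rank 0 are meaningless in any other process, so broadcasting the detection result cannot work, and each rank must detect locally. A hypothetical model of this (no real MPI or OpenCL; detect_platform and broadcast_from_rank0 are invented names, and process-local address spaces are modeled by rank-indexed slots):

```c
/* Hypothetical model of the MPI bug: OpenCL IDs are process-local
 * opaque handles, so raw handle bytes broadcast from rank 0 do not
 * match what another rank would obtain from its own detection. */

typedef struct { void *impl; } opaque_id_t;

/* Real processes differ via separate address spaces; here that is
 * modeled by giving each rank its own storage slot. */
static int process_local_storage[8];

/* Models clGetPlatformIDs(): the returned handle is valid only inside
 * the process (rank) that performed the detection. */
static opaque_id_t detect_platform(int rank)
{
    opaque_id_t id = { &process_local_storage[rank] };
    return id;
}

/* Models MPI_Bcast of the raw handle from rank 0: every receiver ends
 * up holding rank 0's process-local value. */
static opaque_id_t broadcast_from_rank0(void)
{
    return detect_platform(0);
}
```

Hence the fix: disable the GPU-data broadcast for OpenCL with MPI and let every rank run detection itself, at the cost of duplicated detection work per node.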