Realm: fix handling of GPU oversubscription through -ll:gpu_ids
The old code does not handle the case where -ll:gpu_ids lists more GPU IDs than there are physical devices, which is how one requests multiple GPU processors per physical device. For example, on a machine with 8 GPUs, the following:
-ll:gpu_ids 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7
would fail with:
[0 - 1464c51c8000] 0.000000 {6}{gpu}: 16 GPUs requested, but only 8 available!
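The selection behavior this change aims for can be modeled outside Realm with the minimal sketch below. It is a simplified standalone illustration, not the actual Realm code: `select_devices` and the `usable` callback are hypothetical names, with `usable` standing in for "primary context creation and the memory check succeeded". The precondition that the request count does not exceed the length of the fixed ID list is also an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical model of the per-processor device selection loop.
//   num_devices   : number of physical devices detected
//   num_requested : number of GPU processors to create
//   skip          : leading devices to skip when no fixed IDs are given
//   fixed_indices : device IDs from -ll:gpu_ids (may repeat), or empty
//   usable        : stand-in for "this device could be initialized"
std::vector<unsigned> select_devices(std::size_t num_devices,
                                     std::size_t num_requested,
                                     std::size_t skip,
                                     const std::vector<unsigned> &fixed_indices,
                                     const std::function<bool(unsigned)> &usable)
{
  // assumption of this sketch: with fixed IDs, one ID is listed per processor
  assert(fixed_indices.empty() || num_requested <= fixed_indices.size());

  std::vector<unsigned> chosen;
  // iterate over requested GPU processors, not over physical devices
  while(chosen.size() < num_requested) {
    bool success = false;
    unsigned idx = fixed_indices.empty()
                       ? static_cast<unsigned>(chosen.size() + skip)
                       : fixed_indices[chosen.size()];
    do {
      if(idx < num_devices && usable(idx)) {
        chosen.push_back(idx);
        success = true;
        break;
      }
      // this device failed; only move on to the next one if IDs are not fixed
    } while(fixed_indices.empty() && ++idx < num_devices);

    // no device could be assigned to this GPU processor
    if(!success)
      break;
  }
  return chosen;
}
```

In this model, passing the 16-entry ID list from the example above with `num_devices = 8` yields 16 assignments, two per physical device, because the loop bound is the number of requested processors rather than the number of devices.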
It may be easier to inspect the diff ignoring whitespace changes:
diff --git a/runtime/realm/cuda/cuda_module.cc b/runtime/realm/cuda/cuda_module.cc
index 8a2d54fa4..9c62cc6a4 100644
--- a/runtime/realm/cuda/cuda_module.cc
+++ b/runtime/realm/cuda/cuda_module.cc
@@ -4231,10 +4231,10 @@ namespace Realm {
gpus.resize(cfg_num_gpus);
unsigned gpu_count = 0;
// try to get cfg_num_gpus, working through the list in order
- for(size_t i = cfg_skip_gpu_count;
- (i < gpu_info.size()) && (gpu_count < cfg_num_gpus); i++) {
- int idx = (fixed_indices.empty() ? i : fixed_indices[i]);
-
+ while (gpu_count < cfg_num_gpus) {
+ bool success = false;
+ unsigned idx = fixed_indices.empty() ? gpu_count + cfg_skip_gpu_count : fixed_indices[gpu_count];
+ do {
// try to create a context and possibly check available memory - in order
// to be compatible with an application's use of the cuda runtime, we
// need this to be the device's "primary context"
@@ -4316,6 +4316,15 @@ namespace Realm {
dedicated_workers[g] = worker;
gpus[gpu_count++] = g;
+ success = true;
+ break;
+
+ // we failed to use this particular device, but if the GPU indices are not fixed
+ // we can try again with the next one
+ } while (fixed_indices.empty() && ++idx < gpu_info.size());
+
+ // we failed to assign any available device to this GPU processor
+ if (!success) break;
}
// did we actually get the requested number of GPUs?
Edited by Manolis Papadakis