Realm: fix handling of GPU oversubscription through -ll:gpu_ids

Manolis Papadakis requested to merge fix-gpu-ids into master

The old code does not handle the case where -ll:gpu_ids lists more IDs than there are physical devices, which is how one requests multiple GPU processors per physical device. For example, on a machine with 8 GPUs, the following:

-ll:gpu_ids 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7

would fail with:

[0 - 1464c51c8000]    0.000000 {6}{gpu}: 16 GPUs requested, but only 8 available!
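
To make the oversubscription in this example concrete, here is a minimal standalone sketch that splits such an ID list and counts how many GPU processors each physical device would host. The helpers `parse_gpu_ids` and `processors_per_device` are hypothetical names used only for illustration; they are not Realm's actual command-line parser.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not Realm's parser): split "0,0,1,1" into {0, 0, 1, 1}.
std::vector<unsigned> parse_gpu_ids(const std::string &arg) {
  std::vector<unsigned> ids;
  std::stringstream ss(arg);
  std::string tok;
  while (std::getline(ss, tok, ',')) {
    if (!tok.empty())
      ids.push_back(static_cast<unsigned>(std::stoul(tok)));
  }
  return ids;
}

// Hypothetical helper: count how many GPU processors each device ID would host.
std::map<unsigned, std::size_t>
processors_per_device(const std::vector<unsigned> &ids) {
  std::map<unsigned, std::size_t> counts;
  for (unsigned id : ids)
    ++counts[id];
  return counts;
}
```

For the list above this gives 16 requested processors spread over 8 distinct device IDs, two per device, so the number of requested processors exceeds the number of physical devices even though every individual ID is valid.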

It may be easier to inspect the diff ignoring whitespace changes:

diff --git a/runtime/realm/cuda/cuda_module.cc b/runtime/realm/cuda/cuda_module.cc
index 8a2d54fa4..9c62cc6a4 100644
--- a/runtime/realm/cuda/cuda_module.cc
+++ b/runtime/realm/cuda/cuda_module.cc
@@ -4231,10 +4231,10 @@ namespace Realm {
       gpus.resize(cfg_num_gpus);
       unsigned gpu_count = 0;
       // try to get cfg_num_gpus, working through the list in order
-      for(size_t i = cfg_skip_gpu_count;
-          (i < gpu_info.size()) && (gpu_count < cfg_num_gpus); i++) {
-        int idx = (fixed_indices.empty() ? i : fixed_indices[i]);
-
+      while (gpu_count < cfg_num_gpus) {
+        bool success = false;
+        unsigned idx = fixed_indices.empty() ? gpu_count + cfg_skip_gpu_count : fixed_indices[gpu_count];
+        do {
           // try to create a context and possibly check available memory - in order
           //  to be compatible with an application's use of the cuda runtime, we
           //  need this to be the device's "primary context"
@@ -4316,6 +4316,15 @@ namespace Realm {
             dedicated_workers[g] = worker;

           gpus[gpu_count++] = g;
+          success = true;
+          break;
+
+          // we failed to use this particular device, but if the GPU indices are not fixed
+          // we can try again with the next one
+        } while (fixed_indices.empty() && ++idx < gpu_info.size());
+
+        // we failed to assign any available device to this GPU processor
+        if (!success) break;
       }

       // did we actually get the requested number of GPUs?
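
The control flow introduced by the diff can be sketched in isolation as follows. This is a simplified model, not the actual Realm code: `DeviceModel` and `select_devices` are hypothetical, the real per-device setup (primary context creation, memory checks, worker setup) is reduced to a single `usable` flag, and the bounds checks are additions made for the sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real per-device setup: a device is simply
// usable or not.
struct DeviceModel {
  bool usable;
};

// Returns the device index chosen for each GPU processor, in order.
// Stops early if some processor cannot be assigned a device.
std::vector<unsigned> select_devices(const std::vector<DeviceModel> &gpu_info,
                                     const std::vector<unsigned> &fixed_indices,
                                     unsigned cfg_num_gpus,
                                     unsigned cfg_skip_gpu_count) {
  std::vector<unsigned> chosen;
  while (chosen.size() < cfg_num_gpus) {
    const unsigned gpu_count = static_cast<unsigned>(chosen.size());
    // Sketch-only guard: never read past the end of the fixed list.
    if (!fixed_indices.empty() && gpu_count >= fixed_indices.size())
      break;
    bool success = false;
    unsigned idx = fixed_indices.empty() ? gpu_count + cfg_skip_gpu_count
                                         : fixed_indices[gpu_count];
    do {
      // Sketch-only guard on idx; the flag models "device setup succeeded".
      if (idx < gpu_info.size() && gpu_info[idx].usable) {
        chosen.push_back(idx);
        success = true;
        break;
      }
      // This device failed; if the indices are not fixed, try the next one.
    } while (fixed_indices.empty() && ++idx < gpu_info.size());

    // No device could be assigned to this GPU processor.
    if (!success)
      break;
  }
  return chosen;
}
```

Because the outer loop is driven by the number of processors created rather than by the number of physical devices, a fixed list may name the same device several times. In this model, a fixed index that points at an unusable device ends the search, while the non-fixed case moves on to the next device.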
Edited by Manolis Papadakis
