Fix Has Triggered to Avoid Non-Monotonic Results (!1131) · Merge requests · StanfordLegion / legion

Mike Bauer requested to merge fixhastriggered into master Feb 23, 2024

I observed a failure mode where repeated calls to has_triggered would return non-monotonic results from the same thread on the same node. Specifically the first call to has_triggered on an event would report that it had triggered, and then a later call would report that the event had not triggered. I added the following logging statement to the code in the master branch:

diff --git a/runtime/realm/event_impl.cc b/runtime/realm/event_impl.cc
index 1f91eeb42..41721fb4f 100644
--- a/runtime/realm/event_impl.cc
+++ b/runtime/realm/event_impl.cc
@@ -1632,6 +1632,7 @@ namespace Realm {
          poisoned = it->second;
        }
       }
+      log_event.print() << "full test " << me << " " << needed_gen << " " << locally_triggered;
       return locally_triggered;
     }

I then logged the output and I observed a non-monotonic result like the following:

event_1.log:[1 - 7fd0da6417c0]    1.027775 {3}{event}: full test 8000002004000000 34 1
event_1.log:[1 - 7fd0da6417c0]    1.027795 {3}{event}: full test 8000002004000000 34 0

So the same thread did two has_triggered calls on the same event 20us apart, and the first one returned true and the second returned false. While I think it is fine for has_triggered to possibly report that an event hasn't triggered when it actually has, I do think that has_triggered should be monotonic, meaning that once it does report that an event has triggered, it needs to do so consistently.

This branch fixes the non-monotonic behavior by retesting the generation after acquiring the lock in the has_triggered method. I confirmed this fixes the failing Legion test case that I was observing.

There are two remaining open questions:

Do any fixes need to be applied to BarrierImpl::has_triggered to maintain the same monotonicity? I didn't see any that are needed, but someone should probably check it more carefully.
I noticed that the EventImpl::has_local_triggers member variable is not marked atomic but is being read outside the lock. I've mostly convinced myself that this is ok since a stale true will force us to take the lock and do the more detailed test, while a stale false will just report that a later generation hasn't triggered. I don't think you can get non-monotonic behavior from this, but someone else should check.

Fix Has Triggered to Avoid Non-Monotonic Results

Merge request reports