Fix Has Triggered to Avoid Non-Monotonic Results
I observed a failure mode where repeated calls to has_triggered
would return non-monotonic results from the same thread on the same node. Specifically the first call to has_triggered
on an event would report that it had triggered, and then a later call would report that the event had not triggered. I added the following logging statement to the code in the master
branch:
diff --git a/runtime/realm/event_impl.cc b/runtime/realm/event_impl.cc
index 1f91eeb42..41721fb4f 100644
--- a/runtime/realm/event_impl.cc
+++ b/runtime/realm/event_impl.cc
@@ -1632,6 +1632,7 @@ namespace Realm {
poisoned = it->second;
}
}
+ log_event.print() << "full test " << me << " " << needed_gen << " " << locally_triggered;
return locally_triggered;
}
I then logged the output and I observed a non-monotonic result like the following:
event_1.log:[1 - 7fd0da6417c0] 1.027775 {3}{event}: full test 8000002004000000 34 1
event_1.log:[1 - 7fd0da6417c0] 1.027795 {3}{event}: full test 8000002004000000 34 0
So the same thread did two has_triggered
calls on the same event 20us apart, and the first one returned true
and the second returned false
. While I think it is fine for has_triggered
to possibly report that an event hasn't triggered when it actually has, I do think that has_triggered
should be monotonic, meaning that once it does report that an event has triggered, it needs to do so consistently.
This branch fixes the non-monotonic behavior by retesting the generation after acquiring the lock in the has_triggered
method. I confirmed this fixes the failing Legion test case that I was observing.
There are two remaining open questions:
- Do any fixes need to be applied to
BarrierImpl::has_triggered
to maintain the same monotonicity? I didn't see any that are needed, but someone should probably check it more carefully. - I noticed that the
EventImpl::has_local_triggers
member variable is not markedatomic
but is being read outside the lock. I've mostly convinced myself that this is ok since a staletrue
will force us to take the lock and do the more detailed test, while a stalefalse
will just report that a later generation hasn't triggered. I don't think you can get non-monotonic behavior from this, but someone else should check.