Data race causing periodic master restarts

There appears to be a data race on the datagrams: different kernel threads can run between this check and this assignment.

Decompiling the compiled kernel module shows that the optimizer reorders the assignments in master.c: the state is set to DATAGRAM_RECEIVED first, and only afterwards is the WKC written. This in turn produces the error in fsm_master.c, so the master is restarted periodically, whenever the threads are scheduled too closely to one another. It also explains why we fail this condition but pass this condition, even though the source code performs the assignments in a different order.

Given the above, we've patched these fields to be _Atomic. This has partly resolved our issue: the assignments are no longer reordered and are in fact synchronized. We no longer get this logging; instead we get this logging (with datagram state = sent), and the master FSM is no longer restarted.
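To make the ordering requirement concrete, here is a minimal userspace sketch using C11 atomics. It is not the actual code from master.c: `demo_datagram`, `demo_publish` and `demo_try_consume` are hypothetical names, and unlike our patch it makes only the state field atomic and uses explicit release/acquire ordering. The writer stores the WKC first and then publishes the state with a release store; the reader only looks at the WKC after an acquire load has observed the received state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the types used by the real master code. */
enum demo_state { DEMO_INIT, DEMO_SENT, DEMO_RECEIVED };

struct demo_datagram {
    uint16_t working_counter; /* plain payload field */
    _Atomic int state;        /* publication flag */
};

/* Receiving side: write the payload, then publish the state.
 * The release store keeps the WKC write from being moved after it. */
void demo_publish(struct demo_datagram *dg, uint16_t wkc)
{
    assert(dg != NULL);
    dg->working_counter = wkc;
    atomic_store_explicit(&dg->state, DEMO_RECEIVED, memory_order_release);
}

/* FSM side: read the WKC only once the received state has been observed.
 * Returns false while the datagram is still in flight. */
bool demo_try_consume(struct demo_datagram *dg, uint16_t *wkc_out)
{
    assert(dg != NULL && wkc_out != NULL);
    if (atomic_load_explicit(&dg->state, memory_order_acquire) != DEMO_RECEIVED)
        return false;
    *wkc_out = dg->working_counter;
    return true;
}
```

In kernel code the usual counterparts would be `smp_store_release()` and `smp_load_acquire()` rather than `_Atomic`; whether that fits the master's locking model is exactly what I am unsure about.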

However, this doesn't feel like the proper solution: the thread that processes the datagrams (that is, the master FSM) is still reading the datagrams while another thread is processing the response and updating them.
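As an illustration of what closing the logical race could involve, the following is a rough seqlock-style sketch in userspace C11. It is an assumption on my part, not something taken from the master sources: `demo_slot`, `demo_write` and `demo_read` are hypothetical names, and it assumes a single writer. The writer makes a sequence counter odd while it updates the fields, and the reader rejects any snapshot taken while the counter was odd or changed underneath it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot shared between one writer and one or more readers. */
struct demo_slot {
    atomic_uint seq;      /* even: stable, odd: update in progress */
    _Atomic int state;
    _Atomic uint16_t wkc;
};

/* Single writer: bump seq to odd, update the fields, bump seq to even. */
void demo_write(struct demo_slot *s, int state, uint16_t wkc)
{
    assert(s != NULL);
    unsigned seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->wkc, wkc, memory_order_relaxed);
    atomic_store_explicit(&s->state, state, memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader: returns true only if state and wkc belong to the same update.
 * The outputs are written only on success; on false, retry later. */
bool demo_read(struct demo_slot *s, int *state_out, uint16_t *wkc_out)
{
    assert(s != NULL && state_out != NULL && wkc_out != NULL);
    unsigned before = atomic_load_explicit(&s->seq, memory_order_acquire);
    if (before & 1u)
        return false; /* writer is mid-update */
    uint16_t wkc = atomic_load_explicit(&s->wkc, memory_order_relaxed);
    int state = atomic_load_explicit(&s->state, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    unsigned after = atomic_load_explicit(&s->seq, memory_order_relaxed);
    if (before != after)
        return false; /* fields changed while we were reading */
    *state_out = state;
    *wkc_out = wkc;
    return true;
}
```

A reader that gets `false` would simply treat the datagram as not ready and look again on the next cycle, instead of acting on a half-updated datagram.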

So I have two questions:

  1. Is this a known issue and are we actually doing something wrong in user-space?
  2. If this isn't a known issue, what would be the proper fix? Marking the fields atomic resolves the data race but does not solve the logical race condition.