State machine in-memory attribute corruption after rescued after_commit exception
Problem to solve
- Audit all state machine transitions in rescue blocks. Create a separate issue for each identified problem
- Update development docs
- Add an automated detection mechanism (Duo, RuboCop, etc.)
Background
If a state machine transition raises and the exception is rescued, and the rescue block attempts another transition on the same object, then the second transition starts from the wrong from state, because catch_exceptions in the state machine gem rolls back the in-memory attribute.
This, in turn, causes the wrong state machine callbacks to fire.
The rollback writes the original from state back to the in-memory attribute, even though the database has already committed the new state. Any subsequent transition on the same object instance will see a stale from state, causing after_transition from: X guards to match (or not match) incorrectly.
Known impact
This was identified as the root cause of zombie ci_running_builds entries in #590004. When a build transitions pending -> running and the after_commit hook raises, the in-memory status is rolled back to pending. The subsequent drop! then transitions from pending (in-memory) instead of running, so the after_transition running: any callback that deletes the Ci::RunningBuild record never fires.
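The sequence above can be reproduced with a toy model. This is only a sketch: ToyBuild, its transition method, and the hash standing in for the database row are hypothetical names invented for illustration, not the state machine gem or Ci::Build. It assumes the rollback restores only the in-memory attribute, as described in the background.

```ruby
# Toy stand-in for a record with a status state machine. This is NOT the
# state machine gem or Ci::Build; it only mimics the rollback described above.
class ToyBuild
  attr_reader :status, :deleted_running_record

  def initialize(db)
    @db = db                      # hypothetical hash playing the database row
    @status = db[:status]
    @deleted_running_record = false
  end

  # Commit the new state, then run the after-commit hook. If the hook raises,
  # restore only the in-memory attribute; the committed value is left alone.
  def transition(to)
    from = @status
    @status = to
    @db[:status] = to             # already committed, never rolled back here
    # Stand-in for `after_transition running: any`: cleanup when leaving running.
    @deleted_running_record = true if from == "running"
    yield if block_given?         # stand-in for an after_commit hook
    true
  rescue StandardError
    @status = from                # stale from now on: the row still says `to`
    raise
  end
end

db = { status: "pending" }
build = ToyBuild.new(db)

begin
  build.transition("running") { raise "after_commit hook failed" }
rescue RuntimeError
  build.transition("failed")      # stand-in for drop!, starts from stale "pending"
end

puts "db: #{db[:status]}, cleanup ran: #{build.deleted_running_record}"
```

In this sketch the second transition sees pending as its from state, so the cleanup tied to leaving running is skipped even though the committed row passed through running.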
Possible mitigations
- Call reset or reload before retrying a transition after rescuing an exception from a state machine callback. This is the approach taken in !223604 (merged) for the specific CI build case.
- Change the state machine gem?
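A minimal sketch of the first mitigation, using a toy record rather than ActiveRecord. ToyRecord, its transition method, and the hash standing in for the database row are hypothetical; only the idea of reloading before the retry comes from this issue, and the sketch does not reproduce the change in !223604.

```ruby
# Toy record, NOT ActiveRecord: `reload` re-reads status from a hash that
# plays the role of the database row.
class ToyRecord
  attr_reader :status, :transitions

  def initialize(db)
    @db = db
    @status = db[:status]
    @transitions = []             # [from, to] pairs as seen by callbacks
  end

  def reload
    @status = @db[:status]
    self
  end

  # Same simplified behaviour as described in the background: the new state is
  # committed, and a failing hook rolls back only the in-memory attribute.
  def transition(to)
    from = @status
    @status = to
    @db[:status] = to
    @transitions << [from, to]
    yield if block_given?         # stand-in for an after_commit hook
    true
  rescue StandardError
    @status = from
    raise
  end
end

db = { status: "pending" }
record = ToyRecord.new(db)

begin
  record.transition("running") { raise "after_commit hook failed" }
rescue RuntimeError
  record.reload                   # resync with the committed state first
  record.transition("failed")
end

p record.transitions
```

With the reload in place, the retry in this sketch starts from running, so callbacks guarded on that from state would match.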
Related issues
- #590004 - Fix zombie entries in running builds table (specific instance of this problem)