State machine in-memory attribute corruption after rescued after_commit exception

Problem to solve

  • Audit all state machine transitions in rescue blocks. Create a separate issue for each identified problem.
  • Update the development docs.
  • Add an automated detection mechanism (Duo, RuboCop, etc.).

Background

If an exception raised during a state machine transition is rescued, and the rescue block attempts another transition on the same object, the second transition starts from the wrong from state, because catch_exceptions in the state machine gem rolls back the in-memory attribute.

This, in turn, causes the wrong state machine callbacks to fire.

The rollback writes the original from state back to the in-memory attribute, even though the database has already committed the new state. Any subsequent transition on the same object instance will see a stale from state, causing after_transition from: X guards to match (or not match) incorrectly.
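
To make the sequence concrete, here is a minimal pure-Ruby simulation of the behaviour described above. It does not use the state machine gem or ActiveRecord; FakeRecord, DB, transition_to and seen_from_states are hypothetical names, and the rollback in the rescue clause is an assumption that only mimics what this issue describes.

```ruby
# Pure-Ruby simulation of the failure mode; no gems required.
# FakeRecord, DB and transition_to are hypothetical names, not the gem's API.
class FakeRecord
  DB = {} # stands in for the committed database row

  attr_reader :status, :seen_from_states

  def initialize(id, status)
    @id = id
    @status = status
    @seen_from_states = []
    DB[id] = status
  end

  # Assumed behaviour: if anything raises during the transition, the
  # in-memory attribute is restored, even though the commit has happened.
  def transition_to(to, after_commit: -> {})
    from = @status
    @seen_from_states << from # what an `after_transition from: X` guard would match on
    @status = to
    DB[@id] = to              # the new state is committed here
    after_commit.call         # may raise after the commit
    true
  rescue StandardError
    @status = from            # in-memory rollback only; DB is untouched
    raise
  end
end

record = FakeRecord.new(1, :pending)
begin
  record.transition_to(:running, after_commit: -> { raise "hook failed" })
rescue RuntimeError
  # At this point memory says :pending while the database says :running.
  record.transition_to(:failed)
end

p record.seen_from_states # the second entry is the stale :pending
p FakeRecord::DB[1]
```

In this sketch the second transition records pending as its from state even though the committed state was running, which is the mismatch that makes from-based guards misfire.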

Known impact

This was identified as the root cause of zombie ci_running_builds entries in #590004. When a build transitions pending -> running and the after_commit hook raises, the in-memory status is rolled back to pending. The subsequent drop! then transitions from pending (in-memory) instead of running, so the after_transition running: any callback that deletes the Ci::RunningBuild record never fires.

Possible mitigations

  • Call reset or reload before retrying a transition after rescuing an exception from a state machine callback. This is the approach taken in !223604 (merged) for the specific CI build case.
  • Change the state machine gem so that it does not roll back the in-memory attribute once the new state has been committed?
  • #590004 - Fix zombie entries in running builds table (specific instance of this problem)
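
The first mitigation can be sketched against a simulated version of the CI case from "Known impact". This is a pure-Ruby stand-in, not GitLab code: FakeBuild, ROWS, RUNNING_BUILDS, run!, drop! and run_then_drop are hypothetical, reload here only imitates re-reading the committed row, and the rollback in run! is an assumption modelled on the behaviour described in this issue.

```ruby
# Pure-Ruby simulation; no gems required. FakeBuild, ROWS, RUNNING_BUILDS,
# run!, drop! and run_then_drop are hypothetical stand-ins, not GitLab code.
class FakeBuild
  ROWS = {}           # committed build status per id
  RUNNING_BUILDS = {} # stands in for the ci_running_builds table

  attr_reader :status

  def initialize(id)
    @id = id
    @status = :pending
    ROWS[id] = :pending
  end

  # pending -> running; if the after_commit hook raises after the commit,
  # the in-memory status is rolled back (the assumed gem behaviour).
  def run!(after_commit: -> {})
    ROWS[@id] = @status = :running
    RUNNING_BUILDS[@id] = true
    after_commit.call
  rescue StandardError
    @status = :pending
    raise
  end

  # Models `after_transition running: any`: cleanup only runs when the
  # in-memory from state is :running.
  def drop!
    from = @status
    ROWS[@id] = @status = :failed
    RUNNING_BUILDS.delete(@id) if from == :running
  end

  # Stand-in for reload: re-read the committed state into memory.
  def reload
    @status = ROWS[@id]
    self
  end
end

# Rescue the failed transition, optionally reload, then drop the build.
def run_then_drop(build, reload_first:)
  build.run!(after_commit: -> { raise "hook failed" })
rescue RuntimeError
  build.reload if reload_first
  build.drop!
end

run_then_drop(FakeBuild.new(1), reload_first: false)
run_then_drop(FakeBuild.new(2), reload_first: true)

p FakeBuild::RUNNING_BUILDS.keys # only build 1 is left behind as a zombie
```

Reloading before the retry costs one extra read, but it makes the from state match what was committed, so the cleanup guard sees running and removes the entry.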
Edited by Hordur Freyr Yngvason