Recover stuck stopping environments

Hunter Stewart requested to merge hustewart-recover-failed-stopping into master

What does this MR do and why?

See Recover and revert stuck-stopping environments (#425161 - closed) for more context.

When jobs that stop environments fail, the environment can get stuck in a state of "stopping." We want those environments to recover to a state of "available."

This MR addresses that concern with the following:

  • adds a new state event to environments to represent going from "stopping" back to "available"
  • adds a new worker to fire the state event given the proper conditions
  • makes deployables enqueue that new worker when they fail
  • includes the changes generated by the worker-related rake tasks that are run when adding a new worker
  • adds specs
  • updates environments spec factory
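
The state event and worker described above can be sketched in plain Ruby as follows. This is a minimal illustration, not the code in this MR: the event name `recover_stuck_stopping`, the `Job` struct, and the exact conditions checked by the worker (the job failed and it is a stop job) are assumptions made for the example. Only the worker name `StopJobFailedWorker` and the "stopping" and "available" states come from this description.

```ruby
# Minimal plain-Ruby sketch of the idea; not the actual GitLab implementation.
class Environment
  attr_reader :state

  def initialize(state = "available")
    @state = state
  end

  # Existing behavior: running a stop action moves the environment to "stopping".
  def stop
    @state = "stopping" if @state == "available"
  end

  # Hypothetical name for the new state event: "stopping" -> "available".
  # Returns true if the transition happened, false otherwise.
  def recover_stuck_stopping
    return false unless @state == "stopping"

    @state = "available"
    true
  end
end

# Hypothetical stand-in for a deployable (for example, a CI job).
Job = Struct.new(:environment, :failed, :stops_environment, keyword_init: true)

# Stand-in for the new worker. The conditions below are assumed for illustration.
class StopJobFailedWorker
  def perform(job)
    return unless job.failed && job.stops_environment

    job.environment.recover_stuck_stopping
  end
end

env = Environment.new
env.stop # the environment is now "stopping"

# The stop job fails, so the deployable enqueues the worker (called inline here).
failed_job = Job.new(environment: env, failed: true, stops_environment: true)
StopJobFailedWorker.new.perform(failed_job)

puts env.state # prints "available"
```

Guarding the transition on the current state keeps the event a no-op for environments that are not stuck, so the worker is safe to run more than once for the same job.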

How to set up and validate locally

#363197 (closed) provides the steps to verify the behavior.

I recommend running through it on master first to see the current behavior.

After that, switch to this MR's branch and run through it again, noting the following differences.

on master

  • the relevant Environment will be stuck in stopping (you can check in the rails console)
  • the stop job will show up as requiring manual action when you get to the end of the steps

on this branch

  • the relevant Environment will be in a state of available
  • the stop job will run without manual action required
  • you can tail the background jobs and look for the processing of the job: gdk tail rails-background-jobs | grep StopJobFailedWorker

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Hunter Stewart
