Investigate solutions for `auto_stop_in` not engaging when deployment has failed
Expected outcome of this issue
A solution to the problem. The output would be either: a) a step-by-step description of what needs to be done, or b) a PoC MR demonstrating the solution
Background
This came from a problem with `auto_stop_in` that a user reported in #382549 (closed). The actual problems are stated in:
> ...we have a helm chart deployment in the deploy job. This is a complex task which may fail at different stages of the deployment. So although it ends as failed, there may be resources created which we need to eventually delete. Ideally, we would like to have the stop job run in any case (both successful and unsuccessful deploy job) even if there was no prior successful job, but that is a bit out of the current spec for the feature. But if it works correctly as it is designed now, that is also a step forward for us.
>
> We are using this deploy for QA environments. They should be available for 2 days maximum; the developers can test their stuff there, and they can also stop the environments manually, but in case they forget, we need them to automatically clean up after 2 days.
>
> So in both cases, whether it deployed correctly or not, we need them to stop in 2 days.
>
> In case it failed to deploy for whatever reason, we don't want to clean up immediately but give them a chance to deploy again using the same job after some time (maybe there were temporarily not enough resources, or some other error). The subsequent redeploy (without cleaning) will take a shorter time because some of the resources might have been created correctly. We also want them to be able to investigate the "wrongly" deployed env.
From the second point above, `when: on_failure` on the stop job is not sufficient, because the users need the failing environment to stay up for a period of time in order to investigate the cause of the failure.
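For context, the most obvious tweak would be to mark the stop job `when: on_failure`, as sketched below. This is not a viable solution here, because it tears the environment down as soon as the deploy fails, leaving no 2-day investigation window:

```yaml
# NOT a viable solution for this use case: the teardown runs immediately
# when `deploy` fails, so there is no window to investigate the broken
# environment before it is cleaned up.
stop:
  stage: stop
  script:
    - echo "Tearing down Review Environment..."
  when: on_failure
  environment:
    name: production
    action: stop
```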
Current and expected behavior of CI Pipeline
This is in relation to the problem above.
Scenario

There are 2 jobs, `deploy` and `stop`:

- `deploy` spins up an Environment where an Application is running
- `stop` tears down that Environment

Simple `.gitlab-ci.yml` illustration:
```yaml
deploy:
  stage: deploy
  script:
    - echo "Spinning up Review Environment..."
  environment:
    name: production
    on_stop: stop # refers to the stop job below
    auto_stop_in: 2 days

stop:
  stage: stop
  script:
    - echo "Tearing down Review Environment..."
  when: manual # the job can still be run manually
  environment:
    name: production
    action: stop
```
Current behavior

- The `deploy` job creates many different resources, not just the Application being deployed (e.g. Secrets, Configuration, etc.)
- Using the `auto_stop_in` configuration, the `stop` job is expected to run 2 days after the `deploy` job
- ✅ If the `deploy` job is successful, the `stop` job runs after 2 days as expected
- ❌ If the `deploy` job fails, the `stop` job does not run at all
Desired behavior

- The `deploy` job creates many different resources, not just the Application being deployed (e.g. Secrets, Configuration, etc.)
- Using the `auto_stop_in` configuration, the `stop` job is expected to run 2 days after the `deploy` job
- ✅ If the `deploy` job is successful, the `stop` job runs after 2 days
- ✅ If the `deploy` job fails, the `stop` job runs after 2 days
  - In this case, the user still needs the `stop` job to run to clean up all "dangling" resources created during the failed deploy
  - Even if the `deploy` job fails, the user prefers to have those 2 days of delay before the `stop` job is run. This is because it's a Review Environment, and they would like to have a chance to investigate what was failing in the deploy
- In conclusion, in either a successful or failed deploy, they need the `stop` job to run 2 days after the `deploy` job
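Until the pipeline behavior changes, one possible interim workaround is a scheduled pipeline that stops the environment through the Environments API, independent of how the last deploy ended. This is only a sketch: `$API_TOKEN` and `$ENV_ID` are hypothetical variables (an access token with API scope and the environment's numeric ID) that would have to be provided, e.g. via CI/CD settings.

```yaml
# Workaround sketch, not the proposed fix: a pipeline schedule
# (e.g. running daily) stops the environment via the API, regardless
# of the last deploy job's status.
# $API_TOKEN and $ENV_ID are hypothetical, user-provided variables;
# $CI_API_V4_URL and $CI_PROJECT_ID are predefined GitLab variables.
scheduled_cleanup:
  stage: stop
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - >
      curl --fail --request POST
      --header "PRIVATE-TOKEN: $API_TOKEN"
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/environments/$ENV_ID/stop"
```

This relies on the `POST /projects/:id/environments/:environment_id/stop` endpoint of the Environments API and does not replicate `auto_stop_in` semantics exactly; it trades the "2 days after the deploy" timing for a fixed schedule.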
Outcome
Please see the investigation summary here: #429616 (comment 1691931533)
Solution proof of concept: POC: Extend Environment stop actions to include... (!139612 - closed)