Investigate solutions for `auto_stop_in` not engaging when deployment has failed
Expected outcome of this issue
A solution to the problem. The output would be either: a) a step-by-step description of what needs to be done, or b) a PoC MR demonstrating the solution
Background
This came from a problem with `auto_stop_in` that a user reported in #382549 (closed). The actual problems are stated in:
> ...we have a helm chart deployment in the deploy job. This is a complex task which may fail at different stages of the deployment. So although it ends as failed, there may be resources created which we need to eventually delete. Ideally, we would like to have the stop job run in any case (both successful and unsuccessful deploy job) even if there was no prior successful job, but that is a bit out of the current spec for the feature. But if it works correctly as it is designed now, that is also a step forward for us.
>
> We are using this deploy for QA environments. They should be available for 2 days maximum; the developers can test their stuff there, and they can also stop the environments manually, but in case they forget, we need them to automatically clean up after 2 days.
>
> So in both cases, whether it deployed correctly or not, we need them to stop in 2 days.
>
> In case it failed to deploy for whatever reason, we don't want to clean up immediately but give them a chance to deploy again using the same job after some time (maybe there were temporarily not enough resources, or some other error). The subsequent redeploy (without cleaning) will take a shorter time because some of the resources might have been created correctly. We also want them to be able to investigate the "wrongly" deployed env.
From the second point above, `when: on_failure` on the stop job is not sufficient, because the users need the failing environment to stay up for a period of time in order to investigate the cause of the failure.
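For context, the most obvious tweak would be to mark the stop job `when: on_failure`, as sketched below. This is not a viable solution here, because it tears the environment down as soon as the deploy fails, leaving no 2-day investigation window:

```yaml
# NOT a viable solution for this use case: the teardown runs immediately
# when `deploy` fails, so there is no window to investigate the broken
# environment before it is cleaned up.
stop:
  stage: stop
  script:
    - echo "Tearing down Review Environment..."
  when: on_failure
  environment:
    name: production
    action: stop
```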
Current and expected behavior of CI Pipeline
This is in relation to the problem above.
Scenario

There are 2 jobs, `deploy` and `stop`:

- `deploy` spins up an Environment where an Application is running
- `stop` tears down that Environment

Simple `.gitlab-ci.yml` illustration:
```yaml
deploy:
  stage: deploy
  script:
    - echo "Spinning up Review Environment..."
  environment:
    name: production
    on_stop: stop # refers to the stop job below
    auto_stop_in: 2 days

stop:
  stage: stop
  script:
    - echo "Tearing down Review Environment..."
  when: manual # the job can still be run manually
  environment:
    name: production
    action: stop
```
Current behavior

- The `deploy` job creates many different resources, not just the Application being deployed (e.g. Secrets, Configuration, etc.)
- Using the `auto_stop_in` configuration, the `stop` job is expected to run 2 days after the `deploy` job
- ✅ If the `deploy` job is successful, the `stop` job runs after 2 days as expected
- ❌ If the `deploy` job fails, the `stop` job does not run at all
Desired behavior

- The `deploy` job creates many different resources, not just the Application being deployed (e.g. Secrets, Configuration, etc.)
- Using the `auto_stop_in` configuration, the `stop` job is expected to run 2 days after the `deploy` job
- ✅ If the `deploy` job is successful, the `stop` job runs after 2 days
- ✅ If the `deploy` job fails, the `stop` job runs after 2 days
  - In this case, the user still needs the `stop` job to run to clean up all "dangling" resources created during the failed deploy
  - Even if the `deploy` job fails, the user prefers to have those 2 days of delay before the `stop` job is run. This is because it's a Review Environment, and they would like to have a chance to investigate what was failing in the deploy
- In conclusion, in either a successful or failed deploy, they need the `stop` job to run 2 days after the `deploy` job
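Until the pipeline behavior changes, one possible interim workaround is a scheduled pipeline that stops the environment through the Environments API, independent of how the last deploy ended. This is only a sketch: `$API_TOKEN` and `$ENV_ID` are hypothetical variables (an access token with API scope and the environment's numeric ID) that would have to be provided, e.g. via CI/CD settings.

```yaml
# Workaround sketch, not the proposed fix: a pipeline schedule
# (e.g. running daily) stops the environment via the API, regardless
# of the last deploy job's status.
# $API_TOKEN and $ENV_ID are hypothetical, user-provided variables;
# $CI_API_V4_URL and $CI_PROJECT_ID are predefined GitLab variables.
scheduled_cleanup:
  stage: stop
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - >
      curl --fail --request POST
      --header "PRIVATE-TOKEN: $API_TOKEN"
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/environments/$ENV_ID/stop"
```

This relies on the `POST /projects/:id/environments/:environment_id/stop` endpoint of the Environments API and does not replicate `auto_stop_in` semantics exactly; it trades the "2 days after the deploy" timing for a fixed schedule.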
Outcome
Please see the investigation summary here: #429616 (comment 1691931533)
Solution proof of concept: POC: Extend Environment stop actions to include... (!139612 - closed)