Environments are stuck in `stopping` state if the on_stop job fails

Reason to close this issue

This issue has been closed in favor of these two new refined issues:

Please see the refinement discussion starting at #363197 (comment 1557491287), and follow up on the aforementioned issues.

Summary

A customer noticed that on_stop jobs are not triggered when a previous on_stop job has failed. This seems to be because the start of the on_stop job puts the environment in a stopping state that persists if the on_stop job fails.

Steps to reproduce

  1. Create a new project with the following .gitlab-ci.yml on the main branch:

    create:
      image: registry.gitlab.com/gitlab-org/terraform-images/releases/terraform:1.5.1
      script:
        - echo deploy
      environment:
        name: feature/$CI_COMMIT_REF_SLUG
        on_stop: stop
      rules:
      - if: $CI_COMMIT_REF_NAME != "main"
    
    stop:
      image: registry.gitlab.com/gitlab-org/terraform-images/releases/terraform:1.5.1
      script:
        - echo stop
        - exit 1
      variables:
        GIT_STRATEGY: none
      environment:
        name: feature/$CI_COMMIT_REF_SLUG
        action: stop
      rules:
      - if: $CI_COMMIT_REF_NAME != "main"
        when: manual
        allow_failure: true
  2. Then:

    1. Create a new branch, e.g. test
    2. Wait for the create job to create a new environment.
    3. Delete the branch to trigger the stop job. The stop job fails and the environment is stuck in the stopping state.
    4. Create the test branch again.
    5. Wait for the create job to run. The environment will not be available again, because it is still stuck in stopping.
    6. Delete the branch. The stop job is not triggered.

Current workaround

Add an after_script to the stop job that manually force-stops the environment.
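As a rough sketch of what such an after_script could run: the functions below look up the environment by name and then call the Stop Environment API with `force=true`. `CI_API_V4_URL` and `CI_PROJECT_ID` are predefined CI/CD variables; `STOP_ENV_TOKEN` is a hypothetical CI/CD variable (not from this issue) holding a token with API access, and the sketch assumes `curl` and `jq` are available in the job image, which may not hold for the image used above.

```shell
#!/usr/bin/env bash
# Sketch: force-stop an environment by name, e.g. from a stop job's after_script.
# Assumptions (not from the issue): STOP_ENV_TOKEN is a hypothetical CI/CD
# variable with API access; curl and jq exist in the job image.

# Print the ID of the environment with the given name, or nothing if not found.
environment_id() {
  local name="$1"
  curl --silent --fail --get \
    --header "PRIVATE-TOKEN: ${STOP_ENV_TOKEN}" \
    --data-urlencode "name=${name}" \
    "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/environments" |
    jq -r '.[0].id // empty'
}

# Force-stop the environment with the given ID without running on_stop actions.
force_stop_environment() {
  local id="$1"
  curl --silent --fail --request POST \
    --header "PRIVATE-TOKEN: ${STOP_ENV_TOKEN}" \
    "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/environments/${id}/stop?force=true"
}

# Look up the environment by name and force-stop it if it exists.
force_stop_by_name() {
  local id
  id="$(environment_id "$1")"
  if [ -n "$id" ]; then
    force_stop_environment "$id"
  fi
}
```

In the stop job this would be invoked as `force_stop_by_name "$CI_ENVIRONMENT_NAME"`. Because after_script runs in its own shell, the function definitions have to be part of the after_script itself. Note that force-stopping unconditionally also hides a genuinely failed teardown, so this trades correctness of cleanup for an environment that does not get stuck.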

What is the current bug behavior?

on_stop jobs are not triggered when a previous on_stop job has failed.

What is the expected correct behavior?

Original description

The following discussion from !86478 (merged) should be addressed:

  • @shinya.maeda started a discussion: (+1 comment)

    @acook.gitlab I think some environments would be stuck at stopping states forever due to stop action failures. We gotta think how to handle such environments. Do we have an issue for discussing this? If no, you can resolve this discussion by creating a new issue.


    For now, I think we should update the troubleshooting section about how to identify them and how to fix it. For example,

    diff --git a/doc/ci/environments/index.md b/doc/ci/environments/index.md
    index 7e1ef5efaa5..66dd4a7c966 100644
    --- a/doc/ci/environments/index.md
    +++ b/doc/ci/environments/index.md
    @@ -1052,3 +1052,24 @@ Project.find_by_full_path(<your-project-full-path>).deployments.where(archived:
     Please note that GitLab could drop this support in the future for the performance concern.
     You can open an issue in [GitLab Issue Tracker](https://gitlab.com/gitlab-org/gitlab/-/issues/new)
     to discuss the behavior of this feature.
    +
    +### Environments can't be stopped due to job failure or stopped environments are not shown in UI
    +
    +Starting from GitLab 15.1, [environment stop](#stop-an-environment) feature requires the corresponding `on_stop` job
    +to succeed. If the job failed, the environment wouldn't appear on the "Stopped" tab in the environment page.
    +
    +To identify such environments, you can use [List Environment API](https://docs.gitlab.com/ee/api/environments.html#list-environments),
    +
    +e.g.
    +
    +```
    +curl --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/projects/1/environments?states=stopping"
    +```
    +
    +and you can force-stop these environments by using the `force` option in [Stop Environment API](https://docs.gitlab.com/ee/api/environments.html#stop-an-environment).
    +
    +e.g.
    +
    +```
    +curl --request POST --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/projects/1/environments/1/stop?force=true"
    +```

    This would be helpful for support-engineers and some users. This can be a follow-up MR.

Possible workarounds

From #363197 (comment 1279225900)

Just wanted to confirm that running a functional stop job (that executes and exits successfully) works for clearing these stuck jobs.

Alternatives are:

  • Calling the Stop Environment API endpoint passing the value true for the force parameter
  • Deleting the environment manually via the Rails console (not recommended)
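
For projects with several stuck environments, the API-based alternative can be scripted. The following is a minimal sketch, not an official tool: it lists environments with `states=stopping` and force-stops each one. `GITLAB_URL`, `PROJECT_ID`, and `GITLAB_TOKEN` are hypothetical variables the caller must set, and `curl` and `jq` are assumed to be installed.

```shell
#!/usr/bin/env bash
# Sketch: force-stop every environment of a project that is stuck in "stopping".
# Assumptions (not from the issue): GITLAB_URL, PROJECT_ID and GITLAB_TOKEN are
# hypothetical variables set by the caller; curl and jq are installed.

# Print the IDs of environments in the "stopping" state, one per line.
# Only the first page (up to 100 results) is read; pagination is not handled.
stopping_environment_ids() {
  curl --silent --fail \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    "${GITLAB_URL}/api/v4/projects/${PROJECT_ID}/environments?states=stopping&per_page=100" |
    jq -r '.[].id'
}

# Force-stop each stuck environment without running its on_stop action.
force_stop_stuck_environments() {
  local id
  stopping_environment_ids | while read -r id; do
    echo "Force-stopping environment ${id}"
    curl --silent --fail --request POST \
      --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
      "${GITLAB_URL}/api/v4/projects/${PROJECT_ID}/environments/${id}/stop?force=true" \
      >/dev/null
  done
}
```

Running `stopping_environment_ids` on its own first gives a dry-run list of what `force_stop_stuck_environments` would touch.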

Proposal

TBD in #363197 (comment 1026918330)

Edited by Timo Furrer