Environments are stuck in `stopping` state if the on_stop job fails
Reason to close this issue
This issue has been closed in favor of these two new refined issues:
- Recover and revert stuck-stopping environments (#425161 - closed) • Hunter Stewart • 16.5
- Recover and long running stuck-stopping environ... (#425162 - closed) • Hunter Stewart • 16.6
Please have a look at the refinement discussion from here #363197 (comment 1557491287) and follow up on the aforementioned issues.
Summary
A customer noticed that `on_stop` jobs are not triggered when a previous `on_stop` job has failed. This seems to be because the start of the `on_stop` job puts the environment in a `stopping` state that persists if the `on_stop` job fails.
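The reported behavior can be illustrated with a toy model. This is only a sketch of what the report describes, not GitLab's actual implementation; the class and method names (`ToyEnvironment`, `deploy`, `request_stop`) are hypothetical.

```python
class ToyEnvironment:
    """Toy model of the reported behavior; NOT GitLab's actual code."""

    def __init__(self):
        self.state = "available"
        self.stop_jobs_triggered = 0

    def deploy(self):
        # As reported: a new deployment does not bring an environment
        # that is stuck in `stopping` back to `available`.
        if self.state != "stopping":
            self.state = "available"

    def request_stop(self, job_succeeds):
        # As reported: no new on_stop job is triggered unless the
        # environment is `available`.
        if self.state != "available":
            return
        self.stop_jobs_triggered += 1
        # Starting the on_stop job moves the environment to `stopping`.
        self.state = "stopping"
        # Only a successful job completes the transition to `stopped`.
        if job_succeeds:
            self.state = "stopped"
```

In this model, one failed `request_stop(job_succeeds=False)` leaves the environment in `stopping`, and every later `deploy()` or `request_stop(...)` call has no effect, which matches the reproduction steps.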
Steps to reproduce
- Create a new project with the following `.gitlab-ci.yml` on the `main` branch:

      create:
        image: registry.gitlab.com/gitlab-org/terraform-images/releases/terraform:1.5.1
        script:
          - echo deploy
        environment:
          name: feature/$CI_COMMIT_REF_SLUG
          on_stop: stop
        rules:
          - if: $CI_COMMIT_REF_NAME != "main"

      stop:
        image: registry.gitlab.com/gitlab-org/terraform-images/releases/terraform:1.5.1
        script:
          - echo stop
          - exit 1
        variables:
          GIT_STRATEGY: none
        environment:
          name: feature/$CI_COMMIT_REF_SLUG
          action: stop
        rules:
          - if: $CI_COMMIT_REF_NAME != "main"
            when: manual
            allow_failure: true

- Then:
  - Create a new branch, e.g. `test`.
  - Wait for the `create` job to create a new environment.
  - Delete the branch to trigger the `stop` job. The `stop` job fails and the environment is stuck in the `stopping` state.
  - Create the `test` branch again.
  - Wait for the `create` job to run. The environment will not be available again, because it is still stuck in `stopping`.
  - Delete the branch. The `stop` job is not triggered.
Current workaround
Have an `after_script` in the `stop` job that manually force-stops the environment.
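A minimal sketch of what such a force-stop helper could look like, assuming it is called from the job's `after_script` with an access token that is allowed to use the Environments API. The function names are hypothetical. The `force` parameter of the Stop Environment endpoint comes from this issue; looking the environment up through the `name` filter of the List Environments endpoint is an assumption to verify against the API documentation.

```python
import json
import urllib.parse
import urllib.request


def lookup_request(api_url, project_id, env_name, token):
    """Build a GET request that looks up an environment by name.

    Assumption: the List Environments endpoint accepts a `name` filter.
    """
    query = urllib.parse.urlencode({"name": env_name})
    url = f"{api_url}/projects/{project_id}/environments?{query}"
    return urllib.request.Request(url, headers={"PRIVATE-TOKEN": token})


def force_stop_request(api_url, project_id, env_id, token):
    """Build a POST request to the Stop Environment endpoint with force=true."""
    url = f"{api_url}/projects/{project_id}/environments/{env_id}/stop"
    body = urllib.parse.urlencode({"force": "true"}).encode()
    return urllib.request.Request(
        url, data=body, headers={"PRIVATE-TOKEN": token}, method="POST"
    )


def force_stop(api_url, project_id, env_name, token):
    """Look up the environment and force-stop it.

    Returns the parsed API response, or None if no environment matched.
    """
    lookup = lookup_request(api_url, project_id, env_name, token)
    with urllib.request.urlopen(lookup) as resp:
        environments = json.load(resp)
    if not environments:
        return None
    stop = force_stop_request(api_url, project_id, environments[0]["id"], token)
    with urllib.request.urlopen(stop) as resp:
        return json.load(resp)
```

Inside a job, the arguments would typically come from the predefined variables `CI_API_V4_URL`, `CI_PROJECT_ID`, and `CI_ENVIRONMENT_NAME`, plus a token stored in a CI/CD variable of your choosing (for example a hypothetical `ENV_API_TOKEN`).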
What is the current bug behavior?
`on_stop` jobs are not triggered when a previous `on_stop` job has failed.
What is the expected correct behavior?
Original description
The following discussion from !86478 (merged) should be addressed:
- @shinya.maeda started a discussion: (+1 comment) @acook.gitlab I think some environments would be stuck at `stopping` states forever due to stop action failures. We gotta think how to handle such environments. Do we have an issue for discussing this? If no, you can resolve this discussion by creating a new issue.
For now, I think we should update the troubleshooting section about how to identify them and how to fix it. For example,
    diff --git a/doc/ci/environments/index.md b/doc/ci/environments/index.md
    index 7e1ef5efaa5..66dd4a7c966 100644
    --- a/doc/ci/environments/index.md
    +++ b/doc/ci/environments/index.md
    @@ -1052,3 +1052,24 @@ Project.find_by_full_path(<your-project-full-path>).deployments.where(archived:
     Please note that GitLab could drop this support in the future for the performance concern.
     You can open an issue in [GitLab Issue Tracker](https://gitlab.com/gitlab-org/gitlab/-/issues/new)
     to discuss the behavior of this feature.
    +
    +### Environments can't be stopped due to job failure or stopped environments are not shown in UI
    +
    +Starting from GitLab 15.1, [environment stop](#stop-an-environment) feature requires the corresponding `on_stop` job
    +to succeed. If the job failed, the environment wouldn't appear on the "Stopped" tab in the environment page.
    +
    +To identify such environments, you can use [List Environment API](https://docs.gitlab.com/ee/api/environments.html#list-environments),
    +
    +e.g.
    +
    +```
    +curl --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/projects/1/environments?states=stopping"
    +```
    +
    +and you can force-stop these environments by using the `force` option in [Stop Environment API](https://docs.gitlab.com/ee/api/environments.html#stop-an-environment).
    +
    +e.g.
    +
    +```
    +curl --request POST --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/projects/1/environments/1/stop"
    +```
This would be helpful for support-engineers and some users. This can be a follow-up MR.
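The identification step from the proposed troubleshooting text could be scripted roughly as follows. This is a sketch, not an official client: the function names are hypothetical, and it assumes the List Environments endpoint returns a JSON array of objects with `id`, `name`, and `state` fields.

```python
import json
import urllib.parse
import urllib.request


def stopping_request(api_url, project_id, token):
    """Build the List Environments request filtered to states=stopping."""
    query = urllib.parse.urlencode({"states": "stopping"})
    url = f"{api_url}/projects/{project_id}/environments?{query}"
    return urllib.request.Request(url, headers={"PRIVATE-TOKEN": token})


def stuck_environments(payload):
    """Return (id, name) pairs for environments whose state is `stopping`.

    `payload` is the JSON text returned by the API. The state is checked
    again here so the helper also works on an unfiltered listing.
    """
    return [
        (env["id"], env["name"])
        for env in json.loads(payload)
        if env.get("state") == "stopping"
    ]
```

To use it, send the request with `urllib.request.urlopen(stopping_request(...))`, read the response body, and pass it to `stuck_environments`. The resulting IDs are the ones to pass to the Stop Environment endpoint.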
Possible workarounds
From #363197 (comment 1279225900)
Just wanted to confirm that running a functional stop job (that executes and exits successfully) works for clearing these stuck jobs.
Alternatives are:
- Calling the Stop Environment API endpoint, passing the value `true` for the `force` parameter
- Deleting the environment manually via the Rails console (not recommended)
Proposal
TBD in #363197 (comment 1026918330)