start-review-app-pipeline got stuck on default branch
On 2023-03-23 we noticed that 35 pipelines were stuck in the start-review-app-pipeline job because of the resource_group lock on it: #381415 (comment 1316541202)
Today, 164 pipelines are stuck, checked via: https://gitlab.com/gitlab-org/gitlab/-/pipelines?scope=all&source=schedule&status=waiting_for_resource
Back then, flaws in our design caused deadlocks, hence issue #381415 (closed).
However, I believe !111047 (merged) should have eliminated the deadlock completely, because the only jobs that now use resource_group are start-review-app-pipeline and the manual action review-stop. I haven't seen review-stop run manually anywhere, so I assume it never ran. That leaves start-review-app-pipeline as the only job that can hold the lock, yet pipelines are still stuck.
I first believed this was a bug in how GitLab handles resource_group: #381139 (closed). It turned out that this might be more related to some pipelines running forever: #399214 (comment 1329303350)
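To check the "running forever" theory, something like the sketch below could flag suspicious pipelines. It assumes pipeline records shaped like the ones the pipelines list API returns (with `status` and `created_at` fields). The helper names and the one-day threshold are my own choices for illustration, not anything we have in place.

```python
from datetime import datetime, timedelta, timezone


def parse_gitlab_time(value):
    """Parse an ISO 8601 timestamp such as '2023-03-23T08:00:00.000Z'."""
    # Older Python versions do not accept a trailing 'Z', so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def long_running(pipelines, now, max_age=timedelta(days=1)):
    """Return pipelines still 'running' that were created more than max_age ago.

    max_age is a hypothetical threshold; pick whatever counts as "forever".
    """
    return [
        p for p in pipelines
        if p["status"] == "running"
        and now - parse_gitlab_time(p["created_at"]) > max_age
    ]


if __name__ == "__main__":
    # Made-up sample records, only to show the expected input shape.
    sample = [
        {"id": 1, "status": "running", "created_at": "2023-03-23T08:00:00.000Z"},
        {"id": 2, "status": "running", "created_at": "2023-03-29T23:00:00.000Z"},
    ]
    now = datetime(2023, 3, 30, tzinfo=timezone.utc)
    print([p["id"] for p in long_running(sample, now)])
```

A pipeline flagged this way would be a candidate for holding the resource_group lock, but that still needs to be confirmed per pipeline.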
We need to do something to resolve this. A few potential options:
- Cancel them, manually or in a script
- Stop using resource_group, and risk review apps stepping on each other on the default branch
- Use GDK to replace review app deployment
- Always deploy to a different environment on the default branch, so deployments never step on each other and there is no need for resource_group
I am making threads to discuss each option.
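For the "cancel them in a script" option, here is a minimal sketch against the GitLab REST API. It assumes the pipelines list endpoint accepts the same `source=schedule` and `status=waiting_for_resource` filters as the UI link above, and that a personal access token with API scope is available. The function names and the `dry_run` flag are hypothetical, and it defaults to a dry run so nothing is cancelled by accident.

```python
import json
import urllib.parse
import urllib.request

API = "https://gitlab.com/api/v4"
# The project path must be URL-encoded when used as a project ID.
PROJECT = urllib.parse.quote("gitlab-org/gitlab", safe="")


def list_url(page):
    """URL listing scheduled pipelines that are waiting for a resource."""
    query = urllib.parse.urlencode({
        "source": "schedule",
        "status": "waiting_for_resource",
        "per_page": 100,
        "page": page,
    })
    return f"{API}/projects/{PROJECT}/pipelines?{query}"


def cancel_url(pipeline_id):
    """URL of the cancel endpoint for a single pipeline."""
    return f"{API}/projects/{PROJECT}/pipelines/{pipeline_id}/cancel"


def http_json(url, token, method="GET"):
    """Send one authenticated request and decode the JSON response."""
    request = urllib.request.Request(
        url, method=method, headers={"PRIVATE-TOKEN": token}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def cancel_stuck(token, call=http_json, dry_run=True):
    """Collect IDs of stuck pipelines; cancel them unless dry_run is set."""
    stuck = []
    page = 1
    # Collect every page first so cancelling does not shift the pagination.
    while True:
        batch = call(list_url(page), token)
        if not batch:
            break
        stuck.extend(p["id"] for p in batch)
        page += 1
    if not dry_run:
        for pipeline_id in stuck:
            call(cancel_url(pipeline_id), token, method="POST")
    return stuck
```

Calling `cancel_stuck(token)` only returns the IDs it would cancel; passing `dry_run=False` actually sends the cancel requests. The `call` parameter exists so the HTTP layer can be swapped out for testing.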