start-review-app-pipeline got stuck on default branch

On 2023-03-23 we noticed that 35 pipelines were stuck in the start-review-app-pipeline job because of the resource_group lock on it: #381415 (comment 1316541202)

Today, 164 pipelines are stuck, as checked via: https://gitlab.com/gitlab-org/gitlab/-/pipelines?scope=all&source=schedule&status=waiting_for_resource

Back then, a flawed design could cause a deadlock, hence the issue #381415 (closed).

However, I believe we should have completely eliminated the deadlock after !111047 (merged), because now the only jobs that use resource_group are start-review-app-pipeline and a manual action, review-stop. I don't see us running review-stop manually anywhere, so I assume it never ran. start-review-app-pipeline should therefore be the only job that can potentially hold the lock, yet it is still stuck.

My first guess was that this is a bug in how GitLab handles resource_group: #381139 (closed). It turned out that this might have more to do with some pipelines running forever: #399214 (comment 1329303350)

We need to do something to resolve this. A few potential options:

  1. Cancel them, manually or in a script
  2. Stop using resource_group, and risk review apps stepping on each other on the default branch
  3. Use GDK to replace review app deployment
  4. Always deploy to a different environment on the default branch, so deployments never step on each other and resource_group is no longer needed
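
For option 1, a script could look roughly like the sketch below. It is only an illustration, not something we already have: the function names (`http_json`, `stuck_pipelines`, `cancel_stuck_pipelines`) are made up here, and it assumes a personal access token with `api` scope. It uses the pipelines REST API with the same filters as the link above (`source=schedule`, `status=waiting_for_resource`) and the pipeline cancel endpoint. I have not verified that cancelling these pipelines actually releases the lock, so it defaults to a dry run.

```python
import json
import urllib.parse
import urllib.request

API = "https://gitlab.com/api/v4"


def http_json(method, url, token):
    """Send one authenticated request and decode the JSON response."""
    req = urllib.request.Request(url, method=method, headers={"PRIVATE-TOKEN": token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def stuck_pipelines(project, token, request=http_json):
    """Yield scheduled pipelines in waiting_for_resource, one page at a time."""
    project_path = urllib.parse.quote(project, safe="")
    page = 1
    while True:
        query = urllib.parse.urlencode({
            "source": "schedule",
            "status": "waiting_for_resource",
            "per_page": 100,
            "page": page,
        })
        batch = request("GET", f"{API}/projects/{project_path}/pipelines?{query}", token)
        if not batch:
            return
        yield from batch
        page += 1


def cancel_stuck_pipelines(project, token, dry_run=True, request=http_json):
    """Return the IDs of stuck pipelines; cancel them unless dry_run is set."""
    project_path = urllib.parse.quote(project, safe="")
    # Collect every ID first so cancelling does not shift the pagination.
    ids = [pipeline["id"] for pipeline in stuck_pipelines(project, token, request)]
    if not dry_run:
        for pipeline_id in ids:
            url = f"{API}/projects/{project_path}/pipelines/{pipeline_id}/cancel"
            request("POST", url, token)
    return ids
```

Calling `cancel_stuck_pipelines("gitlab-org/gitlab", token)` would only list the affected pipeline IDs; passing `dry_run=False` would send the cancel requests. The `request` parameter exists only so the logic can be exercised without hitting the API.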

I am creating threads to discuss each option.

Edited by Lin Jen-Shin