Improve deployment project pipeline reliability with proper concurrency control
As seen in https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/ai-gateway/-/jobs/6990252534, the deploy
stage needs some form of concurrency control to avoid jobs running into lock-acquired errors and necessitating a manual retry.
We already use resource groups in the service projects; however, they only manage the downstream pipeline triggers. The deploy jobs still hit these errors when a second (or later) pipeline catches up to the first one.
Furthermore, if 3 MRs are merged in close succession, the deploy triggers are currently unordered. If 3 commits A, B, and C are made, where C is the latest, we'd expect the deployments to run in the order A, B, then C.
However, since the trigger jobs are unordered, commit C may be deployed before B, so the later deployment of B effectively "rolls back" commit C, i.e. A gets deployed, then C, then B.
Potential solution
- Use resource groups in the deployment project. https://docs.gitlab.com/ee/ci/resource_groups/
- We would also need to configure the process mode (https://docs.gitlab.com/ee/ci/resource_groups/#change-the-process-mode) to `oldest_first` to ensure that there is a FIFO order in deploying runway services. This avoids performing accidental rollbacks due to mis-ordered deployments.
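
As a rough sketch of the proposal, a deploy job in the deployment project's `.gitlab-ci.yml` could look like the fragment below. The job name, stage, script, and the `production` resource group key are placeholders chosen for illustration, not the actual Runway configuration.

```yaml
# Hypothetical deploy job; names and script are illustrative assumptions.
deploy:
  stage: deploy
  # Only one job holding this resource group runs at a time across all
  # pipelines in the project; other jobs wait instead of hitting the lock.
  resource_group: production
  script:
    - ./deploy.sh  # placeholder for the real deploy command
```

The process mode is not set in the YAML itself. Per the GitLab docs linked above, it is changed per resource group through the resource groups API (`PUT /projects/:id/resource_groups/:key` with a `process_mode` parameter). Of the documented modes, `oldest_first` is the one that processes waiting jobs in pipeline creation order, which is the FIFO behaviour described here.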