Pipeline Race Condition could change correct deploy order

Problem to solve

DISCLAIMER: This is a common problem in CI systems, but surprisingly I couldn't find an existing issue for it. I'm sorry if this is a duplicate.

Let's say a project's CI is set up to run build -> test -> deploy whenever master changes.

Pipeline 1 (P1) starts at t0. Pipeline 2 (P2) starts at t0 + 10s.

Suppose that, for whatever reason, P2 completes its deploy before P1. When P1 then finishes, it overwrites the P2 deploy, even though P2's code should be the one that is live, since it was merged after the code from P1.
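To make the race concrete, here is a minimal sketch that models each deploy as "last writer wins". The `Pipeline` class, the commit labels, and the durations are illustrative assumptions; only the 10s offset between P1 and P2 comes from the scenario above.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    name: str
    commit: str    # hypothetical label for the code being deployed
    start: int     # seconds after t0
    duration: int  # assumed seconds until the deploy job finishes


def final_deploy(pipelines):
    """Return the pipeline whose deploy finishes last (last writer wins)."""
    return max(pipelines, key=lambda p: p.start + p.duration)


# P1 starts first but is slow; P2 starts 10s later and is fast.
p1 = Pipeline("P1", "older-commit", start=0, duration=120)
p2 = Pipeline("P2", "newer-commit", start=10, duration=30)

live = final_deploy([p1, p2])
print(live.name, live.commit)  # P1 older-commit
```

P2 finishes at t0 + 40s and P1 at t0 + 120s, so the environment ends up running the older code.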

Intended users

Release Manager, Developer, DevOps Engineer

Further details

We want to guarantee that deployments to an environment happen in the same order as the changes that triggered them.

Proposal

Once the project is configured to use our CI Environments, we could use the environment name to avoid race conditions between pipelines. E.g., P1 should check whether any P2 is running for this specific environment. A Redis lock could be used to allow only one pipeline per environment to run at a time. Alternatively, we could allow only one deploy stage at a time for a given environment.

With either approach, P1 could cancel itself or its deploy job if P2 exists, or P2 could cancel P1 if P1 has not yet reached its deploy stage. If P1 has already reached its deploy stage, it may be better for P2 to wait for P1 to finish, since canceling an ongoing deploy might be risky.
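The per-environment lock could look roughly like the sketch below. It uses an in-memory dict as a stand-in for Redis, and the class and method names (`EnvironmentLocks`, `acquire`, `release`) are hypothetical, not an existing API. A real implementation would need an atomic set-if-absent operation and an expiry so that a crashed pipeline cannot hold the lock forever.

```python
class EnvironmentLocks:
    """In-memory stand-in for a shared lock store such as Redis."""

    def __init__(self):
        self._holders = {}  # environment name -> id of the pipeline holding it

    def acquire(self, environment, pipeline_id):
        """Take the lock if it is free; mimics a set-if-absent operation."""
        if environment in self._holders:
            return False
        self._holders[environment] = pipeline_id
        return True

    def release(self, environment, pipeline_id):
        """Release the lock only if this pipeline currently holds it."""
        if self._holders.get(environment) == pipeline_id:
            del self._holders[environment]
            return True
        return False


locks = EnvironmentLocks()
first = locks.acquire("production", 1)    # P1 takes the lock
second = locks.acquire("production", 2)   # P2 is refused while P1 holds it
other = locks.acquire("staging", 2)       # a different environment is free
locks.release("production", 1)
retry = locks.acquire("production", 2)    # P2 succeeds after P1 releases
print(first, second, other, retry)  # True False True True
```

Note that a lock on its own only serializes deploys. It does not stop P1 from deploying after P2, which is why it would have to be combined with the cancel-or-wait rule described above.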

Discussions around this idea:

Is it better to run this check at the start of the P2 pipeline and look for an existing P1? Or to run it right before P1's deploy stage begins and look for an existing P2?
Can we cancel the older pipeline/deploy if we find that a newer pipeline is already in progress?
Another approach could be for the newer pipeline to wait for the older one to complete. But this seems risky and perhaps not what people would prefer? 🤔
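One possible answer to these questions, following the rule sketched in the proposal (cancel the older pipeline unless it is already deploying), could be expressed like this. The `Action` enum, the `resolve` function, and the stage names are assumptions made for illustration only.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"            # no older pipeline on this environment
    CANCEL_OLDER = "cancel_older"  # older pipeline has not started deploying
    NEWER_WAITS = "newer_waits"    # older pipeline is deploying; do not interrupt


def resolve(older_stage, deploy_stage="deploy"):
    """Decide what a newer pipeline does about an older one on the same environment.

    older_stage is the stage the older pipeline is currently in,
    or None if no older pipeline is running.
    """
    if older_stage is None:
        return Action.PROCEED
    if older_stage == deploy_stage:
        # Canceling an ongoing deploy is considered risky, so wait instead.
        return Action.NEWER_WAITS
    return Action.CANCEL_OLDER


print(resolve(None).value)      # proceed
print(resolve("test").value)    # cancel_older
print(resolve("deploy").value)  # newer_waits
```

Whether this check runs when P2 starts or right before a deploy stage begins is left open here; the function only captures the decision itself.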

Permissions and Security

Unknown

Documentation

Depending on the solution, we should document this on the CI Environments or CI/CD Pipelines page.

Testing

If we cancel a whole pipeline in favor of another, we might also cancel user scripts that were expected to run before the deploy. This might surprise the user.

If we take the approach where the newer pipeline waits for the older one to complete, we need to watch out for long queues of pipelines building up in case one gets stuck. It also seems that, in general, one would rather have the older pipeline canceled right away. But I might be missing some use cases here.

What does success look like, and how can we measure that?

Create a template project with a .gitlab-ci.yml like the following:

stages:
  - test
  - build
  - deploy

test:
  stage: test
  script: sleep 100 && echo "Running tests"

build:
  stage: build
  script: echo "Building the app"

deploy_:
  stage: deploy
  script:
    - echo "Deploy to production server"
  environment:
    name: production
    url: https://staging.example.com
  only:
  - master

Create one MR (P2) that simply changes the test job in .gitlab-ci.yml to reduce the sleep time:

test:
  stage: test
  script: sleep 10 && echo "Running tests"

Start a pipeline (P1) from master.
Immediately merge the MR so that P2 gets triggered.
Verify that P1 is cancelled and never runs its deploy stage.
