Pipeline Race Condition could change correct deploy order
Problem to solve
DISCLAIMER: This is a common problem in CIs, but surprisingly I couldn't find the exact issue. So I'm sorry if it's a duplicate.
Let's say a project's CI is set up to build -> test -> deploy whenever master is changed.
Pipeline 1 (P1) starts at t0. Pipeline 2 (P2) starts at (t0 + 10s)
Let's say that, for whatever reason, P2 completes its deploy before P1. Then, when P1 finishes, it will overwrite the P2 deploy, even though P2 should be the live code since it was merged after the code from P1.
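To make the race concrete, here is a minimal sketch of the timeline above. The `Pipeline` class and the finish times (120s and 40s) are made up for illustration; only the 10s start offset comes from the description.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    name: str
    started_at: int      # seconds after t0
    deploy_done_at: int  # seconds after t0 when the deploy finishes (hypothetical)


def live_after_all_deploys(pipelines):
    """The deploy that finishes last is the one that ends up live."""
    return max(pipelines, key=lambda p: p.deploy_done_at)


def newest(pipelines):
    """The pipeline started last carries the most recently merged code."""
    return max(pipelines, key=lambda p: p.started_at)


p1 = Pipeline("P1", started_at=0, deploy_done_at=120)
p2 = Pipeline("P2", started_at=10, deploy_done_at=40)

live = live_after_all_deploys([p1, p2])
expected = newest([p1, p2])
print(f"live={live.name} expected={expected.name} stale={live is not expected}")
# prints: live=P1 expected=P2 stale=True
```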
Intended users
Release Manager, Developer, DevOps Engineer
Further details
We want to guarantee that deployments to an environment happen in the same order in which their changes were merged.
Proposal
Once the project is configured to use our CI Environments, we could check the environment name to avoid race conditions between pipelines. E.g., P1 should check whether any P2 is running for this specific environment. A Redis lock could be used to allow only one pipeline per environment to run at a time. Alternatively, we could allow only one deploy stage at a time for a given environment. P1 could then cancel itself or its deploy job if P2 exists. Otherwise, P2 could cancel P1 if P1 has not yet reached its deploy stage. If P1's deploy stage has already started, it may be better for P2 to wait for P1 to finish, since canceling an ongoing deploy might be risky.
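A rough sketch of how the check right before the deploy stage might look. This is only one possible reading of the proposal: it uses the "older pipeline cancels itself" variant, it assumes pipeline ids increase with merge order, and the names `Pipeline`, `Action`, and `action_before_deploy` are hypothetical. The Redis lock itself is not modeled here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    PROCEED = "proceed"
    CANCEL_SELF = "cancel_self"
    WAIT = "wait"


@dataclass
class Pipeline:
    id: int                  # assumption: a higher id means merged later
    environment: str
    deploying: bool = False  # True once the deploy stage has started


def action_before_deploy(me: Pipeline, known: List[Pipeline]) -> Action:
    """Decide what `me` should do right before its deploy stage starts.

    `known` should include pipelines that already deployed, so an older
    pipeline that arrives late still sees the newer one.
    """
    same_env = [p for p in known
                if p.environment == me.environment and p.id != me.id]
    # A newer pipeline targets the same environment: deploying now would be stale.
    if any(p.id > me.id for p in same_env):
        return Action.CANCEL_SELF
    # An older pipeline is mid-deploy: canceling it is risky, so wait for it.
    if any(p.id < me.id and p.deploying for p in same_env):
        return Action.WAIT
    return Action.PROCEED


p1 = Pipeline(id=1, environment="production")
p2 = Pipeline(id=2, environment="production")
print(action_before_deploy(p1, [p1, p2]))  # Action.CANCEL_SELF
print(action_before_deploy(p2, [p1, p2]))  # Action.PROCEED
```

With this variant P2 never has to cancel P1 actively: P1 drops out on its own when it reaches the check, and P2 only waits if P1 is already deploying.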
Discussions around this idea:
- Is it better to run this check when P2 starts and look for an existing P1, or right before P1's deploy stage begins and look for an existing P2?
- Can we cancel the older pipeline/deploy if we find the newer pipeline is already in progress?
- Another approach could be that the newer pipeline waits for the older one to complete. But this seems risky and perhaps not what people would prefer?
🤔
Permissions and Security
Unknown
Documentation
Depending on the solution, we should document this on the CI Environments or CI/CD Pipelines page.
Testing
If we cancel a whole pipeline in favor of the other, we might also cancel user scripts that were expected to run before the deploy happens. This might be unexpected to the user.
If we take the approach where the newer pipeline waits for the older one to complete, we need to be careful about long queues of pipelines building up in case one gets stuck. It also seems that, in general, one would rather have the older pipeline canceled right away. But I might be missing some use cases here.
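If we went with the waiting approach, one way to keep a stuck pipeline from blocking the queue forever might be a bounded wait. The sketch below is just an assumption on my side: the timeout, the polling, and the function name `wait_for_older_deploy` are not part of the proposal.

```python
import time
from typing import Callable


def wait_for_older_deploy(older_finished: Callable[[], bool],
                          timeout_s: float,
                          poll_s: float = 1.0,
                          clock: Callable[[], float] = time.monotonic,
                          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the older deploy finishes.

    Returns True if it finished within `timeout_s`, False if we gave up.
    `clock` and `sleep` are injectable so the logic can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while not older_finished():
        if clock() >= deadline:
            return False  # older pipeline looks stuck; caller decides what next
        sleep(poll_s)
    return True
```

What the newer pipeline does on a `False` result (fail, cancel the older one, or alert someone) is exactly the open question above.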
What does success look like, and how can we measure that?
- Create a template project with a .gitlab-ci.yml like the following:
stages:
  - test
  - build
  - deploy

test:
  stage: test
  script: sleep 100 && echo "Running tests"

build:
  stage: build
  script: echo "Building the app"

deploy_:
  stage: deploy
  script:
    - echo "Deploy to production server"
  environment:
    name: production
    url: https://staging.example.com
  only:
    - master
- Create one MR (P2) that simply changes the test job in .gitlab-ci.yml to reduce the sleep time:
test:
  stage: test
  script: sleep 10 && echo "Running tests"
- Start a pipeline (P1) from master.
- Immediately merge the MR so that (P2) gets triggered.
- Verify that (P1) is canceled and never runs its deploy stage.