Proposal for continuous omnibus deployments to the canary stage
The Road to CD takes the following into account:
- Providing developers with production-like test environments using Review Apps
- Production-like database on demand
- Faster testing and QA
- Controlling the impact of features
This issue tries to address the following pain points that I observed as an RM:
- Delay between merging a fix to master and seeing it on staging
This proposes a pipeline that takes a nightly omnibus build from master and deploys it to staging -> canary, daily.
- Wasted toil for RCs that never go anywhere because of critical problems
The delivery team is reducing this considerably, but monitoring and tagging RCs are still done periodically; they should be automatic and daily if possible. This proposal is essentially the same as creating an RC every day, except that we use a nightly build instead.
- A very large set of changes that hit us at once in the first RC
This proposes a pipeline to canary that never stops, which means that we are dogfooding daily snapshots of master on api/git/web. It's relatively safe since canary doesn't receive any production traffic (unless the community opts in).
Delivery pipeline with omnibus that deploys continuously to canary
This is a draft proposal of what a continuous pipeline to canary might look like:
- Every night we build an EE package. This assumes that we have continuous updates to the ee repo (https://gitlab.com/gitlab-org/release/framework/issues/49)
- The nightly EE package initiates a pipeline: staging -> QA -> canary (api/git/web without prod traffic) -> Canary QA
- At any time we can decide to promote the nightly build to production. If there are critical problems we roll it back or apply post-deployment patches
- 1 week before the 22nd we create the stable branch and promote to -> production onebox* -> production. Hopefully, the release has already had a lot of time on canary and maybe some time on GitLab.com.
- At this point we deploy to -> production onebox* -> production as needed for critical security updates, regression fixes, with post-deployment patches, or an RC from stable.
- Meanwhile we continue to deploy master to canary
- On the 22nd what is running on GitLab.com becomes the official release.
*See https://gitlab.com/gitlab-org/release/framework/issues/93 for a proposed onebox production stage
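The nightly stage ordering above can be sketched as a simple gate: a build moves to the next stage only if the previous one passed, and it is a candidate for production only once every stage has passed. This is a minimal sketch, not the delivery team's tooling. The stage names come from this proposal; `run_stage` is a hypothetical callback standing in for whatever actually deploys or runs QA.

```python
from typing import Callable, List

# Stage names are taken from the proposal; everything else here is illustrative.
NIGHTLY_STAGES = ["staging", "QA", "canary", "Canary QA"]


def run_nightly_pipeline(package: str, run_stage: Callable[[str, str], bool]) -> List[str]:
    """Run the stages in order and stop at the first failure.

    `run_stage(package, stage)` is a hypothetical hook that returns True
    when the stage succeeded. Returns the list of stages that passed.
    """
    passed: List[str] = []
    for stage in NIGHTLY_STAGES:
        if not run_stage(package, stage):
            break  # a failed stage blocks everything after it
        passed.append(stage)
    return passed


def eligible_for_production(passed: List[str]) -> bool:
    """A nightly build may be promoted only after every stage has passed."""
    return passed == NIGHTLY_STAGES
```

With this shape, a nightly build that fails on canary never becomes a promotion candidate, which is the property the proposal relies on when it says production only takes what made it through the pipeline.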
CICD omnibus deployment with freeze on the 7th
- From the 23rd -> 7th: 15 days of master on production, the first deploy has 16 days of new changes
  - [ MR ] -> [ merge to master ] -> [ create omnibus package for the commit ] -> [ staging ] -> [ canary ] -> [ production ]
- From the 7th -> 22nd, create stable branch: 16 days of freeze where we only take deltas
  - [ MR ] -> [ merge to master ] -> [ auto merge to stable with picks ] -> [ create omnibus package for the commit ] -> [ staging ] -> [ canary ] -> [ prod ]
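The day counts for this schedule can be checked with a small sketch. It assumes a 31-day month and treats both the 7th and the 22nd as freeze days (my reading of the ranges above, not something the proposal states explicitly). The function names are made up for illustration.

```python
FREEZE_START = 7   # day the stable branch is created (from the proposal)
RELEASE_DAY = 22   # official release day (from the proposal)


def production_source(day_of_month: int) -> str:
    """Return which branch feeds production on a given day of the month.

    Assumption: the freeze covers FREEZE_START through RELEASE_DAY inclusive.
    """
    if not 1 <= day_of_month <= 31:
        raise ValueError("day_of_month must be between 1 and 31")
    if FREEZE_START <= day_of_month <= RELEASE_DAY:
        return "stable"
    return "master"


def count_days(source: str, days_in_month: int = 31) -> int:
    """Count the days of the month on which production is fed from `source`."""
    return sum(
        1 for day in range(1, days_in_month + 1) if production_source(day) == source
    )
```

Under these assumptions `count_days("master")` gives 15 and `count_days("stable")` gives 16, which matches the 15 days of master and 16 days of freeze listed above.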
CICD omnibus deployment with shorter freeze window and continuous deployments of master to canary
- Every day of the month, 1st -> 31st: 31 days of master going to canary if pipelines pass
  - [ MR ] -> [ merge to master ] -> [ create omnibus package for the commit ] -> [ staging ] -> [ canary ]
- From the 23rd -> 17th: 25 days of master going to production if pipelines pass, the first deploy has 6 days of new changes and those changes were already seen on canary
  - [ MR ] -> [ merge to master ] -> [ create omnibus package for the commit ] -> [ staging ] -> [ canary ] -> [ promote to prod or rollback if there are issues ]
- From the 17th -> 22nd, create stable branch: 6 days of freeze where we only take deltas
  - [ MR ] -> [ merge to stable with picks ] -> [ create omnibus package for the commit ] -> [ staging onebox ] -> [ production onebox ] -> [ production ]
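To make the shorter-freeze schedule concrete, here is a rough sketch of which pipelines would run on a given day. It assumes a 31-day month and a freeze that covers the 17th through the 22nd inclusive. The stage names come from the flows above; the function names and the list-of-lists representation are purely illustrative.

```python
from typing import List

FREEZE_START = 17  # day the stable branch is created (from the proposal)
RELEASE_DAY = 22   # official release day (from the proposal)


def pipelines_for(day: int) -> List[List[str]]:
    """Return the pipelines that run on `day`, each as [source, stage, ...].

    Assumption: the freeze covers FREEZE_START through RELEASE_DAY inclusive.
    """
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    if FREEZE_START <= day <= RELEASE_DAY:
        return [
            # master keeps flowing to canary during the freeze
            ["master", "staging", "canary"],
            # picks on stable go through the onebox stages to production
            ["stable", "staging onebox", "production onebox", "production"],
        ]
    # outside the freeze, master can be promoted all the way to production
    return [["master", "staging", "canary", "production"]]


def master_reaches_production(day: int) -> bool:
    """True when a master build can be promoted to production on `day`."""
    return any(p[0] == "master" and p[-1] == "production" for p in pipelines_for(day))
```

Counting `master_reaches_production(day)` over days 1 to 31 gives 25, with the remaining 6 days in the freeze, and every day has a master pipeline that reaches canary.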
Concerns
- Will shortening the freeze window add instability?
The freeze window is shorter, but GitLab.com will only take what has made it through the pipeline to canary, which requires it to go through staging and QA. Adding the onebox stage also allows us to monitor for critical issues before exposing them to the rest of the fleet, which will make rolling back much easier.
- Should we be testing patches, regression fixes, security updates on staging/canary?
Not sure how big of an issue this is. There is a big advantage to keeping master deploying to canary, and the idea is that by the time we cut the first RC (1 week before the 22nd) we only take extremely critical, targeted bug or security fixes. We utilize the new onebox stages for these targeted patch releases.
- Do we lose the benefit of cutting public RCs earlier in the month?
We may; I'm not sure what the impact of this is. We could create a longer freeze window, but I think there is some value in shortening it so that fewer changes build up without being deployed to GitLab.com.
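On the first concern, the role of the onebox stage can be sketched as a gate in front of the fleet. This is only an illustration of the idea. The `deploy`, `is_healthy` and `rollback` callbacks are hypothetical, and what counts as "healthy" (error rates, alerts, manual sign-off) is left open.

```python
from typing import Callable


def deploy_with_onebox_gate(
    package: str,
    deploy: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> str:
    """Deploy to the onebox first; touch the fleet only if the onebox stays healthy.

    All three callbacks are hypothetical hooks, not existing tooling.
    """
    deploy(package, "production onebox")
    if not is_healthy("production onebox"):
        # Only one node runs the new package, so only one node is rolled back.
        rollback("production onebox")
        return "rolled back onebox"
    deploy(package, "production")
    return "promoted to production"
```

The point of the sketch is the ordering: a critical problem caught at the onebox means rolling back a single node instead of the whole fleet.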