Proposal for continuous omnibus deployments to the canary stage

The Road to CD takes the following into account:

Providing developers with production-like test environments using Review Apps
Production-like database on demand
Faster testing and QA
Controlling the impact of features

This issue tries to address the following pain points that I observed as an RM:

Delay between merging a fix to master and seeing it on staging

This proposes a pipeline that takes a nightly omnibus build from master and deploys it to staging -> canary every day.

Wasted toil for RCs that never go anywhere because of critical problems

The delivery team is reducing this considerably, but monitoring and tagging RCs are still done on a periodic basis; this should be automatic and daily if possible. This proposal is essentially the same as creating an RC every day, except that we use a nightly build instead.

A very large set of changes that hit us at once in the first RC

This proposes a pipeline to canary that never stops, which means that we are dogfooding daily snapshots of master on api/git/web. It's relatively safe because canary doesn't receive any production traffic (unless the community opts in).

Delivery pipeline with omnibus that deploys continuously to canary

This is a draft proposal of what a continuous pipeline to canary might look like

Every night we build an EE package. This assumes that we have continuous updates to the ee repo (https://gitlab.com/gitlab-org/release/framework/issues/49)
The nightly EE package initiates a pipeline: staging -> QA -> canary (api/git/web without prod traffic) -> Canary QA
At any time we can decide to promote the nightly build to production; if there are critical problems we roll it back or apply post-deployment patches
1 week before the 22nd we create the stable branch and promote to -> production onebox* -> production. Hopefully, the release has already had a lot of time on canary and maybe some time on GitLab.com.
At this point we deploy to -> production onebox* -> production as needed for critical security updates and regression fixes, using post-deployment patches or an RC from stable.
Meanwhile we continue to deploy master to canary.
On the 22nd what is running on GitLab.com becomes the official release.

*See https://gitlab.com/gitlab-org/release/framework/issues/93 for a proposed onebox production stage
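
The nightly flow above could be sketched as an ordered list of stages that halts at the first failure. This is only an illustration of the idea, not existing tooling: the stage names come from the steps above, while `run_pipeline`, `eligible_for_production`, and the `run_stage` callback are hypothetical names. It also assumes that a build is only a candidate for promotion once it has cleared every stage.

```python
from typing import Callable, List

# Stage order taken from the proposal: staging -> QA -> canary -> Canary QA.
NIGHTLY_STAGES: List[str] = ["staging", "QA", "canary", "Canary QA"]


def run_pipeline(package: str, run_stage: Callable[[str, str], bool]) -> List[str]:
    """Run each stage in order and return the stages that passed.

    `run_stage(package, stage)` is a hypothetical callback that deploys or
    tests the package on one stage and returns True on success.
    """
    passed: List[str] = []
    for stage in NIGHTLY_STAGES:
        if not run_stage(package, stage):
            break  # a failure stops the package from moving any further
        passed.append(stage)
    return passed


def eligible_for_production(passed: List[str]) -> bool:
    """Assumption: only a build that cleared every stage may be promoted."""
    return passed == NIGHTLY_STAGES
```

With a callback that fails on canary, `run_pipeline` returns only `["staging", "QA"]`, so that nightly build would not be offered for promotion.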

CICD omnibus deployment with freeze on the 7th

From the 23rd -> 7th: 15 days of master on production; the first deploy has 16 days of new changes
- [ MR ] -> [ merge to master ] -> [ create omnibus package for the commit ] -> [ staging ] -> [ canary ] -> [ production ]
From the 7th -> 22nd, create stable branch: 16 days of freeze where we only take deltas
- [ MR ] -> [ merge to master ] -> [ auto merge to stable with picks ] -> [ create omnibus package for the commit ] -> [ staging ] -> [ canary ] -> [ prod ]
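
As a quick check of the day counts in this schedule, here is a small sketch. It assumes a 31-day month and that the freeze covers the 7th through the 22nd inclusive; `branch_feeding_production` is a hypothetical helper name used only for illustration.

```python
def branch_feeding_production(day: int) -> str:
    """Return which branch feeds production on a given day of the month."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    # Assumption: the freeze runs from the 7th through the 22nd inclusive.
    return "stable" if 7 <= day <= 22 else "master"


days = range(1, 32)  # assumes a 31-day month
freeze_days = sum(1 for d in days if branch_feeding_production(d) == "stable")
master_days = sum(1 for d in days if branch_feeding_production(d) == "master")
print(freeze_days, master_days)  # 16 15
```

Under those assumptions the counts match the 16 days of freeze and 15 days of master on production stated above; shorter months would reduce the master window accordingly.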

CICD omnibus deployment with shorter freeze window and continuous deployments of master to canary

Every day of the month, 1st -> 31st: 31 days of master going to canary if pipelines pass
- [ MR ] -> [ merge to master ] -> [ create omnibus package for the commit ] -> [ staging ] -> [ canary ]
From the 23rd -> 17th: 25 days of master going to production if pipelines pass; the first deploy has 6 days of new changes, and those changes were already seen on canary
- [ MR ] -> [ merge to master ] -> [ create omnibus package for the commit ] -> [ staging ] -> [ canary ] -> [ promote to prod or rollback if there are issues ]
From the 17th -> 22nd, create stable branch: 6 days of freeze where we only take deltas
- [ MR ] -> [ merge to stable with picks ] -> [ create omnibus package for the commit ] -> [ staging onebox ] -> [ production onebox ] -> [ production ]
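
The three windows above could be expressed as a per-day deployment plan. This is a sketch under stated assumptions: a 31-day month, a freeze covering the 17th through the 22nd inclusive, and a hypothetical function name `targets_for_day`. The stage names are the ones used in the lists above.

```python
from typing import Dict, List


def targets_for_day(day: int) -> Dict[str, List[str]]:
    """Map each source branch to the stages it deploys to on a given day."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    # Master goes to staging and canary every day of the month.
    plan: Dict[str, List[str]] = {"master": ["staging", "canary"]}
    # Assumption: the freeze runs from the 17th through the 22nd inclusive.
    if 17 <= day <= 22:
        # During the freeze, production only takes picks from stable,
        # via the onebox stages.
        plan["stable"] = ["staging onebox", "production onebox", "production"]
    else:
        # Outside the freeze, master can be promoted on to production.
        plan["master"].append("production")
    return plan


month = range(1, 32)  # assumes a 31-day month
prod_days = sum("production" in targets_for_day(d)["master"] for d in month)
freeze_days = sum("stable" in targets_for_day(d) for d in month)
print(prod_days, freeze_days)  # 25 6
```

Under those assumptions the counts line up with the 25 days of master going to production and the 6 days of freeze, while canary keeps receiving master on all 31 days.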

Concerns

Will essentially shortening the freeze window add instability?

The freeze window is shorter, but GitLab.com will only take what has made it through the pipeline to canary, which requires it to go through staging and QA. Adding the onebox stage also allows us to monitor for critical issues before exposing them to the rest of the fleet, which will make rolling back much easier.

Should we be testing patches, regression fixes, security updates on staging/canary?

I'm not sure how big an issue this is. There is a big advantage to keeping master deploying to canary, and the idea is that by the time we cut the first RC (1 week before the 22nd) we only take extremely critical, targeted bug or security fixes. We utilize the new onebox stages for these targeted patch releases.

Do we lose the benefit of cutting public RCs earlier in the month?

We may; I'm not sure what the impact of this would be. We could create a longer freeze window, but I think there is some value in shortening it so that fewer changes build up without being deployed to GitLab.com.

Edited Dec 14, 2018 by John Jarvis