Phased (Incremental) Rollouts MVC
Problem to solve
Modern software delivery practices support phased rollouts of changes. To reduce operational risk when rolling out new code, a best practice is to roll code changes out to a fleet in phases, pausing at defined breakpoints and optionally allowing metrics or other criteria to pause or abort the rollout.
Use Cases
- A/B-style deployments where a dark cluster (B) is provisioned, deployed to, and, once validated, failed over to. The previously live cluster (A), which is now dark, is then either updated or decommissioned. This can be viewed as an incremental rollout with two steps of 50% and a check at each stage.
- Canary-style deployments where arbitrary cutoff points are set at (for example) 10%, 20%, 40%, and 100% and checks are done at each stage.
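Both use cases can be read as a sequence of cumulative rollout targets. The following is a minimal Python sketch for illustration only, not part of the proposal: the names `validate_breakpoints`, `step_sizes`, `AB_STYLE`, and `CANARY_STYLE` are hypothetical, and the rule that targets must strictly increase and end at 100 is an assumption.

```python
from typing import List


def validate_breakpoints(breakpoints: List[int]) -> None:
    """Raise ValueError unless the targets strictly increase and end at 100.

    Assumption: breakpoints are cumulative percentages of the fleet.
    """
    if not breakpoints:
        raise ValueError("at least one breakpoint is required")
    previous = 0
    for target in breakpoints:
        if not previous < target <= 100:
            raise ValueError(f"invalid breakpoint {target} after {previous}")
        previous = target
    if breakpoints[-1] != 100:
        raise ValueError("the final breakpoint must be 100")


def step_sizes(breakpoints: List[int]) -> List[int]:
    """Return how much of the fleet each phase adds on top of the previous one."""
    validate_breakpoints(breakpoints)
    sizes = []
    previous = 0
    for target in breakpoints:
        sizes.append(target - previous)
        previous = target
    return sizes


# A/B-style: two steps of 50% each (hypothetical constant).
AB_STYLE = [50, 100]
# Canary-style: the example cutoff points from the use case above.
CANARY_STYLE = [10, 20, 40, 100]
```

Under these assumptions, `step_sizes(AB_STYLE)` yields two increments of 50, while `step_sizes(CANARY_STYLE)` yields increments of 10, 10, 20, and 60.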
Proposal
The MVC for this feature contains the following characteristics:
- It should be possible to set a sequence of breakpoints for a single pipeline job via the .gitlab-ci.yml file; these breakpoints define the checkpoints for the incremental rollout.
- Each checkpoint is associated with arbitrary steps, allowing users to implement incremental steps for whatever platform they are using (see the note below about the existing k8s canary feature).
- For the MVC we will only support "manual" checkpoints, where the rollout pauses indefinitely until a human intervenes. In the future we may add a "max wait" time or other criteria, so please keep this in mind in the design.
- Each checkpoint should set a rollout percentage target that defines how far the rollout will go. This should be exposed as a variable for automation to access.
- Upon reaching each checkpoint, we pause for a human to decide whether to continue or abort the deployment. This is the hook point where we can later provide more sophisticated monitoring or other acceptance criteria, so we should keep that future expansion in mind: a manual human check is just the first implementation of one kind of check that can be done here. Imagine, for example, a monitoring integration that verifies response time or other production KPIs.
- It must be clear on pipeline overview pages that a phased job is paused waiting for human intervention. It should also be clear that a pipeline job is of the "phased" type.
- It should not be implemented using project variables or require a complete redeploy to execute, as the k8s canary solution does. Control should be within the running job itself, as in https://gitlab.com/gitlab-org/gitlab-ee/issues/5416.
- At the end of a failed deploy, the MVC does nothing additional to restore the previous state, but this should be kept in mind as a future hook point for optional automated rollbacks of the incrementally updated portion as in https://gitlab.com/gitlab-org/gitlab-ee/issues/1661.
- This feature exists alongside our existing canary deployments implementation (https://docs.gitlab.com/ee/user/project/canary_deployments.html) but does not require k8s. Documentation will need to be updated to clarify when to use each.
- This feature matches the pricing tier for canary deployments (premium).
This feature is being kept separate from the k8s canary deployment feature because the k8s implementation has more insight into the pods in the cluster and can provide more detail on deployed/total pods and other information. As both features develop we will keep in mind the possibility of integrating them, but for now the use cases are different enough to keep them separate. I envision this as the canonical, flexible foundation for phased rollouts in the product, so I would expect k8s canary deployments to eventually align to this feature rather than the other way around.
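The control flow described in the proposal list can be sketched roughly as follows. This is a hedged Python illustration, not the proposed implementation or .gitlab-ci.yml syntax: `Checkpoint`, `Decision`, `run_phased_rollout`, `manual_gate`, and the variable name `ROLLOUT_PERCENT` are all hypothetical. It also assumes that no gate is needed after the final checkpoint, since nothing remains to continue to.

```python
import os
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, MutableMapping, Optional


class Decision(Enum):
    CONTINUE = "continue"
    ABORT = "abort"


@dataclass
class Checkpoint:
    rollout_percent: int           # how far the rollout goes at this checkpoint
    steps: Callable[[int], None]   # arbitrary, platform-specific deployment steps


# A gate decides whether to proceed past a checkpoint. The MVC only has a
# manual gate; a monitoring-based gate could later share the same signature.
Gate = Callable[[Checkpoint], Decision]


def manual_gate(checkpoint: Checkpoint) -> Decision:
    """Block until a human answers; there is no timeout, as in the MVC."""
    answer = input(f"At {checkpoint.rollout_percent}%. Continue? [y/N] ")
    return Decision.CONTINUE if answer.strip().lower() == "y" else Decision.ABORT


def run_phased_rollout(
    checkpoints: List[Checkpoint],
    gate: Gate,
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Run each checkpoint in order, pausing at the gate between phases."""
    env = os.environ if env is None else env
    for index, checkpoint in enumerate(checkpoints):
        # Expose the current target so the steps (or other automation) can read it.
        env["ROLLOUT_PERCENT"] = str(checkpoint.rollout_percent)
        checkpoint.steps(checkpoint.rollout_percent)
        is_last = index == len(checkpoints) - 1
        if not is_last and gate(checkpoint) is Decision.ABORT:
            # MVC behavior: nothing is done to restore the previous state.
            return "aborted"
    return "success"
```

Because the gate is passed in as a function, a future "max wait" or KPI-based check could replace `manual_gate` without changing the loop, and an optional rollback could hook in where the sketch returns `"aborted"`.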
What does success look like, and how can we measure that?
(If no way to measure success, link to an issue that will implement a way to measure this)