Allow pipelines to schedule timed jobs for incremental rollouts
Problem to Solve
To reduce operational risk when rolling out new code, a best practice is to roll out code changes to a fleet slowly and incrementally, pausing at certain breakpoints and optionally allowing metrics to pause or abort the rollout. Our current Kubernetes canary deployment is implemented only as a definable rollout percentage that can be set on a per-manual-run basis, which you can see below.
This is a v2 iteration of the existing k8s incremental rollouts feature (above), which exists in production today. It adds the ability to set an automatic timeout, after which the deployment continues to the next rollout percentage in the deployment phase. For the MVC, this will be a preset value of 5m for users choosing timed incremental rollouts; the end user can edit this value in the CI YML after generation.
The current idea for shipping this is to provide a generic function rather than one specific to k8s rollouts. It is generic in the sense that it is available for any type of workflow at the Job level, not the Environment level, and it is not k8s-specific (that is, it is not implemented inside the AutoDevOps shell script).
We will provide a way to enable timed incremental rollouts in AutoDevOps, in addition to the existing manual one, using the shared when: in 1hour feature.
This could be used for timed rollouts, but it is the user's responsibility to cancel the process. The rollout will be split into separate stages so that if it fails at 20% (for example), it does not continue forward.
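The staged behavior described above can be sketched as a small simulation. This is only an illustration, not the actual implementation: the function name `run_timed_rollout`, the `deploy` callback, and the percentage list are hypothetical, and the 300-second default simply mirrors the 5m wait mentioned in this issue.

```python
import time
from typing import Callable, Iterable, List


def run_timed_rollout(
    percentages: Iterable[int],
    deploy: Callable[[int], bool],
    wait_seconds: float = 300.0,  # assumption: mirrors the 5m default wait
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Run each rollout percentage as its own stage, waiting between stages.

    Returns the percentages that deployed successfully. The loop stops at
    the first failing stage, so later stages never run.
    """
    completed: List[int] = []
    for index, percent in enumerate(percentages):
        if index > 0:
            # Timed delay before moving on to the next rollout stage.
            sleep(wait_seconds)
        if not deploy(percent):
            # A failed stage halts the rollout; nothing continues forward.
            break
        completed.append(percent)
    return completed
```

For example, with hypothetical stages of 10, 20, 50 and 100 percent and a `deploy` callback that fails at 20, only the 10% stage completes and the 50% and 100% stages are never attempted. The `sleep` parameter is injectable so the timing can be exercised without actually waiting.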
We will manage this process in more detail once we tackle Incremental Rollout as a first-class concept.
This MVC does not contain detection/prevention of users running simultaneous timed incremental rollouts.
Waiting jobs will appear as follows:
When canceling a scheduled/timed job, it should cancel the scheduled aspect but still leave it available as a manual job that can be run or canceled. This lets you say "please don't continue automatically" but still have the power to do things by hand.
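The cancel behavior described above amounts to a small state machine. The following is a minimal sketch under stated assumptions: the `JobState` names and the `TimedJob` class are hypothetical and do not reflect the real job model; only the scheduled-to-manual and manual-to-run-or-canceled transitions come from this issue.

```python
from enum import Enum


class JobState(Enum):
    SCHEDULED = "scheduled"  # waiting for its timer to elapse
    MANUAL = "manual"        # will only run when started by hand
    RUNNING = "running"
    CANCELED = "canceled"


class TimedJob:
    """Hypothetical model of a timed job and its cancel semantics."""

    def __init__(self) -> None:
        self.state = JobState.SCHEDULED

    def timer_elapsed(self) -> None:
        # Only a job that is still scheduled starts automatically.
        if self.state is JobState.SCHEDULED:
            self.state = JobState.RUNNING

    def cancel(self) -> None:
        if self.state is JobState.SCHEDULED:
            # Canceling removes the schedule but keeps the job available.
            self.state = JobState.MANUAL
        elif self.state is JobState.MANUAL:
            self.state = JobState.CANCELED

    def run(self) -> None:
        # A manual job can still be started by hand.
        if self.state is JobState.MANUAL:
            self.state = JobState.RUNNING
```

In this sketch, the first `cancel()` on a scheduled job means "please don't continue automatically", and a later `run()` or second `cancel()` decides what happens by hand.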
Default AutoDevOps (generated config) wait between steps: 5m. The user can edit this value in the YML afterwards to change the timeout.
    job:
      when: in 1hour
Links / references
- Depends on #1589 (closed)
- Alternate: Canary deploys: #1659 (closed)
- Tweet: https://twitter.com/officesunshine/status/821787299084173320
- Pausing and Resuming a Deployment
- Canary deployments