To reduce operational risk when rolling out new code, a best practice is to roll out changes to a fleet incrementally, pausing at defined breakpoints, and optionally letting metrics pause or abort the rollout.
- Pick a good rollout standard such as 1%, 5%, 10%, 20%, 50%, 100%
- Roll out to each percentage in turn, wait for a specified amount of time, then continue to the next
- Ability for manual abort of rollout
- Future: Ability for monitoring metrics to automatically abort rollout
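
The steps above can be sketched roughly as follows. This is only an illustration, not a proposed implementation: `set_percentage` and `should_abort` are hypothetical callbacks standing in for whatever actually shifts traffic and whatever signals an abort (a manual flag today, a metrics check in the future). Rolling back to 0% on abort is also an assumption; an abort could just as well freeze the rollout at its current stage.

```python
import time
from typing import Callable, Iterable

# Stage percentages taken from the example standard in the list above.
DEFAULT_STAGES = (1, 5, 10, 20, 50, 100)


def incremental_rollout(
    set_percentage: Callable[[int], None],  # hypothetical: applies a fleet percentage
    should_abort: Callable[[], bool],       # hypothetical: manual flag or metrics check
    stages: Iterable[int] = DEFAULT_STAGES,
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Walk through the stages, waiting between them.

    Returns True if the final stage was reached, False if the rollout
    was aborted (in which case the percentage is reset to 0).
    """
    stage_list = list(stages)
    for i, pct in enumerate(stage_list):
        # Check for an abort before each step, i.e. after the previous wait.
        if should_abort():
            set_percentage(0)  # assumption: abort means full rollback
            return False
        set_percentage(pct)
        # No need to wait after the final stage.
        if i < len(stage_list) - 1:
            sleep(wait_seconds)
    return True
```

For example, `incremental_rollout(apply_pct, abort_flag.is_set, wait_seconds=1800)` would pause 30 minutes between stages, where `apply_pct` and `abort_flag` are whatever the deployment tooling provides.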
Links / references
- Depends on #1589 (closed)
- Alternate: Canary deploys: #1659
- Tweet: https://twitter.com/officesunshine/status/821787299084173320
- Pausing and Resuming a Deployment
- Canary deployments
I wonder if a simpler, "boring" solution of requiring manual increment/halting might be better as a first iteration than trying to do automatic rollouts. There's a danger of a rollout getting stuck because someone forgot to increment it, but the implementation simplicity might be worth that risk. Also, I have a feeling people like manual rollouts, partly because people just like control, but also because a lot of testing and QA isn't automated yet, and rollouts can span days or weeks, not just a 30-minute cycle. I know in the past I've turned on "scary" changes for 5% of users, carefully watched stats for days, reverted the change when problems were spotted, fixed those problems, and then started the rollout again.
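
To make the manual variant concrete, here is a minimal sketch of the state such a first iteration might track. The `ManualRollout` class and its method names are hypothetical, and the stage list simply reuses the example percentages from the description; nothing here advances on its own, so a person has to call `increment()` for every step.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ManualRollout:
    """Hypothetical state for a manually driven rollout."""

    stages: Tuple[int, ...] = (1, 5, 10, 20, 50, 100)
    index: int = -1        # -1 means nothing has been rolled out yet
    halted: bool = False

    @property
    def percentage(self) -> int:
        """Current share of the fleet running the new code."""
        return 0 if self.index < 0 else self.stages[self.index]

    def increment(self) -> int:
        """Advance one stage; stays at the last stage once it is reached."""
        if self.halted:
            raise RuntimeError("rollout is halted; resume or revert first")
        if self.index < len(self.stages) - 1:
            self.index += 1
        return self.percentage

    def halt(self) -> None:
        """Freeze the rollout at the current percentage."""
        self.halted = True

    def resume(self) -> None:
        """Allow increments again after a halt."""
        self.halted = False

    def revert(self) -> None:
        """Back the change out entirely so it can be fixed and restarted."""
        self.index = -1
        self.halted = False
```

The "turn it on for 5%, watch, revert, fix, start again" workflow then maps to two `increment()` calls, a `revert()`, and a fresh `increment()` once the fix is in.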
@markpundsack Definitely great ideas to help us transition from just a packaged software company to a services company (i.e. GitLab.com).
Would the boringest version of this be just coordinating with the infra team and establishing a runbook / playbook that allows product to review a handful of user-facing-ish metrics during an incremental rollout?
cc @mydigitalself for GitLab.com awesomeness.
Relevant: Slide 12 of http://www.slideshare.net/LarsWander/orchestrating-vm-container-deployments
Kubernetes deployments API: can't be paused deterministically for validation and rollbacks are always linear. Spinnaker lets you define a rollout plan (1%=>2%=>10%=>50%=>100%) with validation at each stage.