Today, GitLab CI believes that deploys are binary, all-or-nothing affairs. But some teams use blue/green deploys, canary deploys, and traffic vectoring to reduce risk during deploys. This can actually take many shapes as people interpret the idea differently to suit their needs.
Blue/green deploys Create two (nearly) identical environments, arbitrarily called blue and green. One isn't better than the other, just separate.
With all traffic going to green, load new code onto blue, get it ready, and then switch the router to suddenly, as quickly as possible, send all new traffic to blue.
Canary deploys, incremental rollouts, traffic vectoring Similar to blue/green deploys, you have a new version of code running at the same time as old code, but instead of cutting over 100% in one moment, you send a portion of traffic to the new version. This can be done in several ways:
- Given X instances/containers, load the new code on a portion of the instances/containers.
- or Set the router to send a percent of traffic to the new instances/containers running the new code (traffic ventoring).
The first is easier conceptually and fits well with Docker workflows where it's easy to destroy and create new containers running new images. The second is easier when you have stable instances, are willing to allocate double capacity, or when instance granularity is too large. e.g. if you only have 2 instances in production, the minimum granularity is 1 instance, which is 50% of your capacity. If you want only 10% of traffic to test the new changes, you could use a third instance and send 10% of the total traffic there.
Often when people refer to canary deploys, it's a fixed, small percentage of instances/containers/traffic. Like 10%. Or maybe just 1 instance. But it's possible to extrapolate from there to make an incremental rollout where you start with a small canary deploy, then increase the cutoff gradually, say from 1% to 10% to 25% to 50%, until finally reaching 100%. If at any point, a problem is detected, it should be easy to revert back to 0% to fallback to the old code.
Links / references
Monitoring stats in merge requests make it easier to see if the change in code or the change in environment/time was the cause.
Multiple canaries need to be in production at the same time if you have many merges on master per day and a long time per canary. But that means that you need to stop all canaries in flight in case of a anomaly. So won't assume that for now. Need to say in MR button: merge & deploy after 2 changes are done.
Canary is branch in repo or revision, feature flags are branch in code path. We will do Canaries after merging into master.
All Canaries should be automatically shifting percentages, 1/5/20/50/100%. Automatically go back to 0, revert the merge, and open an issue if a significant negative change is detected. Or just go 1% per x seconds so that it is easier for the container scheduler to adapt. Also want to look into the Kubernetes tooling that is available.
All Canaries should be tracked with a metric in Prometheus.
FWIW, we are already using blue/green deploys with Gitlab. Our environment names include both a color and an actual environment, i.e. production-green