Feature flags are great because they let you roll out features to specific groups or percentages of users, which limits risk. Canaries also limit risk, but usually at the system level rather than the user level. With session affinity, users keep hitting the same canary pod (while it exists), which expands the use cases of canaries to include UX changes. But used this way, canaries still don't let a company control which users get to see the canary.
By combining the ideas behind each, perhaps we can make it easy to add something like a gatekeeper that routes people to backend pods based on group or percentage rules. This is related to traffic vectoring, but traffic vectoring is usually just about sending a percentage of traffic to one set of pods versus another. A gatekeeper goes further by understanding who is behind the traffic.
A gatekeeper thus has much the same benefits as feature flags, in that changes are shipped ASAP and deployment and delivery are decoupled. But it also has some additional advantages:
- You can monitor changes in system performance, not just response metrics. e.g. if a new change increases memory usage, it's hard to see that when only using feature flags, but easy to see when using canaries
- You can deploy and isolate a broader set of changes. e.g. you can test out complete refactors or system changes that are harder, if not impossible, to put behind feature flags
- You can still quickly abort by switching routing rather than having to wait for another rollout
- For really large teams that can't afford to roll out every change to the entire fleet on every push to master, deploying to a smaller canary fleet can be more viable. E.g. perhaps every push goes to canary, but deploys to production only happen once an hour.
Proposal
- Start from existing canary deployments.
- Add session affinity based on cookies and a load balancer that routes based on cookie values.
- Then add explicit control over that routing so that subsequent traffic doesn't just go to the same pod, but goes to wherever we send it.
- Since routing has to be incredibly fast, the data needed to make the routing decision has to be pre-loaded onto the load balancers. E.g. an LB can't make an API call to route, but it could do a table lookup on local data that is pushed to it periodically: store user IDs in the cookie, and send the IDs of all company employees to the LBs. For incremental rollouts, do something deterministic like (ID % 100) < 5. That results in the same people always getting hit with the canary, which is probably not great, so something a little more distributed, yet still deterministic, would be better; perhaps pick the buckets once per feature with feature_x = (0..99).to_a.shuffle.take(5), push that list to the LBs, and check feature_x.include?(ID % 100).
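To make the lookup idea above a bit more concrete, here is a rough Ruby sketch of how the routing decision might work. It is only an illustration under some assumptions: the names canary_buckets and route, the shape of the rules hash, and seeding the shuffle from the feature name are all hypothetical and not part of any existing load balancer or GitLab API.

```ruby
require "set"
require "digest"

# Runs centrally, off the request path. Picks which buckets (out of
# bucket_count) see the canary for one feature. Seeding the shuffle from the
# feature name (an assumption of this sketch) makes the pick repeatable, and
# different features will usually land on different buckets.
def canary_buckets(feature, percent, bucket_count = 100)
  seed = Digest::MD5.hexdigest(feature)[0, 8].to_i(16)
  shuffled = (0...bucket_count).to_a.shuffle(random: Random.new(seed))
  shuffled.take(bucket_count * percent / 100).to_set
end

# Runs on the load balancer for every request: in-memory lookups only.
# user_id is whatever was read from the cookie, or nil when there is none.
def route(user_id, rules)
  return :production if user_id.nil?
  return :canary if rules[:employee_ids].include?(user_id)
  rules[:buckets].include?(user_id % rules[:bucket_count]) ? :canary : :production
end

# Hypothetical rule table that would be pushed periodically to each LB.
rules = {
  employee_ids: Set[7, 42],
  buckets: canary_buckets("feature_x", 5),
  bucket_count: 100
}

route(42, rules)   # employee, so :canary
route(nil, rules)  # no cookie, so :production
```

With this shape, aborting a canary would just mean pushing a table with empty employee_ids and buckets, after which every request routes to production without another rollout.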
Links / references
- Partially inspired by https://code.facebook.com/posts/270314900139291/rapid-release-at-massive-scale/
- Related: #1660 (closed)