Canary deployments

moved from gitlab-ce#26961

changed milestone to %9.1

simplify, pick a good rollout standard 1,5,20,50 and timing, not convinced canary should be a stage since it is production in various sizes, if you can do a rollback you can do a partial rollout, slider in UI at some point, deep integration in GitLab, no manual approvals needed, making canary a job might paint us into a corner or we might be able to give it a percentage, why not give production a percentage? Run migrations at start of the production, production deploy is a canary deploy with 100%, to not have a canary merge and go forward button

Created https://gitlab.com/gitlab-org/gitlab-ee/issues/1660 for timed rollout.

I think both directions have merit, but early research indicates canaries might be easier to do with Kubernetes using existing solutions. Kubernetes does allow you to pause/resume a rollout, but I haven't yet found a way to define that progress, so you'd have to have a process watching the progress and pause/resume programmatically. That seems to add a significant complication on our side. Might be worth doing, but my current thinking is that it's easier to start with canary as a stage in a pipeline with a manual action to promote to production. We can then learn more before investing in a more elaborate solution.

Would love to hear from Kubernetes experts (@ayufan @twk3 @WarheadsSE @pchojnacki) on the feasibility of incremental rollouts vs canary rollouts. Also from folks that have done either in production, and where the best bang/buck lies.

added ~1069110 label

removed ~897283 label

added ~333913 label

mentioned in issue #1660 (moved)

This is a portion of k8s' capabilities that I am not familiar with yet. This will require some research, at least on my part.

While I can't comment yet about the feasibility of implementing incremental rollout in k8s.

But from the perspective of someone who worked with and implemented these kind of deployment strategies but using Marathon instead of Kubernetes.

I can tell that canary deployments give a lot of the same benefits as incremental rollout and are easier to implement. I've been working on similar setups using Mesos/Marathon, and we decided to start implementing sort of hybrid approach called Blue/Green deployment.

In short it's a reversible way to promote Canary to Production, by slowly morphing Canary by adding more capacity and directing more traffic to it. Allowing instant switch back to old code in case of encountered problem.

Both Canary and Blue/Green deployments work very well with both Kubernetes and Marathon. Because conceptually they are just separate apps/pods and thus are easy to scale and control traffic routing.

As per !1660 (merged) I feel that this kind of rollout crosses territory of Cluster schedulers (k8s, mesos/marathon) which usually handle this sort of timed incremental deployment of all instances (depending on rollout configuration). And for cluster schedulers are only means to an end which is just deploying new configuration without too much disruption. I wonder what kind of benefits with !1660 (merged) we could bring to our users that the Cluster schedulers do not already bring?

Thanks @pchojnacki, that's helpful. What do you see as the advantages of blue/green over canary? They're not solving for the exact same problem. Depending on your definition, blue/green is about instantaneous transitions, so slowly scaling a canary deploy to take on more traffic isn't really blue/green. But I guess you can use some of the same tools behind blue/green and just treat them as a scalable canary. I've also called that traffic vectoring; where you send some portion of traffic to a different fleet.

Blue/green eliminates downtime, while providing quick rollback, but does nothing else to limit exposure of a bug after deployment. Whereas canary deploys would explicitly impact a smaller percentage of requests while tests/monitoring could evaluate fitness of the new version.

I imagine both tools would be useful, for different reasons.

Use blue/green where a rolling deploy would have downtime. For example if you are adding a new form, you can't have one hit render the form, but then when a user submits the form, it hits an old version of code which doesn't understand the new form. Blue/green would avoid that problem. This simple example could be worked around with pod-affinity which was common for Java pre-Docker, but not 12-factor apps. I'm sure there are better examples.

Use canary for "risky" deploys, or just general risk-reduction.

Good point about this bleeding into container scheduler territory. Although as Mandy pointed out, we could implement a controller that just augments Kubernetes to do a timed, incremental rollout strategy.

Net-net, I believe that doing canary first still makes sense. And in particular I'm more interested in it over a traditional blue/green because it leverages our prometheus monitoring work (to evaluate the fitness of the canary deploy).

@pchojnacki FWIW, I think that referenced article on Mesos blue/green doesn't actually deliver what Martin Fowler describes for blue/green: "Once the software is working in the green environment, you switch the router so that all incoming requests go to the green". The Mesos article instead delivers more what you mentioned, where it was a controlled vectoring from blue to green, taking one or more task instances out at a time, but not all at once.

It's totally reasonable, and there's tons of subjective interpretation of what blue/green means, but it's a bit strange when they even explicitly link to the Fowler article that describes something different. :)

mentioned in issue #1829

FWIW my understanding of canary deployments with k8s is the following:

Your Service uses a selector that incorporates all versions past, present, and future of that app. For example app: worker
You then set up multiple deployments, which have an additional unique selector but retain the shared service selector. For example it could be release: stable is current and release: canary is the canary flavor.
You then ramp up the canary deployment to desired size, and scale down the current version to balance.

The service will spread it's traffic across both deployments because its selector matches them, and you can then individually control the size/rate/version.

It is hard to give this a Deliverable label as we are not sure what we want to achieve as part of this MR and how will it look after we evaluate possible options.

added ~126459 label

as i presume this needs visuals?

@markpundsack I think there is no general consensus about what is Canary and what is Blue/Green actually.

To me when writing above post Canary was just single special instance that got latest version and when that version graduated from canary it was deployed separately to main service cluster.

I've been thinking about it a little bit and I believe it would be worthwhile if we would support different modes. E.g. Single instance Canary, Canary that expands to become full production etc. And allow both instant and gradual vectoring between new and old version. Although maybe not in the first iteration but it would be good to keep that in mind.

Since the Canary's definition might mean different things to different people. I think we need to specify explicitly which variant we want to implement.

It would be alse awesome if we would start using it internaly e.g. to deploy license.gitlab.com or other simple service. This way we'll know from our experience the problems other people might have with it.

That's an excellent suggestion!

@regisF what do you think about getting license or other properties deployed to k8s with canary deploys?

@pchojnacki @markpundsack the better fit, I think, would be customers.gitlab.com. That would allow us to experience canary deploys with a property that is used by more people than license.gitlab.com. We could then A/B test pricing pages or signup forms or whatever and play with monitoring.

However, does canary deploy mean you need a load balancer?

Great to internally drink our own champagne first. Since this is targeted for auto-deploy scripts though, wouldn't the service need to run on a supported scheduler like k8s?

Yes, you'd need a load balancer. You'd also need it deployed on kubernetes on GKE. :)

I did take a look at https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#canary-deployments and figured out that we could do it like this.

Thoughts

Extend auto-deploy .gitlab-ci.yml with canary stage,
Add canary job to .gitlab-ci.yml that is either manual or auto,
The deploy canary will create a new deployment: app-canary with special label: track: $CI_JOB_NAME,
Service will automatically route all traffic to two runnings deployments with different tracks,
When deploying production using manual we will not remove app-canary, but downscale it to 0 replicas, and update app deployment,
We will extend deployboard to read multiple deployments and read track label. We will show different tracks with different colors on deployboard to indicate different versions,
New canary deployments will start by default with 1 replicas, we would store a number of desired replicas in annotations of deployments to reset it each time when a new canary is done.
(next release) As next step would be rollout deployments: it would be changing number of replicas of deployments: app-canary and app with different tracks. We would not do that as part of CI job, but as part of direct call to Kubernetes, auto-deploy will on next deploy use preconfigured number of replicas.
Mechanics for canary and roll out deployments would work as well with any environment, not only production.

What needs to be done

Update auto-deploy to support canaries,
Update gitlab-ci-yml templates,
Update backend for deploy boards to fetch multiple deployments and aggregate all pods,
Introduce better tooltips and new colors for canaries on deployboard.

.gitlab-ci.yml

stages:
  - build
  - test
  - review
  - staging
  - canary
  - production
  - cleanup

canary:
  stage: canary
  script:
    - command deploy canary
  environment:
    name: production
    url: http://$CI_PROJECT_NAME.$KUBE_DOMAIN
  when: manual
  only:
    - master

production:
  ...

This is great @ayufan and pretty much what I was thinking. Deploying to canary would be a manual step before deploying to production. It could/should be a blocking job to enforce the process, although I could see people wanting to override that sometimes. It will only be on master.

To clarify, both canary and production jobs would deploy to the same named environment production. I think that's the right approach, so these show up in a single environment in the UI. If nothing else that looks cooler, and more easily understood, vs having a canary environment separate from production and just having to know that they're both routed to from the same inbound traffic.

Since Canary deploy is an EEP feature, we'll need distinct auto-deploy templates; one for CE and one for EEP. I have no idea how we'll technically support a different template for EEP vs EES. Maybe the template will have to be in EES, but the visualization of the deploy will have to be EEP-only. I think canary deploys should be optional anyway, so having EE have two choices might make sense. Of course once you involve databases, we might have combinational explosion.

One thing I'm not sure how to represent is the concept of "latest deploy". If an environment is running two deploys, do we list them both as latest? What would the UI look like? (/cc @dimitrieh) What does the deployment history look like? Maybe add a column for whether it was canary or full? Each one is still a separate deploy, I believe. When we get to incremental deploys, it'll be more complicated because a single k8s deployment will be scaled. Perhaps we'll still keep an ID for each scale operation.

Replica quantities are already complicated. Auto-deploy hardcodes replicas: 1 all over the place. I'm considering adding a project-level variable to specify the target replicas for production. If we go that route, we could add another variable for canary replicas. One downside to variables is that they don't retrigger a deploy. Heroku treats every env var change as a new release; and thus a new deploy. But we don't, so setting that variable means someone has to manually re-deploy. Maybe that's OK (for now). As long as the variable is used in the deploy job (as opposed to a build job), then at least it's all possible. (I also wonder if some variable changes would require a new build, not just a new deploy. We also need to figure out a generic way to pass project variables to the pods.)

The comment about downscaling the canary to 0 conflicts a bit with the incremental rollout part, where we'd likely grow the canary to 100%. That's more of a blue/green thing though. Is it possible to re-label the deployment once it's scaled up? Like call it canary until it hits 100%, then just relabel it as production track? I'd rather not keep track of the green/blue-ness of the environment.

What happens if someone deploys SHA X to canary, then deploys SHA Y to canary, without first finishing deploying X to production? I assume Y would overwrite X only for canary, rather than creating a second canary (although the latter is a really cool idea). I assume we can't then block X from going all the way to production, but it won't be the "easy" thing to do. e.g. in the UI, there should be a button to promote canary to production, and that would take whatever is currently running in canary. Likewise from chatops. But if someone digs through to find the second-last pipeline on master, they could always manually trigger the deploy. That's already the case. We don't put up any warnings about it. Perhaps we should warn if deploying non-HEAD?

To clarify, both canary and production jobs would deploy to the same named environment production. I think that's the right approach, so these show up in a single environment in the UI. If nothing else that looks cooler, and more easily understood, vs having a canary environment separate from production and just having to know that they're both routed to from the same inbound traffic.

Yes. We would use the same logical environment on GitLab side.

One thing I'm not sure how to represent is the concept of "latest deploy". If an environment is running two deploys, do we list them both as latest? What would the UI look like? (/cc @dimitrieh) What does the deployment history look like? Maybe add a column for whether it was canary or full? Each one is still a separate deploy, I believe. When we get to incremental deploys, it'll be more complicated because a single k8s deployment will be scaled. Perhaps we'll still keep an ID for each scale operation.

This is where more tight approach with Kubernetes would help us, as we could detect that previous deployment is still running on the cluster, but we would still show the last action, basically the last deployment, in that case, canary which is happening.

Replica quantities are already complicated. Auto-deploy hardcodes replicas: 1 all over the place. I'm considering adding a project-level variable to specify the target replicas for production.

Which is also bad that we do not allow it to be changed, as this is also something that will be overwritten later. I was thinking that we should configure it initially, and later allow some of the parameters to be persisted.

The comment about downscaling the canary to 0 conflicts a bit with the incremental rollout part, where we'd likely grow the canary to 100%. That's more of a blue/green thing though. Is it possible to re-label the deployment once it's scaled up? Like call it canary until it hits 100%, then just relabel it as production track? I'd rather not keep track of the green/blue-ness of the environment.

I think that it should be possible. I would have to play around with k8s API to see how it behaves and what possibilities we do have, as it seems that it goes in the direction where we remove current stable and make a canary to become our current stable.

I was thinking about something simple:

We start canary deployment,
Instead of rename, we create another deployment on production that will overwrite the previous deployment.

If we could somehow "merge" these two deployments, put current canary on top of current production this is probably what we need.

What happens if someone deploys SHA X to canary, then deploys SHA Y to canary, without first finishing deploying X to production? I assume Y would overwrite X only for canary, rather than creating a second canary (although the latter is a really cool idea). I assume we can't then block X from going all the way to production, but it won't be the "easy" thing to do. e.g. in the UI, there should be a button to promote canary to production, and that would take whatever is currently running in canary. Likewise from chatops. But if someone digs through to find the second-last pipeline on master, they could always manually trigger the deploy. That's already the case. We don't put up any warnings about it. Perhaps we should warn if deploying non-HEAD?

This is tricky. I guess it depends on. Currently, we would only allow having single canary, but it seems to be possible to have multiple. It depends on how we make the promote, as for promoting can effectively get either existing canary, or get that from CI job where we do know exactly which SHA should be. This creates a potential problem, maybe a way would be to detect that we have canary deployed, and if we try to deploy production and the canary is different than our canary, just fail it as we have the environment in the incosistent state.

@dimitrieh @joshlambert How we should show different "canaries" on deployboard?

@ayufan I agree with your thoughts Kamil, that we use a different color to indicate. My suggestion is that we have a separate background color for the "canary" deploy, leaving "blue" to be the current "GA" deploy. Once the version currently running as canary is promoted to full deployment, it then takes over the blue color and today's deploy board functionality can essentially resume.

For the canary, I would suggest the frame always be a different color, so the deploy status can still show through (for example red if the canary deploy failed) and then when it is successfully deployed, the inside can then fill out with the same color as the frame.

Initial color thought was yellow for the canary, but that generally means caution so probably not a good pick. Maybe an light orange? @dimitrieh is probably best here for sure.

mentioned in merge request !1551 (merged)

This is minimal deployment template to indicate that this deployment is canary:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: production
    track: canary
  name: production-canary
  namespace: simple-app
spec:
  replicas: 5
  selector:
    matchLabels:
      name: production-canary
  template:
    metadata:
      labels:
        name: production-canary
    spec:
      containers:
      - image: path/to/my/image

@dimitrieh Here are the previous colors: https://gitlab.com/gitlab-org/gitlab-ee/issues/1589#note_23450741.

Another thought on Canary is to group the Canary pods off on their own with a unique frame, background, or some other identifier as opposed to labelling the canary pods themselves. @dimitrieh what are your thoughts?

@joshlambert

We need to label pods, as this is how we can fish the data.

@ayufan Yep, for sure on the k8s side. I was just trying to offer an alternative way to visually display the difference. Rather than putting a color inside the box itself to denote, we could simply visually group them with like a background frame or other indicator. This way we could leave the pod graphic itself alone.

added ~111707 label

mentioned in issue #2076 (closed)

@ayufan @joshlambert @markpundsack Canary deploys in the deployboard are visualised like here: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1551#note_26595150

As for showing canary deploys in the environment detail page. I think adding an extra column is the best choice. Also as we want to display how many replica's it will take up:

For the deployboard we may think of showing tooltips with a deployment identifier. Think of the commit id + name and perhaps a tag if its available. Making this closely related to https://gitlab.com/gitlab-org/gitlab-ce/issues/19432

@dimitrieh It will be visible as Job currently.

@ayufan can you elaborate? :)

@markpundsack @ayufan as discussed in the meeting:

column will not be added now (canary will be visible in the job column as name)
commit sha will be used as unique identifier (version)
deployboard will feature a tooltip which states the fact that it is a canary deployment yes or no, aside from the status of the instance