GitLab Web Canary should take random traffic and not have affinity with `gitlab-org` and `gitlab-com` namespaces

This issue is an attempt at combining several conversations which are happening on other issues at present.


At present, we use a mixed strategy for sending traffic to the web service cny stage: a combination of random traffic (opt-in) and namespace routing (opt-out).

The problem is that, at present, the namespace affinity for `gitlab-org` and `gitlab-com` means that traffic destined for the Gitaly canary stage is more likely to be sent to Web canary, since Gitaly canary hosts many `gitlab-org` and `gitlab-com` projects.

Unfortunately, Gitaly cny is currently overloaded, with very high CPU utilization; this has been documented and is being addressed in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/470 and scalability#619 (amongst others).

The problem is that we have a knock-on effect, where poor performance on `file-cny-01-stor-gprd.c.gitlab-production.internal` is "leaking" downstream to web/cny due to the namespace affinity.

@jarv wrote:

Also commented on newsletter issue but this is probably a better place for the discussion:

https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12041#note_463393797

a) remove the namespace routing and just take a flat percentage, between 1 and 5 percent
b) keep the namespace routing, and also take a percentage. Though this means we will probably be looking at sending 10 to 20% of traffic to canary.

There might be something between (a) and (b) where we reduce the number of paths we route to canary.

I'm not sure how much traffic we need to take into canary to even things out, but including all traffic from the `gitlab-{org,com}/` namespaces means that we are taking quite a lot. Perhaps we could combine reducing the routed paths to just `/gitlab-org/gitlab` with a small percentage from main?

@jarv also wrote:

How much traffic would we need to be sending to re-create a comparable one hour baking period?

The request rate for canary is ~400 req/sec, while the main stage receives ~6k req/sec.

I guess our two options are to either:

a) remove the namespace routing and just take a flat percentage, between 1 and 5 percent
b) keep the namespace routing, and also take a percentage. Though this means we will probably be looking at sending 10 to 20% of traffic to canary.

there might be something between (a) and (b) where we reduce the number of paths we route to canary. @andrewn were you thinking (a)?

If we hit a problem on Canary, would this approach mean we always need to drain?

Yes, and the difficulty will be that we need to keep enough capacity in the VM main stage to take canary traffic. This will work much better when the front-end is running in K8s.
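For reference, the request rates quoted above imply canary's current share of web traffic directly (a quick check, assuming the ~400 req/sec and ~6k req/sec figures are representative):

```python
canary_rps = 400   # approximate canary request rate, from the discussion above
main_rps = 6000    # approximate main-stage request rate

# Canary's share of total web traffic under the current routing.
share = canary_rps / (canary_rps + main_rps)
print(f"{share:.1%}")
```

So canary currently serves roughly 6% of total traffic, which is the baseline any flat-percentage replacement would need to match or exceed to keep a comparable baking signal.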

Proposal

Remove namespace routing and use a flat percentage - I prefer 10%. Keep in mind that we also have users opting in to canary (via next.GitLab.com etc.), so the true percentage is likely to be higher than 10%.
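The split itself would be implemented at the load balancer, but the flat-percentage decision can be sketched as follows (the function and constant names here are illustrative, not taken from our actual config):

```python
import random

CANARY_PERCENT = 10  # proposed flat percentage (hypothetical constant name)

def pick_stage(rng=random):
    """Return which stage serves a request under flat-percentage routing.

    There is no namespace affinity: every request has the same chance
    of being sent to cny, regardless of which project it targets.
    """
    return "cny" if rng.random() * 100 < CANARY_PERCENT else "main"

# Rough check: over many simulated requests, ~10% land on canary.
random.seed(0)
n = 100_000
hits = sum(pick_stage() == "cny" for _ in range(n)) / n
print(f"canary share: {hits:.1%}")
```

Because the decision no longer inspects the request's namespace, there is no way to exempt specific projects when canary misbehaves, which is why draining (per the discussion above) becomes the only remediation.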

After the problems with Gitaly cny have been resolved, we could consider reversing this decision.

My concern at present is that cny alerts are being ignored due to the noise, and the same issue is also disrupting the release scripts.

cc @amyphillips @marin