Move gitlab-org/gitlab to Gitaly Cluster

Spun out of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12118

Background

Currently on GitLab.com, we host gitlab-org/gitlab on our Canary Gitaly host.

Unfortunately, this host experiences frequent saturation issues, due to high CI load, huge tag and branch cardinality, massive MRs, and other slightly pathological traffic that we send to ourselves for this repo.

These slow requests increase the latency of the Gitaly node, leading to numerous alerts, silences and investigations. Two examples are https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/470 and scalability#619.

Because our Gitaly cny stage also has affinity with our Web cny stage, we also see knock-on effects on the web canary. Unfortunately, this has led to silences there, creating a potentially dangerous situation in which we miss Canary alerts on that stage.

One approach to resolving this issue was to break the affinity between Web cny and Gitaly cny, by removing the so-called "per-namespace opt-out" that we use to send gitlab-org/* traffic to Web canary. This is discussed in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12118.

One downside of breaking this affinity is that GitLab staff would be less likely to be running Canary, and this may adversely impact the informal QA effect that Canary brings.

Proposal

Since the main saturation point is CPU on read activity to the gitlab-org/gitlab repository, one potential solution would be to:

Move gitlab-org/gitlab, and possibly forks, to a Praefect cluster, with Distributed Reads enabled.

This way, read activity, particularly UploadPack and FindCommit traffic, would be spread across 3 nodes instead of 1, which may help reduce latency for this repository.
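As a rough illustration of the expected effect, the sketch below compares per-node read counts for one node versus three. It assumes each read is routed to a uniformly random node, which is a simplification and not a description of Praefect's actual routing logic. The node names and the request count are hypothetical.

```python
import random
from collections import Counter


def distribute_reads(num_requests, nodes, seed=0):
    """Assign each read to a uniformly random node and count per-node load.

    Assumption: uniform random selection. Real routing may also consider
    replica freshness or node health, which this sketch ignores.
    """
    rng = random.Random(seed)
    counts = Counter({node: 0 for node in nodes})
    for _ in range(num_requests):
        counts[rng.choice(nodes)] += 1
    return counts


# Hypothetical node names, for illustration only.
single = distribute_reads(30_000, ["gitaly-cny"])
cluster = distribute_reads(
    30_000, ["praefect-node-1", "praefect-node-2", "praefect-node-3"]
)

print("single node:", dict(single))
print("three nodes:", dict(cluster))
```

Under this assumption each node in the cluster serves roughly a third of the reads. Writes are not modelled here, since they still have to reach every replica.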

It would also give us an opportunity to dogfood Gitaly Cluster's distributed reads functionality on the busiest repository on GitLab.com.

Note that Packfile sharing (scalability#688) will have a major impact on this activity, but it is probably still several months away from production.

Potential Complications

  1. Distributed Reads were disabled due to performance issues. gitlab-org/gitaly#3334 (closed) is tracking the re-enablement of this feature, but does not give timeframes.
  2. Availability of the gitlab-org/gitlab project, and its forks, is important. We need to investigate how quickly we could do this migration, and when the best time to do it would be.
  3. Would Git delta islands for objects shared between gitlab-org/gitlab and its forks be a complicating factor?
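For the migration itself, one possible mechanism is the project repository storage moves API (`POST /projects/:id/repository_storage_moves` with `destination_storage_name`). Whether this migration would actually use that API is an assumption; this issue does not say. The sketch below only builds the request and does not send it. The storage name and token are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request


def build_storage_move_request(base_url, project_path, destination_storage, token):
    """Build (but do not send) a request that schedules a repository storage move."""
    # The API accepts a URL-encoded project path in place of the numeric ID.
    project_id = urllib.parse.quote(project_path, safe="")
    url = f"{base_url}/api/v4/projects/{project_id}/repository_storage_moves"
    body = json.dumps({"destination_storage_name": destination_storage}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )


# Hypothetical storage name and token, for illustration only.
req = build_storage_move_request(
    "https://gitlab.com",
    "gitlab-org/gitlab",
    "praefect-cluster-example",
    "<admin-token>",
)
print(req.get_method(), req.full_url)
print(req.data.decode())
```

Sending the request (for example with `urllib.request.urlopen(req)`) requires administrator access. The timing and availability questions raised above would still need to be answered first.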

Further Reading

  1. Distributed Reads: https://docs.gitlab.com/ee/administration/gitaly/praefect.html#distributed-reads
  2. Reads distribution feature should be disabled by default gitlab-org/gitaly#3334 (closed)
  3. Add direct connection configuration to praefect https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12033

cc @awthomas @albertoramos @zj-gitlab @8bitlife @brentnewton

Edited by Andrew Newdigate