Move gitlab-org/gitlab to Gitaly Cluster
Spun out of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12118

## Background

Currently on GitLab.com, we host `gitlab-org/gitlab` on our Canary Gitaly host. Unfortunately, this host experiences frequent saturation issues due to high CI load, huge tag and branch cardinality, massive MRs, and other slightly pathological traffic that we send to ourselves for this repo.

These slow requests increase the latency of the Gitaly node, leading to numerous alerts, silences and investigations. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/470 and https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/619 are two examples.

Because our Gitaly `cny` stage also has affinity with our Web `cny` stage, we also see knock-on effects on the web canary. Unfortunately, this has led to silences there, creating a potentially dangerous situation in which we miss Canary alerts on that stage.

One approach to resolving this issue was to break the affinity between Web cny and Gitaly cny by removing the so-called "per-namespace opt-out" that we use to send `gitlab-org/*` traffic to Web canary. This is discussed in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12118. One downside of breaking this affinity is that GitLab staff would be less likely to be running Canary, and this may adversely impact the informal QA effect that Canary brings.

## Proposal

Since the main saturation point is CPU on read activity to the `gitlab-org/gitlab` repository, one potential solution would be to:

**Move `gitlab-org/gitlab`, and possibly forks, to a Praefect cluster, with Distributed Reads enabled.**

This way, read activity, particularly `UploadPack` and `FindCommit` traffic, would be spread across 3 nodes instead of 1, which may help reduce the latencies for this repository.
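
To make the intended effect of spreading reads concrete, here is a minimal simulation sketch. It is not how Praefect routes requests: it assumes each read is sent to one node picked uniformly at random, and the node names, the request count, and the `route_reads` helper are all hypothetical. It models read traffic only.

```python
import random
from collections import Counter


def route_reads(num_requests, nodes, rng):
    """Assign each read request to one node chosen uniformly at random.

    Hypothetical helper for illustration only; uniform random selection
    is an assumption, not Praefect's documented routing logic.
    """
    load = Counter({node: 0 for node in nodes})
    for _ in range(num_requests):
        load[rng.choice(nodes)] += 1
    return load


# Seeded generator so repeated runs give the same numbers.
rng = random.Random(0)

# Today: every read for the repository lands on a single Gitaly node.
single = route_reads(9000, ["gitaly-cny"], rng)

# Proposed: the same reads spread over three replicas (names are made up).
cluster = route_reads(9000, ["gitaly-a", "gitaly-b", "gitaly-c"], rng)

print("busiest node, single host:", max(single.values()))
print("busiest node, 3-node cluster:", max(cluster.values()))
```

Under these assumptions the busiest node in the three-node case handles roughly a third of the reads. Real latency gains would depend on how CPU cost varies per request, which this sketch ignores.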
The move would also give us an opportunity to dogfood Gitaly Cluster's distributed reads functionality on the busiest repository on GitLab.com.

It should be noted that Packfile sharing (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/688) will have a major impact on this activity, but this is probably still several months away from production.

### Potential Complications

1. Distributed Reads were disabled due to performance issues. https://gitlab.com/gitlab-org/gitaly/-/issues/3334 is tracking the re-enablement of this feature, but does not give timeframes.
1. Availability of the `gitlab-org/gitlab` project, and its forks, is important. We need to investigate how quickly we could do this migration, and when the best time to do it would be.
1. Would Git delta islands for objects shared between `gitlab-org/gitlab` and its forks be a complicating factor?

### Further Reading

1. Distributed Reads: https://docs.gitlab.com/ee/administration/gitaly/praefect.html#distributed-reads
1. Reads distribution feature should be disabled by default: https://gitlab.com/gitlab-org/gitaly/-/issues/3334
1. Add direct connection configuration to praefect: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12033

cc @awthomas @albertoramos @zj-gitlab @8bitlife @brentnewton
issue