Implement Praefect client-side load balancing and retry logic
Praefects are deployed behind a TCP load balancer. The TCP load balancer routes an incoming TCP connection to a random Praefect. The requests sent from the client through that TCP connection then always land on the same Praefect. This is an issue, for example, with Workhorse, which keeps a persistent gRPC connection open to a Praefect and reuses it for all requests. In the pathological case, all Workhorses connect to the same Praefect and send all their requests to it until the connection breaks for some reason and the Workhorses are forced to reconnect. Such a scenario could happen, for example, if all Praefects were turned off and the Workhorses are attempting to reconnect. As soon as the first Praefect becomes available, all of the Workhorses connect to it and send all of their future requests to it due to the persistent connection.

In Praefect, we can easily configure a [`MaxConnectionAge`](https://pkg.go.dev/google.golang.org/grpc@v1.48.0/keepalive#ServerParameters) in the gRPC server to gracefully close a connection after a given time. This forces the client to reconnect, thus balancing the TCP connections from the clients to Praefect more evenly. A given client still sends its requests to the same Praefect, but on average the clients connect to different Praefects and thus balance the load better. However, some fetches can run for quite some time, so connections could pile up: the clients keep reconnecting, but the old connections can't be closed while one or more long-running fetches are still on them.

A better option is to solve this with client-side load balancing. https://gitlab.com/groups/gitlab-org/-/epics/8903 also needs client-side logic, so the work done here to solve the existing issues with Praefect can later be built upon and used with the new design.

In short, this epic aims to:

1. Allow clients to resolve addresses of Praefects from a DNS record
2. Round-robin requests to them
3. Retry with another Praefect if the request failed and the error is retryable.
4. Deploy the changes to GitLab.com and document how self-managed installations can also start using this.

The implementation of this should be agnostic as to whether the server is a Praefect or a Gitaly. The load balancing would naturally do nothing against a plain Gitaly, but the retry logic should work with it as well.

### Summary

Through various analyses and a deep dive into the implementation of grpc-go and grpc-core (summarized in https://gitlab.com/groups/gitlab-org/-/epics/8971#note_1207008162 and https://gitlab.com/gitlab-org/gitaly/-/issues/4529#note_1208495828), we came to the following conclusions:

- Client-side load-balancing and auto-retry are supported via [service config](https://github.com/grpc/grpc-proto/blob/master/grpc/service_config/service_config.proto). Clients need to inject this service config as a Dial Option.
- The built-in round-robin load-balancer works really well, for both Go and Ruby clients. We don't need to re-invent the wheel.
- The built-in DNS resolver doesn't refresh the state of DNS service discovery. That leads to unexpected stickiness of a client to the established connection. We could not set `MaxConnectionAge` or `MaxConnectionAgeGrace` because those configs may leak connections (detailed explanation in https://gitlab.com/gitlab-org/gitaly/-/merge_requests/5218#note_1236753939). As a result, we need to implement a custom DNS resolver to overcome this limitation. Unfortunately, grpc-ruby doesn't support custom resolvers.
- By default, gRPC retries on transparent failures. It also supports a custom auto-retry mechanism based on returned gRPC codes. We decided to generate the retry policy for `accessor` RPCs only. Unfortunately, grpc-ruby doesn't support reflection. Thus, we cannot support this feature in Ruby.

## Status 2023-02-16

- :white_check_mark: Gitaly implementations: the two main implementations (https://gitlab.com/gitlab-org/gitaly/-/issues/4530 and https://gitlab.com/gitlab-org/gitaly/-/issues/4529) are done. They expose two Dial Options in the client library.
- :white_check_mark: GitLab Rails: client-side load-balancing was added to GitLab Rails via https://gitlab.com/gitlab-org/gitlab/-/merge_requests/107815 and https://gitlab.com/gitlab-org/gitlab/-/merge_requests/107985. The level of resilience and workload distribution in the Ruby client is subpar compared to the Go clients, but there is nothing we can do about it.
- :white_check_mark: Go clients: now that the aforementioned dial options are deployed, we need to update the Go clients. There are three Go clients to update. All corresponding merge requests are pending.
- :hourglass_flowing_sand: Documentation

```mermaid
graph TD
  issue4723["✅ #4723"]
  click issue4723 "https://gitlab.com/gitlab-org/gitaly/-/issues/4723" "Add Gitaly client-side load-balancing Dial Option to KAS"
  issue4530 --> issue4723
  issue4529 --> issue4723
  issue4722["✅ #4722"]
  click issue4722 "https://gitlab.com/gitlab-org/gitaly/-/issues/4722" "Add Gitaly client-side load-balancing Dial Option to gitlab-shell"
  issue4530 --> issue4722
  issue4529 --> issue4722
  issue4721["✅ #4721"]
  click issue4721 "https://gitlab.com/gitlab-org/gitaly/-/issues/4721" "Add Gitaly client-side load-balancing Dial Option to workhorse"
  issue4530 --> issue4721
  issue4529 --> issue4721
  issue4715["#4715"]
  click issue4715 "https://gitlab.com/gitlab-org/gitaly/-/issues/4715" "Add documentation in docs/ about DNS resolver and client auto retry"
  issue4700["✅ #4700"]
  click issue4700 "https://gitlab.com/gitlab-org/gitaly/-/issues/4700" "Add load balancing strategy to GitLab Rails"
  issue4690["⏳ #4690"]
  click issue4690 "https://gitlab.com/gitlab-org/gitaly/-/issues/4690" "Add documentation about Praefect DNS service discovery"
  issue4689["✅ #4689"]
  click issue4689 "https://gitlab.com/gitlab-org/gitaly/-/issues/4689" "Refactor Gitaly client stub in Gitlab Rails"
  issue4531["✅ #4531"]
  click issue4531 "https://gitlab.com/gitlab-org/gitaly/-/issues/4531" "Change our streaming RPCs to acknowledge the header to the client"
  issue4530["✅ #4530"]
  click issue4530 "https://gitlab.com/gitlab-org/gitaly/-/issues/4530" "Implement client-side retries in Gitaly's Go client"
  issue4529["✅ #4529"]
  click issue4529 "https://gitlab.com/gitlab-org/gitaly/-/issues/4529" "Implement a custom DNS resolver working well with bult-in round robin load balancing in Go client"
  issue4529 --> issue3542
  issue4249["✅ #4249"]
  click issue4249 "https://gitlab.com/gitlab-org/gitaly/-/issues/4249" "Streaming RPCs return hard to debug EOF error to the client if the request is rejected by the server"
  issue3542["❌ #3542"]
  click issue3542 "https://gitlab.com/gitlab-org/gitaly/-/issues/3542" "Users receive intermittent failures when only one Praefect node cannot reach Gitaly"
```