Don't limit number of Gitaly client keepalives

Will Chandler requested to merge wc-gitaly-keepalive-limit into master

What does this MR do and why?

Long-running RPCs, such as ForkRepository, may take several hours to complete. While Sidekiq waits for the RPC to complete, it should send keepalive pings to Gitaly/Praefect to prevent load balancers from killing the connection. However, the default value for GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA is only 2, with pings sent at 5-minute intervals.

As a result, Sidekiq will only send keepalives for the first 5 minutes, then leave the connection idle for up to 6 hours, putting long-running RPCs at risk of failure.

The gRPC keepalive docs mention that the default value for this setting can prevent keepalives from being sent.

This MR sets GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA to 0, so Sidekiq can send an unlimited number of keepalives on RPCs in an idle state. Note that pings are still sent at 5-minute intervals with this change.
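
As a rough illustration of the setting described above, the sketch below builds a channel-argument hash of the kind a gRPC Ruby client accepts. The helper name `keepalive_channel_args` and the constant `KEEPALIVE_INTERVAL_MS` are hypothetical and are not taken from this MR's diff. The string keys are the gRPC core names behind GRPC_ARG_KEEPALIVE_TIME_MS and GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA.

```ruby
# Hypothetical helper for illustration only; not the code changed by this MR.
KEEPALIVE_INTERVAL_MS = 5 * 60 * 1000 # 5-minute ping interval, as described above

def keepalive_channel_args(interval_ms: KEEPALIVE_INTERVAL_MS)
  {
    # String form of GRPC_ARG_KEEPALIVE_TIME_MS: how often to send a keepalive ping.
    'grpc.keepalive_time_ms' => interval_ms,
    # String form of GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA: 0 removes the limit
    # on pings sent while no data is flowing on the connection.
    'grpc.http2.max_pings_without_data' => 0
  }
end

args = keepalive_channel_args
puts args.inspect
```

A hash like this would typically be passed through the `channel_args:` keyword when constructing a gRPC stub. Where exactly GitLab's Gitaly client assembles its channel arguments is not shown here.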

Screenshots or screen recordings

Before

Sidekiq sends one ping at 13:53:48, then leaves the connection idle. 30 minutes later, HAProxy kills the connection:

image

After

image

Sidekiq sends a ping every 5 minutes.

How to set up and validate locally

Example below:

  1. Set up a 3k reference environment with a Gitaly Cluster
  2. Ensure HAProxy timeout for Praefect is 30 minutes
  3. Import a large repo into the instance, such as Chromium or LLVM
  4. Fork the repo. At 35 minutes, the fork will fail when HAProxy kills the idle connection
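
The 35-minute failure time in the steps above follows from a single ping at 5 minutes plus the 30-minute HAProxy idle timeout. The sketch below works through that arithmetic under a deliberately simplified model in which only keepalive pings reset the proxy's idle timer. The function name `killed_at` and the 120-minute RPC duration are assumptions made for illustration.

```ruby
# Simplified model: the proxy drops the connection once `idle_timeout` minutes
# pass with no ping. Returns the minute of the drop, or nil if the RPC
# completes before the idle timer expires.
def killed_at(ping_minutes, idle_timeout:, rpc_duration:)
  last_activity = 0
  ping_minutes.sort.each do |t|
    break if t > rpc_duration
    # The gap before this ping already exceeded the timeout.
    return last_activity + idle_timeout if t - last_activity >= idle_timeout
    last_activity = t
  end
  deadline = last_activity + idle_timeout
  deadline <= rpc_duration ? deadline : nil
end

# Before: one ping at minute 5, then idle (assumed 120-minute fork).
before = killed_at([5], idle_timeout: 30, rpc_duration: 120)
# After: a ping every 5 minutes for the whole RPC.
after = killed_at((5..120).step(5).to_a, idle_timeout: 30, rpc_duration: 120)

puts "before: #{before.inspect}, after: #{after.inspect}"
```

In this model the "before" case is dropped at minute 35, matching step 4, while the "after" case never reaches the idle timeout.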

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Will Chandler
