server: Give clients grace-period for keepalives

Will Chandler (ex-GitLab) requested to merge wc-reduce-min-keepalive into master

Currently gRPC clients are configured to send keepalives every 20 seconds, and the servers enforce that same value as the minimum interval. Requiring precisely the same interval on both sides has caused a number of problems over time. gRPC-Core has an issue where keepalives may be sent a few milliseconds ahead of schedule, breaking keepalives from Puma and Sidekiq.

With the introduction of SSHUploadPackWithSidechannel in v15.0, the gRPC connection for Gitlab-Shell and Gitlab-Workhorse is now idle during lengthy clones, triggering keepalives in this scenario. Previously the connection was always busy and keepalives were rare.

After upgrading to v15.0, a large customer reported that they had started to receive intermittent ENHANCE_YOUR_CALM errors on SSH clones, which indicates that keepalives are being sent too rapidly by the client. Examining packet captures from their hosts, we found that Gitlab-Shell was sending keepalive packets roughly 20,001ms apart, which is correct. However, the Gitaly server was intermittently receiving these at an interval of 19,998ms, triggering the error. This appears to be due to variable connection latency: there is typically 50ms of latency between the hosts, but if a keepalive sees slightly less latency than the previous one, it arrives early from the server's perspective.

To resolve this, let's give the clients a grace-period of 10 seconds by reducing the minimum keepalive interval to 10 seconds. This way a keepalive that arrives slightly ahead of schedule is not a critical error.
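As a rough sketch of what this looks like with grpc-go's keepalive options (this is not the actual diff from this merge request): the function names `serverKeepaliveOption` and `clientKeepaliveOption` are hypothetical, and the `PermitWithoutStream` and `Timeout` settings are assumptions rather than values taken from the description.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// serverKeepaliveOption (hypothetical name) lets clients ping as often as
// every 10 seconds, leaving a 10-second grace period below the 20-second
// interval the clients are configured to use.
func serverKeepaliveOption() grpc.ServerOption {
	return grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime:             10 * time.Second,
		PermitWithoutStream: true, // assumption: idle connections may still ping
	})
}

// clientKeepaliveOption (hypothetical name) keeps the client sending
// keepalives every 20 seconds, as described above.
func clientKeepaliveOption() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                20 * time.Second,
		Timeout:             10 * time.Second, // assumption: not stated in the description
		PermitWithoutStream: true,             // assumption
	})
}

func main() {
	// Constructing a server with the relaxed policy; no listener is started here.
	_ = grpc.NewServer(serverKeepaliveOption())
	_ = clientKeepaliveOption()
}
```

Keeping the client interval at 20 seconds and lowering only the server's minimum means the keepalive traffic itself does not change; only the server's tolerance for early arrivals does.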

Closes #4397 (closed)

Edited by Will Chandler (ex-GitLab)
