
tableflip-free: Revisit client auto-retry window

In gitlab-com/gl-infra/production#18166 (closed), we rolled out tableflip-free to production. Unfortunately, we saw a large number of deadline-exceeded errors in Rails whenever Gitaly nodes restarted, and we had to roll back the change shortly afterward.

Two factors contributed to this:

  • Some nodes take more than 20 seconds to restart (gitlab-com/gl-infra/production#18166 (comment 1979668722)). This is exceptionally long. We suspect that an overly long graceful-shutdown timeout is set somewhere and stretches the restart time.
  • The retry logic holds a request for only about 2.4 seconds (1 + 3 attempts, defined here).

While the first factor is the direct cause of the long restart, the client side should handle this situation more resiliently. The client-side auto-retry window appears to be too short to ride out a network interruption of this length. We should also run tests in a local environment to verify how this auto-retry scheme behaves. This does not rule out implementing a thin application-layer exponential backoff as well.
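
To make the gap between the two numbers concrete, here is a minimal sketch that estimates how long a capped exponential backoff keeps a request alive, and how many retries it would take to cover a 20-second restart. The parameter values (400 ms initial delay, multiplier 2, 1200 ms cap) are hypothetical and were picked only so that three retries add up to the 2.4 seconds mentioned above; they are not taken from the actual retry policy. Jitter and per-attempt latency are ignored, so treat the results as rough upper bounds on the waiting time.

```python
def retry_window_ms(retries, initial_ms, multiplier, max_ms):
    """Total time spent waiting between attempts, without jitter."""
    total = 0
    delay = initial_ms
    for _ in range(retries):
        total += min(delay, max_ms)  # each delay is capped at max_ms
        delay *= multiplier
    return total


def retries_needed(target_ms, initial_ms, multiplier, max_ms, limit=1000):
    """Smallest number of retries whose combined delay reaches target_ms."""
    retries = 0
    total = 0
    delay = initial_ms
    while total < target_ms:
        if retries >= limit:
            raise ValueError("target not reachable within the retry limit")
        total += min(delay, max_ms)
        delay *= multiplier
        retries += 1
    return retries


# Hypothetical parameters, chosen only to reproduce the 2.4 s figure.
INITIAL_MS, MULTIPLIER, MAX_MS = 400, 2, 1200

print(retry_window_ms(3, INITIAL_MS, MULTIPLIER, MAX_MS))      # 2400
print(retries_needed(20_000, INITIAL_MS, MULTIPLIER, MAX_MS))  # 18
print(retries_needed(20_000, INITIAL_MS, MULTIPLIER, 8_000))   # 6
```

Under these assumed values, covering a 20-second restart with the same cap would take 18 retries, whereas raising the cap to 8 seconds brings that down to 6. This suggests that a larger maximum backoff matters more than simply adding attempts, which is worth confirming in the local tests.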
