Gitaly Timeouts

Gitaly has a strategy of "failing fast". The motivation: if an NFS (aka file) server becomes temporarily overloaded, a call that normally takes 1s could end up taking 40s.

In this case, it's better to fail fast on any requests to the "bad server" rather than queue all traffic to the entire site behind the slow requests to a single server. This means that although requests routed to the troubled server will still fail, requests will not back up as much and the site as a whole will be less affected.
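As a rough illustration of the fail-fast idea, the sketch below gives up on a call once a deadline passes instead of waiting for the slow server. It is a generic Python sketch, not how the GitLab client actually enforces its timeouts; `call_with_timeout` and `slow_call` are hypothetical names, and the durations are scaled down so the example runs quickly.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread; stop waiting after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        # Raises concurrent.futures.TimeoutError if fn has not finished in time.
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on the slow call; the worker thread finishes on its own.
        pool.shutdown(wait=False)


def slow_call():
    # Hypothetical stand-in for a request to an overloaded file server.
    time.sleep(0.3)
    return "slow response"


if __name__ == "__main__":
    try:
        call_with_timeout(slow_call, timeout_s=0.05)
    except concurrent.futures.TimeoutError:
        print("failed fast instead of waiting for the slow server")
```

The caller gets an error quickly and is free to serve other requests, which is the trade-off described above: the request to the troubled server still fails, but it does not hold up everything behind it.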

Anecdotally, it appears that since all feature flags on Gitaly have been enabled, GitLab.com has been much more susceptible to slow requests than before.

While investigating why this is, I became aware that reasonable timeout values are not being applied to new endpoints, meaning that they fall back to the default of 50s.

Using data gathered from https://log.gitlab.net/goto/80fbd8da819ce545247d7c8210886e44, I've taken the 99.99th percentile duration of each endpoint and used this to propose a set of new defaults: https://docs.google.com/spreadsheets/d/1UwWU-yB7AcoCM6VSjmCglijXSswIl3gJM3xUE3XFIB8/edit
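For anyone wanting to reproduce this kind of calculation, here is a minimal sketch of deriving a per-endpoint timeout from a high percentile of observed durations. It assumes the durations have already been exported as lists of seconds per endpoint. The function names, the nearest-rank percentile method, and the `headroom` multiplier are my assumptions for illustration, not the method used to build the spreadsheet.

```python
import math


def percentile(durations, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(durations)
    # Round before ceil so floating-point noise (e.g. 9999.000000001) does not bump the rank.
    rank = math.ceil(round(pct * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]


def propose_timeouts(samples_by_endpoint, pct=99.99, headroom=1.0):
    """Map each endpoint to its pct-th percentile duration, scaled and rounded up to whole seconds."""
    return {
        endpoint: math.ceil(percentile(durations, pct) * headroom)
        for endpoint, durations in samples_by_endpoint.items()
    }


if __name__ == "__main__":
    # Hypothetical endpoint name and made-up durations, in seconds.
    samples = {"ExampleService/ExampleCall": [0.1, 0.2, 0.2, 0.4, 1.2]}
    print(propose_timeouts(samples))
```

With only a handful of samples the 99.99th percentile is simply the maximum, so a calculation like this is only meaningful over a large volume of requests per endpoint.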

During a call, we also discussed dropping the timeouts altogether for Sidekiq, so that calls using the fast or medium timeout values will not time out when made from Sidekiq.
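The Sidekiq idea could be sketched roughly as follows. This is a hypothetical illustration only: the `effective_timeout` function, the `in_sidekiq` flag, and the fast and medium values are placeholders I made up, and keeping the default timeout in Sidekiq is my reading of the discussion rather than something that was decided. Only the 50s default comes from the text above.

```python
# Timeout classes in seconds. The fast and medium values are placeholders.
TIMEOUTS_S = {
    "fast": 10,     # placeholder value
    "medium": 30,   # placeholder value
    "default": 50,  # default mentioned earlier in this issue
}


def effective_timeout(timeout_class, in_sidekiq):
    """Return the timeout in seconds, or None when no timeout should apply."""
    if in_sidekiq and timeout_class in ("fast", "medium"):
        # Background jobs do not block a user request, so they can afford to wait.
        return None
    return TIMEOUTS_S[timeout_class]


if __name__ == "__main__":
    print(effective_timeout("fast", in_sidekiq=False))
    print(effective_timeout("fast", in_sidekiq=True))
```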
