Gitaly timeouts should take the current request duration into account and not exceed 55s
Currently, Gitaly has a 55s timeout on web requests.
In #64 (comment 253451181) it became clear that what frequently happens is that the cumulative time spent in Gitaly requests exceeds the 60s unicorn timeout, so the unicorn request is killed with no trace of the badly performing request.
For example, a request could have already run for 40s and then issue another Gitaly request, which is given a 30s timeout. However, once the total reaches 60s, the unicorn worker killer kills the process. No metrics or logs are recorded, making this shadow event difficult to track down at a later stage.
For this reason, it would be much better to ensure that the timeout on a Gitaly request never lets the total request time go past the 55s limit.
In the example above, after 40s, the maximum timeout granted to a Gitaly call (and ideally pg queries) should never exceed 15s.
Likewise, if a Gitaly call is attempted after 55s, the Gitaly client should immediately error.
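The rule described above can be sketched as follows. This is only an illustration of the arithmetic, not the actual Gitaly client code: the function name `call_timeout`, the `DeadlineExceeded` exception, and the 30s default per-call timeout (taken from the example above) are all assumptions made for the sketch.

```python
import time

# Assumed values for illustration: the 55s budget comes from this issue,
# the 30s per-call default is the one used in the example above.
REQUEST_BUDGET_S = 55.0
DEFAULT_CALL_TIMEOUT_S = 30.0


class DeadlineExceeded(Exception):
    """Hypothetical error raised when the request budget is already spent."""


def call_timeout(request_start, now=None,
                 default=DEFAULT_CALL_TIMEOUT_S, budget=REQUEST_BUDGET_S):
    """Return the timeout to grant the next Gitaly call, in seconds.

    request_start and now are readings of the same monotonic clock.
    """
    if now is None:
        now = time.monotonic()
    remaining = budget - (now - request_start)
    if remaining <= 0:
        # Budget exhausted: fail immediately instead of starting the call,
        # so the error is recorded before the worker is killed.
        raise DeadlineExceeded(
            f"request has already run for {now - request_start:.1f}s"
        )
    # Never grant more than what is left of the request budget.
    return min(default, remaining)
```

With these assumed values, a call issued 40s into the request gets `call_timeout(0.0, now=40.0) == 15.0`, a call issued 10s in keeps the full 30s default, and a call attempted at 55s or later raises straight away.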
cc @zj-gitlab @jacobvosmaer-gitlab @pokstad1 @johncai
Rollout plan for production:
I'll be watching this dashboard:
- We expect no, or fewer, unicorn restarts
- We expect no dip in successful Gitaly calls
- We expect an increase in Gitaly Deadline exceeded errors, but not a problematic one. (#98 (closed))
I'll be enabling the `request_deadline` feature in the following steps:
| Start time (UTC) | % of requests | Duration | Notes |
|---|---|---|---|
| 2020-01-21 11:06 | 1 | 15m | |
| 2020-01-21 11:20 | 10 | 15m | |
| 2020-01-21 11:36 | 20 | 1h | No issues discovered, so ramping up a bit faster |
| 2020-01-21 12:15 | 50 | 3h | No issues discovered, ramping to 100, to see if we can see the desired effect |
| 2020-01-21 13:05 | 100 | until my end of day | |
| Dawn of the 2nd day | 100 | keep the flag on | |
Older attempts
| Start time (UTC) | % of requests | Duration | Notes |
|---|---|---|---|
| 2020-01-16 11:00 | 1 | 15m | Stopped the experiment after 4 minutes after seeing this error https://sentry.gitlab.net/gitlab/gitlabcom/issues/1178268 (#104 (closed)) |
| | 10 | 15m | |
| | 20 | 1h | |
| | 50 | 3h | |
| | 100 | until my end of day | |
| Dawn of the 2nd day | 100 | keep the flag on | |