# Reduce Gitaly spawn timeout from 10 seconds
## Overview
Gitaly has spawn tokens, a wrapper around Go's global fork lock (`syscall.ForkLock`). This lock synchronizes all fork() and exec() system calls of a Gitaly process, and it is claimed to be essential for correctly creating inherited file descriptors in child processes.
The default timeout is 10 seconds, so a request waits up to 10 seconds for a token before we time out and return an Internal status code. At the moment our apdex threshold is 1s (tolerable). In the incidents below we see spawn tokens waiting the full 10 seconds and then returning an Internal error.
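For illustration, here is a minimal Go sketch of the mechanism described above. The `spawnTokens` semaphore, its size, and `spawnWithToken` are hypothetical names for this sketch, not Gitaly's actual implementation:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// spawnTokens is a hypothetical semaphore bounding concurrent spawns;
// Gitaly's real implementation lives in its internal command package.
var spawnTokens = make(chan struct{}, 10)

// spawnTimeout mirrors the 10-second default discussed above.
const spawnTimeout = 10 * time.Second

// spawnWithToken waits for a token before forking, bounding contention on
// the global syscall.ForkLock that fork()/exec() take under the hood.
func spawnWithToken(ctx context.Context, name string, args ...string) (*exec.Cmd, error) {
	ctx, cancel := context.WithTimeout(ctx, spawnTimeout)
	defer cancel()

	select {
	case spawnTokens <- struct{}{}:
		// The token guards only the spawn itself; it is released when this
		// function returns, not when the child process exits.
		defer func() { <-spawnTokens }()
	case <-ctx.Done():
		// This is the path a request hits after queueing for the full
		// 10 seconds; today it surfaces as an Internal error.
		return nil, fmt.Errorf("waiting for spawn token: %w", ctx.Err())
	}

	// exec.Cmd.Start ultimately forks and execs under syscall.ForkLock.
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

func main() {
	cmd, err := spawnWithToken(context.Background(), "git", "--version")
	if err != nil {
		fmt.Println("spawn failed:", err)
		return
	}
	_ = cmd.Wait()
}
```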
Recent incidents where this happened:
## Action Items
### Reduce spawn token timeout
When we have to queue for a spawn token it usually means there is a lot of resource contention, especially if we are hitting the 10-second timeout. When we are overloaded we should be more aggressive and shed as much load as possible rather than queue the request, because queueing extends the overload period and ends up making even more requests slower.
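To make the fail-fast argument concrete, here is a small sketch with a shorter acquisition budget. The `spawnTokens` semaphore and the 2-second figure are illustrative assumptions, not Gitaly's code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var spawnTokens = make(chan struct{}, 10)

// errSpawnTokenTimeout signals that the server chose to shed load instead
// of letting the request sit in the queue for the full spawn timeout.
var errSpawnTokenTimeout = errors.New("spawn token not available")

// tryAcquireSpawnToken waits at most maxWait for a token. With a short
// budget, an overloaded server rejects quickly and recovers sooner, rather
// than stacking up queued requests that all eventually time out.
func tryAcquireSpawnToken(maxWait time.Duration) error {
	select {
	case spawnTokens <- struct{}{}:
		return nil
	case <-time.After(maxWait):
		return errSpawnTokenTimeout
	}
}

func releaseSpawnToken() { <-spawnTokens }

func main() {
	if err := tryAcquireSpawnToken(2 * time.Second); err != nil {
		fmt.Println("shedding load:", err)
		return
	}
	defer releaseSpawnToken()
	fmt.Println("token acquired, safe to fork/exec")
}
```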
Below we can see from the p95 and p99 that we usually don't hit any queueing, but when we do queue we end up reaching the 10s mark.
Update the `GITALY_COMMAND_SPAWN_TIMEOUT` environment variable for Gitaly and reduce it to 3 seconds incrementally, starting with 7 and then 5 (a sketch of reading this variable follows the list below).

- Set `GITALY_COMMAND_SPAWN_TIMEOUT` in gstg 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3432
- Set `GITALY_COMMAND_SPAWN_TIMEOUT` to 5s 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3434
- Set `GITALY_COMMAND_SPAWN_TIMEOUT` to 2s 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3492
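As a rough sketch of how such an environment override could be read with a fallback; treating the value as a whole number of seconds is an assumption of this sketch, so check Gitaly's actual parser before relying on the format:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// spawnTimeoutFromEnv reads GITALY_COMMAND_SPAWN_TIMEOUT, falling back to
// the given default when the variable is unset or unparsable. The
// integer-seconds format is an assumption for illustration.
func spawnTimeoutFromEnv(fallback time.Duration) time.Duration {
	raw := os.Getenv("GITALY_COMMAND_SPAWN_TIMEOUT")
	if raw == "" {
		return fallback
	}
	secs, err := strconv.Atoi(raw)
	if err != nil || secs <= 0 {
		return fallback
	}
	return time.Duration(secs) * time.Second
}

func main() {
	fmt.Println("spawn timeout:", spawnTimeoutFromEnv(10*time.Second))
}
```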
### Return RESOURCE_EXHAUSTED instead of Internal 👉 gitlab-org/gitaly#5096 (closed)
It's important that we start signalling when the Gitaly server is under load, especially for Allow Gitaly to push back on traffic surges (gitlab-org&7891 - closed); at the moment we are returning Internal, which gives clients no signal to back off.
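A sketch of the proposed mapping with `google.golang.org/grpc/status`; `spawnErrorToStatus` and the sentinel error are hypothetical stand-ins for Gitaly's internals:

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errSpawnTokenTimeout stands in for the error produced when a request
// gives up waiting for a spawn token.
var errSpawnTokenTimeout = errors.New("waiting for spawn token: deadline exceeded")

// spawnErrorToStatus maps spawn-token exhaustion to ResourceExhausted, a
// conventional overload signal clients can back off on, instead of
// Internal, which reads as a server bug and carries no back-pressure hint.
func spawnErrorToStatus(err error) error {
	if errors.Is(err, errSpawnTokenTimeout) {
		return status.Error(codes.ResourceExhausted, "process spawn queue full")
	}
	return status.Error(codes.Internal, err.Error())
}

func main() {
	fmt.Println(spawnErrorToStatus(errSpawnTokenTimeout))
}
```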

