Reduce Gitaly spawn timeout from 10 seconds

Overview

Gitaly has spawn tokens, which wrap Go's global fork lock (syscall.ForkLock). This lock synchronizes all fork() and exec() system calls of a Gitaly process. The lock is reportedly essential for how child processes inherit file descriptors.

The default timeout is 10 seconds, so a request can wait up to 10 seconds for a spawn token before we time out and return an Internal status code. At the moment our apdex threshold is 1s (tolerable), so a request that waits that long has already missed it by a wide margin. In the incidents below we see requests waiting 10 seconds for a spawn token and then returning an Internal error.

Recent incidents where this happened:

Action Items

Reduce spawn token timeout

When requests have to queue for a spawn token, it usually means there is a lot of resource contention, especially if we are hitting the timeout (10 seconds). When we are overloaded we should be more aggressive and shed as much load as possible rather than queue requests, because queueing extends the overload period and ends up making more requests slower.

The p95 and p99 below show that we usually don't queue at all, but when we do queue we end up reaching the 10s mark.

Screenshot_2023-05-05_at_10.07.40

source

Update the GITALY_COMMAND_SPAWN_TIMEOUT environment variable for Gitaly and reduce it to 3 seconds incrementally: first 7 seconds, then 5, then 3.

Return RESOURCE_EXHAUSTED instead of Internal 👉 gitlab-org/gitaly#5096 (closed)

It's important that we start signalling when the Gitaly server is under load, especially for Allow Gitaly to push back on traffic surges (gitlab-org&7891 - closed). At the moment we are returning Internal.

image

source