Replace InfoRefsUploadPack's custom caching with streamcache
During the incident review, we noticed some problems with InfoRefsUploadPack's caching mechanism, called diskcache (implemented in !1366 (merged)).
This mechanism lets the RPC generate the result once and reuse the cached artifact for subsequent requests. However, it has some notable weaknesses:
- When a mutator request arrives, a middleware (code) holds a per-repository lease. While the mutator request is in flight, all read requests bypass the cache completely. After it finishes, the cache is invalidated, and an arbitrary subsequent read request refills it.
- The output of info-refs is written into the cache file and streamed to clients at the same time via a Tee reader (source). This means the process hangs until the client finishes receiving the stream. Because disk caching is enabled for Geo requests only, clients are typically located far away from the Gitaly cluster; in the incident, a single RPC took dozens of minutes to complete.
- The cache file is committed only after the transmission completes, so there is no cache file until the first request finishes (which might span dozens of minutes). Other concurrent requests follow the same flow and spawn redundant processes.
These problems are illustrated in the following diagrams:
Cache bypass while a mutator holds the lease:

```mermaid
sequenceDiagram
actor Gitaly
actor request_1
actor request_2
actor request_3
actor request_4
request_1->>Gitaly: mutator, invalidate cache
request_2->>Gitaly: accessor
Gitaly->>request_2: no cache, return result from info-refs process
request_3->>Gitaly: accessor
Gitaly->>request_3: no cache, return result from info-refs process
Gitaly->>request_1: finish mutator
request_4->>Gitaly: accessor, no cache
Gitaly->>request_4: run process, write cache, return result
```
Redundant processes after the cache has just been invalidated:

```mermaid
sequenceDiagram
actor Gitaly
actor request_1
actor request_2
actor request_3
actor request_4
request_1->>Gitaly: cache miss
request_2->>Gitaly: cache miss
request_3->>Gitaly: cache miss
Gitaly->>request_1: run process, write cache, return result
Gitaly->>request_2: run process, overwrite cache, return result
Gitaly->>request_3: run process, overwrite cache, final cache, return result
request_4->>Gitaly: cache hit
Gitaly->>request_4: serve cache, return result
```
We have a metric called "loser" writes, which counts how many redundant cache writes are issued: only the latest (and slowest) process "wins" the cache. At peak, there were around 240k redundant writes.
When a wave of read requests from slow clients arrives just after the cache has been invalidated, Gitaly can enter a death spiral: the more requests come in, the harder they compete for resources, and the longer each cache generation takes. Once the number of processes reaches a certain point, the CPU saturates and no further processes can be spawned. The situation resolves only when clients hit their timeouts and cancel their requests.
While it's quite hard to control clients, I think we can make Gitaly more resilient to this traffic pattern. The disk-based caching package was written a long time ago, and we now have a more advanced caching mechanism called streamcache. It is already used for pack-objects operations, which are much heavier and longer-running than InfoRefs.
streamcache adds all requests with the same cache key to a waiting list, so the cache generation process runs only once: pending requests wait until that process completes and then consume the cached artifact. This prevents redundant processes from being triggered and thus eliminates process accumulation. The package also has a backpressure mechanism.
In summary, I recommend replacing InfoRefsUploadPack's custom disk caching with streamcache. It's a more sustainable long-term solution than chasing clients. We might still need to keep the mutator lease to invalidate the cache: unlike upload-pack, where clients specify which objects to fetch, InfoRefs returns all possible refs. I wonder if WAL could eliminate the need for active invalidation.
