Gitaly file descriptor leak incident 2019-07-31
Today we saw a performance degradation on file-23 (gitlab.com production). The issue was resolved by restarting Gitaly on that machine; we don't know yet what went wrong.
We found about 10,000 cat-file processes (counted with ps) dating back to 2019-07-18, when things went bad on this server. I think we forgot to check whether these came in git cat-file --batch + git cat-file --batch-check pairs (as expected) or were only one variety (batch or batch-check). We terminated these processes with SIGTERM, which is visible in the process_open_fds graph.
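The pairing check we skipped could be done along these lines next time. This is only a sketch, not something we ran during the incident: the helper names classify_catfile and list_processes are made up for illustration, and it assumes that each process shows up in ps with "cat-file" and either "--batch" or "--batch-check" as separate arguments.

```python
import subprocess
from collections import Counter


def classify_catfile(ps_lines):
    """Count cat-file processes by mode from lines of the form '<pid> <args...>'."""
    counts = Counter()
    for line in ps_lines:
        parts = line.split()
        # parts[0] is the PID; everything after it is the command line.
        args = parts[1:]
        if "cat-file" not in args:
            continue
        if "--batch-check" in args:
            counts["batch-check"] += 1
        elif "--batch" in args:
            counts["batch"] += 1
        else:
            counts["other"] += 1
    return counts


def list_processes():
    """Return one '<pid> <args...>' line per process, without a header row."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling classify_catfile(list_processes()) on the affected host gives a count per variety. If the "batch" and "batch-check" counts differ, the processes are not arranged in the expected pairs.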
Two suspicious metrics stood out:
- process_open_fds was abnormally large, and had been since 2019-07-18
- the gitaly_catfile_processes gauge was very high, and remained high even after we terminated 10,000 cat-file processes
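To cross-check process_open_fds against the operating system, the descriptors can be counted directly. The sketch below assumes a Linux host, where /proc/<pid>/fd holds one entry per open descriptor. As far as I know this is also where the Prometheus client gets process_open_fds from. The function name count_open_fds and the proc_root parameter are made up for illustration, and reading another process's fd directory needs the same user or root.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count entries in <proc_root>/<pid>/fd, one per open file descriptor."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return len(os.listdir(fd_dir))
```

Running this against the Gitaly PID before and after terminating the cat-file processes would show whether the descriptors are actually released, independently of the metrics.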
Edited by Jacob Vosmaer