Incident Review for intermittent CPU saturation on gitaly node file-45 on 2020-03-16
Incident being reviewed: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1774
Related incident with the same root cause: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1731
These incidents are confidential, to protect the privacy of the projects involved.
Summary
- Service(s) affected : Gitaly
- Minutes downtime or degradation : Several 10-25 minute periods of slowness, summing to 195 minutes spread across 11 days
One of the Gitaly shards (file-45-stor-gprd) reached CPU saturation, leading to increased slowness and a higher rate of errors (timeouts). Other Gitaly shards were not affected.
The CPU saturation was triggered by many clients concurrently running git fetch or git clone on a repo that has a large graph of commit objects to traverse. To generate the inventory of objects that the client needs, Gitaly runs git pack-objects, which for this repo requires significant CPU time and memory. Enough of these helper processes were running that the host starved for CPU time, resulting in the observed increased latency and error rate (timeouts) from this particular Gitaly shard.
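The server-side work can be reproduced by hand. The sketch below is illustrative only (it uses a throwaway repo, not the incident repo): it feeds a rev to git pack-objects the same way git-upload-pack does, which is the CPU- and memory-hungry step on a repo with a large object graph.

```shell
# Illustrative sketch: the object-graph walk that git-upload-pack delegates
# to git pack-objects. Uses a throwaway repo; a real fetch against the
# triggering repo walks a vastly larger commit graph.
REPO=$(mktemp -d)
git init -q "$REPO"
git -C "$REPO" -c user.email=ir@example.com -c user.name=ir \
    commit -q --allow-empty -m init

# With --revs, pack-objects reads rev-list arguments on stdin and writes a
# packfile to stdout -- the same invocation pattern seen in the top output below.
printf 'HEAD\n' \
  | git -C "$REPO" pack-objects --revs --thin --stdout --delta-base-offset \
  > "$REPO/demo.pack"

head -c 4 "$REPO/demo.pack"   # a packfile always starts with the bytes "PACK"
```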
For reference, this incident ties back to other similar regressions on this particular Gitaly shard starting on 2020-03-05. Each regression event caused roughly 10-25 minutes of degraded performance on 1 of the 52 Gitaly shards (file-45). Now that we understand the pathology, we can link it back to the following regression events:
- 2020-03-16: 3 events, 65 minutes
- 2020-03-13: 1 event, 20 minutes
- 2020-03-12: 3 events, 60 minutes
- 2020-03-09: 1 event, 10 minutes
- 2020-03-05: 2 events, 40 minutes
Mitigations
The owner of the repo whose distributed builds triggered this regression has kindly offered to adjust their build to reduce its impact. Reusing fresher git-clones on the build servers may help, and reducing concurrency would definitely help.
On the GitLab server side, Gitaly needs to avoid saturation by limiting resource usage by these helper processes. Both CPU time and memory are relevant to this problem.
One option is to limit the concurrency of the gRPC PostUploadPack calls, which spawn these git pack-objects helper processes. (For context, the git client's fetch, pull, clone, etc. subcommands send an HTTP POST request to the [repo-url]/git-upload-pack endpoint, which causes Rails to send Gitaly a PostUploadPack gRPC call.) Using an instance-wide limit in combination with a smaller scoped limit per namespace or per project would allow other projects to continue functioning while the busy peer is rate-limited.
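Gitaly's config.toml already exposes a per-repo concurrency limit on individual RPCs, which is one plausible shape for this mitigation. A sketch of what such a limit might look like (the RPC path is Gitaly's PostUploadPack; the value 20 is purely illustrative, not a tuned recommendation):

```toml
# Hypothetical tuning in /var/opt/gitlab/gitaly/config.toml:
# allow at most 20 concurrent PostUploadPack calls per repository,
# queueing the rest instead of spawning more git pack-objects helpers.
[[concurrency]]
rpc = "/gitaly.SmartHTTPService/PostUploadPack"
max_per_repo = 20
```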
Another possibility is using cgroups to impose a hard limit on resource usage by particular types of helper processes. However, under certain workload patterns, that could result in starving all tenants of a Gitaly shard. This may be better suited as a second line of defense, with the application-based concurrency limits acting as the primary defense with more graceful degradation behavior.
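As a sketch of the cgroup approach (hypothetical names and limits, cgroup v2, requires root; not a production recipe), a hard aggregate cap on the helper processes might look like:

```shell
# Hypothetical hard cap: all pack-objects helpers share at most 8 CPUs total.
# Paths and values are illustrative; Gitaly would attach helpers on spawn.
mkdir /sys/fs/cgroup/gitaly-helpers
echo "800000 100000" > /sys/fs/cgroup/gitaly-helpers/cpu.max    # 800ms CPU per 100ms period
echo "$HELPER_PID" > /sys/fs/cgroup/gitaly-helpers/cgroup.procs # enroll a helper PID
```

Note the failure-mode tradeoff described above: under this cap, a burst of requests to one hot repo throttles all helpers in the group, including those serving other tenants.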
Impact & Metrics
- What was the impact of the incident? Slowness and intermittent errors for operations involving git repos stored on Gitaly shard file-45.
- Who was impacted by this incident? All customers with git repos stored on Gitaly shard file-45 (roughly 2% of the repos on GitLab.com).
- How did the incident impact customers? Git operations would have been slow enough to sometimes time out.
- How many attempts were made to access the impacted service/feature? Routine traffic over a 20 minute timespan
- How many customers were affected? Approximately 2% of repos were slow to access (sometimes up to timing out). Unknown how many customers actively experienced this slowness.
- How many customers tried to access the impacted service/feature? See above.
Dashboard "gitaly: Overview"
Latency apdex for the Gitaly service as a whole (i.e. not just the degraded shard) shows a drop of at most 2% during the regressions on 2020-03-16. Corresponding spikes in the Gitaly error rate (due to timeouts) reached as high as 0.6%. A similar pattern appeared on the earlier dates to which we later traced this pathology.
Latency
Error rate
Dashboard "Host Stats"
The host-level resource usage metrics for the affected Gitaly node (file-45) clearly show the CPU usage saturation and memory usage increase.
Network egress throughput spiked as a result of sending clients the git objects they needed, but it was nowhere near the saturation point. Disk I/O (not shown) was not significantly affected: read throughput increased moderately, but most reads were satisfied by the filesystem cache, with physical disk I/O rising only as the cache was eroded by the anonymous memory allocations of the numerous git pack-objects processes.
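That cache erosion is directly visible in /proc/meminfo: as the helpers' anonymous allocations grow, AnonPages rises while Cached shrinks. A quick way to watch both on any Linux host (illustrative; values shown will be whatever your host reports):

```shell
# Page cache vs anonymous memory: during the regression, AnonPages grew
# (pack-objects working sets) while Cached shrank (evicted filesystem cache).
grep -E '^(Cached|AnonPages):' /proc/meminfo
```

Sampling this in a loop alongside the host-stats dashboard makes the tradeoff between the two memory pools easy to see in real time.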
CPU usage
Memory usage
Network usage
Command line list of processes
top shows the CPU usage was predominantly from the git pack-objects processes spawned by Gitaly, rather than by threads within Gitaly itself. Note that each of these processes holds a significant amount of resident memory beyond its shared memory.
Also, each of these git pack-objects processes has accumulated many seconds of CPU time (some have several minutes of CPU time). Follow-up analysis showed that these processes tended to have 40-60% of wall clock time spent on-CPU, at least during the regression; we suspect these processes are entirely CPU-bound and that their off-CPU time may have just been waiting for their next time slice, due to CPU starvation.
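The 40-60% on-CPU figure comes from comparing each process's accumulated CPU time against wall-clock time. A minimal sketch of that measurement for any PID (assumes a Linux /proc filesystem and a process name without spaces, so the stat fields index cleanly):

```shell
# Estimate a process's on-CPU fraction over a sampling window, using
# utime+stime (fields 14 and 15 of /proc/PID/stat, in clock ticks).
cpu_frac() {
  local pid=$1 window=$2 hz t0 t1
  hz=$(getconf CLK_TCK)
  t0=$(awk '{print $14 + $15}' "/proc/$pid/stat")
  sleep "$window"
  t1=$(awk '{print $14 + $15}' "/proc/$pid/stat")
  # 100% means the process was on-CPU for the entire window.
  awk -v d="$((t1 - t0))" -v hz="$hz" -v s="$window" \
      'BEGIN { printf "%.0f%%\n", 100 * d / (hz * s) }'
}

cpu_frac $$ 1   # this mostly-idle shell should report roughly 0%
```

During the regression, running this against the pack-objects PIDs from top showed the 40-60% figure cited above, despite the processes being logically CPU-bound, which is what pointed us at run-queue starvation.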
These git pack-objects processes were mostly operating on the same git repo directory (not shown), which is one of the ways we traced the regression back to its triggering conditions.
Top processes using CPU time
top - 20:03:16 up 67 days, 23:14, 2 users, load average: 82.07, 71.93, 43.87
Tasks: 972 total, 75 running, 700 sleeping, 0 stopped, 30 zombie
%Cpu(s): 94.9 us, 4.6 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
KiB Mem : 12376827+total, 18716604 free, 71279952 used, 33771716 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 49609072 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30400 git 20 0 12.944g 2.407g 18684 S 87.3 2.0 3104:23 /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
24383 git 20 0 3666708 0.985g 724520 R 55.9 0.8 0:11.10 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
18792 git 20 0 4010080 1.421g 910388 R 52.3 1.2 0:44.75 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
15964 git 20 0 4048068 1.616g 995520 R 52.0 1.4 1:06.52 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
17808 git 20 0 4014652 1.480g 945304 R 51.3 1.3 0:50.92 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
19394 git 20 0 3912712 1.389g 891808 R 50.3 1.2 0:41.27 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
10993 git 20 0 4409636 3.195g 2.279g R 50.0 2.7 1:36.05 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
16897 git 20 0 4531220 2.765g 1.732g R 47.4 2.3 3:41.96 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
13408 git 20 0 4305772 1.813g 0.988g R 47.1 1.5 1:24.43 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
14087 git 20 0 4264240 1.769g 0.983g R 47.1 1.5 1:19.99 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
14833 git 20 0 4264240 1.758g 0.979g R 46.7 1.5 1:18.16 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
14516 git 20 0 4531220 2.773g 1.739g R 46.1 2.3 3:51.42 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
19094 git 20 0 4010080 1.418g 909592 R 46.1 1.2 0:43.87 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
4721 git 20 0 4431444 3.173g 2.237g R 45.8 2.7 1:43.69 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
14081 git 20 0 4531220 2.923g 1.888g R 45.4 2.5 3:57.74 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
16557 git 20 0 4035388 1.556g 971636 R 45.1 1.3 0:59.52 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
21825 git 20 0 3852644 1.219g 824620 R 45.1 1.0 0:26.51 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
12431 git 20 0 4406380 3.212g 2.300g R 44.4 2.7 1:32.37 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
23211 git 20 0 3769068 1.121g 776424 R 44.4 0.9 0:18.12 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
16331 git 20 0 4036872 1.577g 980652 R 44.1 1.3 1:01.84 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
24308 git 20 0 4495184 2.531g 1.533g R 44.1 2.1 3:13.28 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
15668 git 20 0 4049636 1.626g 976.6m R 43.8 1.4 1:07.32 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
17750 git 20 0 4021540 1.504g 953560 R 43.8 1.3 0:52.94 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
5924 root 20 0 775092 211420 12276 S 43.1 0.2 1190:39 /opt/td-agent/embedded/bin/ruby -Eascii-8bit:ascii-8bit /opt/td-agent/embedded/bin/fluentd --log /var/log/td-agent/td-agent.log --daem+
12601 git 20 0 4531220 2.895g 1.860g R 42.8 2.5 3:58.59 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
18466 git 20 0 4011404 1.438g 919280 R 42.5 1.2 0:46.23 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
19942 git 20 0 3906556 1.356g 876976 R 42.5 1.1 0:37.52 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
17034 git 20 0 4035388 1.551g 969276 R 42.2 1.3 0:58.61 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
13404 git 20 0 4305772 1.803g 0.985g R 41.8 1.5 1:22.21 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
15915 git 20 0 4047368 1.610g 991744 R 41.5 1.4 1:05.33 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
19079 git 20 0 3912712 1.386g 890100 R 41.2 1.2 0:42.30 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
2105 git 20 0 3624628 3.441g 3600 R 39.2 2.9 2:59.31 /opt/gitlab/embedded/libexec/git-core/git unpack-objects --pack_header=2,33 --strict --max-input-size=10485760000
18049 git 20 0 4018152 1.483g 943724 R 39.2 1.3 0:50.88 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
22826 git 20 0 4495184 2.525g 1.528g R 38.9 2.1 3:18.47 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
16607 git 20 0 4031684 1.536g 963384 R 38.6 1.3 0:57.23 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
21679 git 20 0 3854772 1.221g 823376 R 38.6 1.0 0:28.12 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
23295 git 20 0 4495184 2.544g 1.547g R 38.6 2.2 3:17.52 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
16630 git 20 0 4035388 1.569g 978100 R 38.2 1.3 1:00.71 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
19127 git 20 0 3913500 1.391g 890976 R 37.9 1.2 0:42.71 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
12430 git 20 0 4414044 3.219g 2.300g R 37.6 2.7 1:32.96 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
17812 git 20 0 4013020 1.475g 944612 R 37.6 1.2 0:50.06 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
17967 git 20 0 4016844 1.482g 943656 R 37.6 1.3 0:50.65 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
13384 git 20 0 4305772 1.812g 0.987g R 37.3 1.5 1:23.82 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
14866 git 20 0 4264240 1.716g 985.4m R 37.3 1.5 1:11.54 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
15103 git 20 0 4264240 1.722g 989.3m R 37.3 1.5 1:12.10 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
7408 git 20 0 2235824 870708 578048 R 36.9 0.7 1:44.94 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --progress --delta-base-offset --include-tag
7871 git 20 0 4409636 3.151g 2.235g R 36.9 2.7 1:37.31 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
22605 git 20 0 4495184 2.534g 1.536g R 36.9 2.1 3:21.44 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset
...
CPU profile of Gitaly host file-45 during the regression
A CPU profile of all on-CPU processes shows that the git pack-objects processes dominated CPU time, and that they collectively spent most of it computing the difference between the objects the client request says it already has and the objects the server-side bare repo can offer. This suggests the pathology is most likely to affect git repos with a large, complex object graph to compare.
perf profile of on-CPU processes' stacks during the regression
msmiley@file-45-stor-gprd.c.gitlab-production.internal:~$ sudo perf record --freq 99 -a -g -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 5.118 MB perf.data (31680 samples) ]
msmiley@file-45-stor-gprd.c.gitlab-production.internal:~$ sudo perf script --header > perf-script.during_cpu_saturation_by_many_git_pack_objects_pids.out
msmiley@saoirse:~/src/git/public/FlameGraph [master|✔] $ cat /tmp/perf-script.during_cpu_saturation_by_many_git_pack_objects_pids.out | ./stackcollapse-perf.pl | ./flamegraph.pl > /tmp/file-45.during_git_pack_objects_cpu_saturation.svg
Complete SVG of flame graph: file-45.during_git_pack_objects_cpu_saturation.svg
Detection & Response
Start with the following:
- How was the incident detected? PagerDuty alerts such as here and here: "Gitaly error rate is too high"
- Did alarming work as expected? Yes.
- How long did it take from the start of the incident to its detection? 7 minutes.
- How long did it take from detection to remediation? The regressions self-resolved roughly 10 minutes after the alert triggered. Analysis of the regression continued after incidents resolved, and once we discovered enough about the pathology and triggering conditions, we reached out to the owner of the triggering repo. We temporarily blocked the behavior for that repo to protect other customers from the side-effects, and worked with the customer to understand their use-case and mitigate the impact.
- Were there any issues with the response to the incident? No. We had all the necessary access to people and technology.
Root Cause Analysis
A Gitaly node saturated its CPUs and significantly changed its memory usage profile.
- Why? - More CPU-bound, memory-hungry processes were spawned than there are CPUs available on the host.
- Why? - Gitaly received numerous PostUploadPack gRPC calls from clients running git fetch (or similar), and Gitaly processed too many of them concurrently.
- Why? - Most repos are not prone to this pathology, so it has not yet been a focus of attention for Gitaly tuning.
- Why? - To trigger the pathology, the repo being accessed must have a very large object graph to traverse when composing a response to the client's git fetch (more specifically, the git client's HTTP POST to [repo_path]/git-upload-pack). Even for repos that meet this condition, over 30 client requests to the repo must be running concurrently to cause a regression on the Gitaly node.
- Why? - The Gitaly server does not have enough CPU and memory capacity to handle workloads beyond the above specifications. Once the saturation point is reached, all other Gitaly operations begin to slow down due to CPU starvation, with the degree of effect proportional to the number of CPU-bound processes running.
What went well
- Cross-team collaboration was fantastic!
- IMOC and CMOC did a wonderful job (as usual!) of facilitating the investigation, getting help where needed, doing regular pulse checks, and guiding the balance of what info was ok to release publicly (as we were unsure at first whether this was a precursor to a malicious attack or an accidental outcome of a legitimate use-case).
- Alerting worked appropriately.
- Gitaly's and Rails' event logging to Elasticsearch gave us excellent observability into the app behavior and client request profile.
- The CPU profiling tools (perf and BCC) gave us a clearer picture of what the CPU time was being spent on and which of the numerous git repos were associated with the bulk of the activity.
What can be improved
- To prevent this from happening again, we need to prevent Gitaly from spawning enough git processes to saturate machine resources such as CPU and memory. Gitaly already supports some forms of concurrency limiting, and tuning those may be sufficient. Additional layers of protection (e.g. cgroups) could also be implemented, although we should consider the effects of saturation in each case and choose the most desirable failure mode for the health of the service as a whole. (For example, we prefer slowness over downtime, downtime over possible corruption, and automatic over manual recovery.)
- Each of the incidents in this 11-day series was of low enough severity (S3) that we did not take the time to do root cause analysis until we noticed they were part of a recurring pattern. Our on-call engineers field enough alerts, questions, and interruptions that we rarely have time to analyze self-recovering regressions like this. This may be worth discussing as an iterative process improvement for incident triage, or it may be an acceptable cost of our existing prioritization scheme.
Corrective actions
- Prevent Gitaly from running too many git pack-objects processes.
- Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9554
- Estimated date of completion: TBD
- Owner: TBD
- For short-term mitigation, add support for removing user access to CI/CD without having to fully block the user.
- Issue: gitlab-org/gitlab#35346
- Estimated date of completion: TBD
- Owner: TBD