2020-07-27: Memory and CPU saturation on Gitaly node file-42
Summary
Memory and CPU starvation on Gitaly node `file-42` appears to have been caused by a batch of `git` processes each holding 200 MB to 2 GB of memory. Some additional processes were also killed, but the `git` processes were by far the biggest memory consumers and were the cause of the abnormal spike in memory usage.
As follow-up to the incident, we profiled git processes that both allocated a large amount of memory and held that memory for a non-trivial amount of time. The results highlight that two git subcommands spawned from Gitaly are prone to large, long-lived memory allocations:
- `git pack-objects` (the child process underlying `git upload-pack`)
- `git archive`

During the incident itself, `git upload-pack` processes (and implicitly their `git pack-objects` children) dominated the host's memory usage. The post-incident memory usage profile summarized above is a lower-concurrency form of the profile observed during the incident (a rough sampling approach is sketched at the end of this summary). Improving this memory usage pattern should help reduce memory pressure in both the general case and the extreme case that led to this incident.
As a corrective action, we should explore one or both of the following:
- Reduce memory usage by `git pack-objects` and `git archive` processes.
- Decouple the lifespan of those processes from the end-user's client, so that low-bandwidth clients do not implicitly exacerbate server-side memory pressure.
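For reference, the kind of follow-up profiling described above can be approximated with ad-hoc sampling of process age and resident memory. This is a minimal sketch, not the exact tooling we used; the thresholds and interval are arbitrary assumptions.

```shell
# Every 10 seconds, list git processes that are both long-lived (>30 s)
# and holding a large resident set (>~200 MB). rss is reported in KiB.
while true; do
  date -u +%FT%TZ
  ps -eo pid,ppid,etimes,rss,args --no-headers \
    | awk '$5 ~ /git/ && $3 > 30 && $4 > 200000 { print }'
  sleep 10
done
```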
Timeline
All times UTC.
2020-07-27
- 08:24 - Gitaly node `file-42` abruptly saturates its CPU and memory.
- 08:29 - Gitaly node `file-42` returns to normal CPU and memory usage.
In addition to the CPU and memory usage mentioned above, the following symptoms occurred during the incident:
- The kernel out-of-memory killer was triggered 224 times over a 70-second timespan.
- The host's softirq rate dropped to practically nil. This typically corresponds to a drop in interrupt-driven work such as network packet handling.
- The host's physical disk throughput on `sdb` (where the git repos are stored) spiked to 10x the normal read ops/second rate (but no such spike occurred in read bytes/second). Extra disk I/O is a predictable side effect of the filesystem cache having been eroded (in this case by the memory-intensive git processes).
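To make the cache-erosion effect above concrete: the page cache size and the read activity on `sdb` can be sampled together. This is a generic illustration (assuming the `sysstat` package provides `iostat`), not the dashboard queries referenced below.

```shell
# Watch the page cache shrink while read activity on sdb climbs.
# The second iostat report covers a 10-second interval, so the loop paces itself.
while true; do
  printf '%s  page cache: %s kB\n' "$(date -u +%T)" "$(awk '/^Cached:/ {print $2}' /proc/meminfo)"
  iostat -dx 10 2 sdb | grep '^sdb' | tail -1
done
```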
Metrics
WARNING: Some of the log events presented in Kibana appear to have timestamps up to a few minutes offset from the raw kernel logs. This makes correlating events by timestamp more challenging.
Raw kernel logs: file-42-stor-gprd.20200727.kern.log.gz
Dashboard links:
- Grafana Host Stats dashboard - Shows host-level resource usage for `file-42`, zoomed in to the relevant timespan.
- Memory usage summary by git subcommand for this Gitaly node - Shows that during the memory starvation, `git upload-pack` processes (and their children, `git pack-objects`) collectively held over 80 GB of memory. This agrees reasonably well with the OOMK snapshot of per-process memory usage, and unlike the OOMK process listing, this view tallies by git subcommand rather than just showing each process as an unspecified git command.
- Kibana kernel log events for OOMK on Gitaly nodes - Shows which process was killed each time the OOMK was invoked, including how much virtual and physical memory that process had allocated at its time of death.
- Kibana log events for Gitaly errors near the time of the OOMK events - These may or may not be directly related to the memory and CPU saturation.
- Kibana log events for Workhorse - Shows incoming HTTP POST requests to `[repo_path]/git-upload-pack` whose duration is over 60 seconds. This typically correlates with large response payloads (field `json.written_bytes`).
CPU saturation
Memory saturation
Note: The memory cgroup for Gitaly-related processes is constrained to slightly less than the host's total physical memory capacity.
Memory usage was mostly anonymous (non-file-backed) allocations
This suggests the processes were mallocing a lot of private memory.
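For context, the anonymous vs. file-backed split charged to the Gitaly memory cgroup can be read from the cgroup's `memory.stat`. The path below assumes a cgroup v1 hierarchy and uses a placeholder cgroup name; the actual path on these hosts may differ.

```shell
# Resident anonymous (rss) vs. page-cache memory charged to the Gitaly cgroup.
# "/gitaly" is a placeholder for the real cgroup path on the host.
CG=/sys/fs/cgroup/memory/gitaly
grep -E '^(rss|cache|mapped_file) ' "$CG/memory.stat"
cat "$CG/memory.limit_in_bytes"   # the limit mentioned in the note above
```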
Since the OOMK events show that both `ruby` and `git` processes were holding a lot of resident anonymous memory, this suggests that some gRPC calls spawned `git` processes that ended up requiring a lot of memory. The kernel's log events do not include the specific details of the `git` and `ruby` command-line args.
We later determined that the killed `git` processes with a large memory footprint were `git pack-objects` commands spawned by `git upload-pack` commands, which in turn were created by `gitaly` to handle `PostUploadPack` or `SSHUploadPack` gRPC calls. The killed `ruby` processes were Gitaly's pool of ruby helpers -- they may have been unrelated to the increased memory pressure.
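Because the OOMK log lines only record the short command name, the parent/child relationship had to be established separately. A live snapshot along these lines (illustrative, not the exact command used) makes the ancestry and memory footprint visible:

```shell
# Show upload-pack/pack-objects processes as a tree with resident memory (KiB) and age.
ps -eo pid,ppid,rss,etimes,args --forest \
  | grep -E 'PID|git.*(upload-pack|pack-objects)'
```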
Disk throughput (ops/sec and bytes/sec)
During the memory saturation, the rate of physical read ops/second followed the same shape as the memory usage. However, the read bytes/second rate did not, suggesting that the burst of additional read ops were smaller than the typical op size (perhaps many small single-block reads rather than larger sequential reads of large files). This is presumably a side effect of the filesystem cache shrinking dramatically and thus having a much lower cache hit rate.
Also, a few minutes before the incident began, the write rate (both ops/sec and bytes/sec) spiked significantly. This may be completely unrelated to the incident, or it might have been a lead-up event (e.g. a large push to a repo that many clients subsequently fetched concurrently).
Incoming connections dropped
Gitaly normally receives roughly 1000 new connections per minute. During this event, roughly 170 incoming connection attempts were dropped, probably because Gitaly was unable to accept
new connections as quickly as they arrived while the host's CPU and memory resources were saturated.
Timestamps of the OOMK events as recorded in the raw kernel logs
This shows accurate timestamps for when memory pressure became severe enough that the kernel's out-of-memory killer had to activate and start choosing processes to kill. We can compare these timestamps with logs and metrics from other sources.
msmiley@file-42-stor-gprd.c.gitlab-production.internal:~$ sudo zcat /tmp/file-42-stor-gprd.20200727.kern.log.gz | grep 'Killed process' | wc -l
224
msmiley@file-42-stor-gprd.c.gitlab-production.internal:~$ sudo cat /var/log/kern.log | grep 'Killed process' | awk '{ print $1, $2, $3, $10 }' | uniq -c
6 Jul 27 08:24:52 (ruby)
10 Jul 27 08:24:53 (ruby)
4 Jul 27 08:24:54 (ruby)
5 Jul 27 08:24:54 (pre-receive)
7 Jul 27 08:24:54 (bundle)
2 Jul 27 08:24:54 (git)
1 Jul 27 08:24:55 (git)
1 Jul 27 08:24:55 (bundle)
8 Jul 27 08:24:55 (git)
10 Jul 27 08:24:56 (git)
8 Jul 27 08:24:57 (git)
9 Jul 27 08:24:58 (git)
6 Jul 27 08:24:59 (git)
1 Jul 27 08:24:59 (bundle)
1 Jul 27 08:25:00 (git)
1 Jul 27 08:25:00 (bundle)
7 Jul 27 08:25:00 (git)
7 Jul 27 08:25:01 (git)
1 Jul 27 08:25:01 (bundle)
1 Jul 27 08:25:02 (git)
1 Jul 27 08:25:02 (bundle)
3 Jul 27 08:25:02 (git)
1 Jul 27 08:25:02 (bundle)
9 Jul 27 08:25:03 (git)
19 Jul 27 08:25:04 (git)
26 Jul 27 08:25:05 (git)
26 Jul 27 08:25:06 (git)
5 Jul 27 08:25:07 (git)
4 Jul 27 08:25:07 (ps)
8 Jul 27 08:25:07 (git)
1 Jul 27 08:25:07 (gitaly)
1 Jul 27 08:25:08 (git)
1 Jul 27 08:25:11 (git)
1 Jul 27 08:25:12 (git)
2 Jul 27 08:25:13 (git)
1 Jul 27 08:25:14 (git)
3 Jul 27 08:25:21 (git)
1 Jul 27 08:25:25 (git)
2 Jul 27 08:25:26 (git)
2 Jul 27 08:25:28 (git)
1 Jul 27 08:25:29 (git)
1 Jul 27 08:25:35 (git)
2 Jul 27 08:25:37 (git)
2 Jul 27 08:25:39 (git)
2 Jul 27 08:25:41 (git)
2 Jul 27 08:25:43 (git)
1 Jul 27 08:26:08 (git)
Contrast with Kibana's timestamps for the same 224 events:
Tally memory usage by git-subcommand
The processes running `git upload-pack` were by far the dominant memory consumers in Gitaly's memory cgroup at the time of the OOMK incident. At peak, they held 85 GB out of the cgroup's 90 GB limit.
For reference, `git upload-pack` spawns a `git pack-objects` child process, which is what accumulates and holds a large memory footprint. The Prometheus process_exporter metric is presumably attributing the child process's memory usage to the parent process.
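This tally can be reproduced with a query against the process_exporter data in Prometheus. The metric and label names below follow the common process-exporter convention and are assumptions about our exact scrape configuration; the Prometheus URL and instance label are placeholders.

```shell
# Sum resident memory by process group (e.g. git upload-pack) for this host.
PROM_URL=http://prometheus.example.internal:9090
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum by (groupname) (namedprocess_namegroup_memory_bytes{memtype="resident", instance="file-42-stor-gprd:9256"})' \
  | jq '.data.result[] | {group: .metric.groupname, bytes: .value[1]}'
```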
Incident Review
Summary
See above.
- Service(s) affected: Gitaly
- Team attribution: Gitaly
- Minutes downtime or degradation: 5 minutes
Customer Impact
- Who was impacted by this incident?
  - Customers (both external and internal) whose git repos are stored on the affected Gitaly node (`file-42`).
- What was the customer experience during the incident?
  - Any kind of interaction with git repos on this node had a high likelihood of failure. Retrying after the incident should have succeeded.
- How many customers were affected?
  - Unknown.
- If a precise customer impact number is unknown, what is the estimated potential impact?
  - At most 2% of customers have repos on this Gitaly node, but in practice only a small percentage of those customers would have been active during the 5-minute incident timespan.
  - Most calls to this Gitaly node would have failed due to timeouts, and a small percentage would have had other errors.
  - This specific Gitaly node's apdex dropped to 20% during the incident and recovered immediately afterwards.
Incident Response Analysis
- How was the event detected?
  - Manual detection. Discovered while searching all hosts' kernel logs for OOMK events, while analyzing an unrelated problem involving memory saturation on a different host: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10886#note_386134858
- How could detection time be improved?
  - Alert whenever the kernel logs an OOMK event. This could take the form of an `mtail` Prometheus metric with an associated alerting rule; a rough stopgap sketch is shown after this list.
- How did we reach the point where we knew how to mitigate the impact?
  - The incident self-resolved, thanks to the kernel's OOMK.
  - For a terse summary of how we discovered what happened, what triggered it, and how to better mitigate it in the future, see the next section: "Post Incident Analysis", question 1.
- How could time to mitigation be improved?
  - Generally the kernel's OOM killer deals with memory starvation more quickly than a human admin could, but if we needed to deal with sustained saturation, we could try tactics such as imposing a timeout, blocking abusive clients, migrating targeted repos to an isolated quarantine Gitaly node (to reduce impact on other customers), caching git pack files, etc.
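As a rough illustration of the detection idea above (not a tested `mtail` program), a cron job could export a cumulative OOMK count through node_exporter's textfile collector until a proper mtail rule exists. The textfile directory and metric name are assumptions:

```shell
#!/usr/bin/env bash
# Count OOM-kill events in the kernel log and expose the count as a Prometheus metric.
# The textfile-collector directory below is a placeholder for the host's real path.
set -euo pipefail
count=$(grep -c 'Killed process' /var/log/kern.log || true)
cat > /var/lib/node_exporter/textfile_collector/oom_kills.prom <<EOF
# HELP node_oom_kills_logged Number of 'Killed process' lines currently in kern.log.
# TYPE node_oom_kills_logged gauge
node_oom_kills_logged ${count}
EOF
```

An alerting rule would then fire on any increase of this metric.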
Post Incident Analysis
- How was the root cause diagnosed?
  - The kernel logs for the OOMK events indicate that a large group of `git` processes were collectively responsible for saturating memory. Details: #2457 (comment 386385034)
  - The process_exporter metrics identified which git subcommand was associated with those memory-intensive git processes. Details: #2457 (comment 387363243)
  - Follow-up analysis of memory usage patterns under normal conditions for the affected host revealed a chronic problem: clients with slow Internet connections trying to git-fetch a large set of git objects can lead to an accumulation of these memory-hungry processes, which collectively leads to memory saturation as occurred during this incident. Details: #2457 (comment 387892195) and #2457 (comment 388444226)
  - Concluded that `git pack-objects` and `git archive` implicitly behave like an in-memory buffer, and that to reduce the timespan of their memory usage, we could let them spool to a local temp file rather than stream to a potentially very slow pipe. Details: #2457 (comment 388447144)
- How could time to diagnosis be improved?
  - No clear big improvements come to mind. The diagnosis felt fairly efficient this time -- most of the time was spent forming a richer mental model of the relevant behaviors through observation and testing. Once it became clear that the lifespan of the bloated process was indirectly controlled by the end-user's network bandwidth, the problem and solution became clear.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
  - No, not to my knowledge.
  - The problem is conceptually similar to other recent discussions of the trade-offs of buffering HTTP request/response payloads. The bloated git process is not really buffering, but the effect is similar because it does not bother to conserve memory by deallocating or recycling its existing allocations.
- Was this incident triggered by a change?
  - No, it was triggered by normal functionality. This incident exposes a drawback of a design choice in Gitaly that we may want to make configurable.
5 Whys
- Gitaly node `file-42` abruptly grew its memory usage to the point of saturation. Why?
  - Many clients tried to `git fetch` a large amount of data from a single git repo.
  - Each of those fetches spawned a pair of processes: `git upload-pack` and `git pack-objects`.
  - Each `git pack-objects` process holds a lot of memory for a long time.
- Why did the `git pack-objects` processes allocate a lot of memory?
  - The repo includes a 4 GB pack file, and the clients needed that pack file to satisfy their fetch requests.
  - Including that pack file in the response to the client implicitly requires reading it into memory, hence the large memory allocation. (A quick way to spot such oversized packs is sketched after this list.)
- Why did the `git pack-objects` processes hold that memory for so long?
  - The `git pack-objects` process streams its output to its parent process (`git upload-pack`), which streams it to Gitaly, which streams it up the call chain (through workhorse, nginx, haproxy, and cloudflare). None of those layers buffers the entire response payload, so the origin of the response data -- the `git pack-objects` process -- must pause until its downstream can accept more data. While it sleeps, its memory allocation remains in place, contributing to the overall memory pressure.
- Why do none of the downstream layers buffer the complete response?
  - Doing so would bloat their memory footprint, making them susceptible to the same problem we see here.
  - Cleaning up bloat from in-memory buffering is harder in long-lived processes, which have to deal with a menagerie of garbage-collection challenges. Spooling to temp files on disk is in many cases a cheaper and more scalable alternative.
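As a quick, illustrative way to spot repos whose packs are large enough to trigger this pattern (paths are placeholders, following the `[repo_dir]` convention used elsewhere in this issue):

```shell
# Largest pack files in a repo's object store.
du -h [repo_dir].git/objects/pack/*.pack | sort -h | tail -5
# Or summarize with git itself (size-pack shows the total pack size):
git -C [repo_dir].git count-objects -v -H
```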
Lessons Learned
- `git upload-pack` and its child `git pack-objects` can cause severe memory pressure under certain conditions. This incident improved our understanding of those conditions, and we have a few ideas for mitigations.
- Other git subcommands (e.g. `git archive`) may behave similarly, where the size and duration of the memory footprint largely depend on uncontrolled factors such as the client's network bandwidth and the size of the git objects being transported.
Corrective Actions
To solve the problem, we need to do one or both of the following:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11019 - Help git use less memory when composing a pack file.
  - Consider tuning pack.windowMemory to constrain the per-thread memory limit of each `git pack-objects` process. We will need to test this on a repo with pre-existing large pack files, assuming that pack.allowPackReuse is enabled (default: true). A tuning sketch follows this list.
  - Consider ensuring that all saved pack files in `[repo_dir].git/objects/pack/` have an associated bitmap index. If pack.packSizeLimit is set to constrain the max size of server-side pack files, a bitmap will not be generated if the pack must be split across multiple files. See also pack.useBitmaps.
- gitlab-org/gitaly#3003 (closed) - Decouple the lifespan of the `git pack-objects` process from the speed of the client's network connection by buffering the output somewhere in the pipeline.
  - Consider spooling the output of `git upload-pack` to a file that Gitaly reads, so that the git processes exit as quickly as possible. This carries trade-offs briefly described here and may not be appropriate for all GitLab deployments. Even for GitLab.com, we may want to configure a separate filesystem for storing these temp files. A rough illustration follows this list.
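A rough sketch of the pack-tuning experiment from the first corrective action. The values are illustrative starting points rather than recommendations, and would need testing against a repo with pre-existing large pack files:

```shell
# Per-repo knobs (or per-host via --global) that bound pack-objects memory usage.
git config pack.windowMemory 100m     # cap delta-search memory per thread
git config pack.threads 4             # fewer threads => lower aggregate cap
git config repack.writeBitmaps true   # keep bitmap indexes so pack reuse stays cheap
# Rebuild packs with a bitmap index (-b) so upload-pack can reuse them cheaply:
git repack -a -d -b
```

And a minimal illustration of the spooling idea from the second corrective action: let the git processes write to a temp file and exit, then drain that file at whatever pace the client can sustain. This is a conceptual sketch, not Gitaly's implementation; `request.bin` and `slow_client_writer` are placeholders.

```shell
# Streaming (status quo): pack-objects keeps its memory until the slow reader finishes.
git upload-pack --stateless-rpc [repo_dir].git < request.bin | slow_client_writer

# Spooling (proposed): git exits as soon as the pack is written; the temp file is
# then streamed to the client with negligible server-side memory cost.
tmp=$(mktemp /var/tmp/upload-pack.XXXXXX)
git upload-pack --stateless-rpc [repo_dir].git < request.bin > "$tmp"
slow_client_writer < "$tmp"
rm -f "$tmp"
```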
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)