2020-07-27: Memory and CPU saturation on Gitaly node file-42
Summary
Memory and CPU starvation on Gitaly node `file-42` appears to have been caused by a batch of `git` processes each holding 200 MB to 2 GB of memory. Some additional processes were also killed, but the `git` processes were by far the biggest memory consumers and were the cause of the abnormal spike in memory usage.
As follow-up to the incident, we profiled git processes that both allocated a large amount of memory and held that memory for a non-trivial amount of time. The results highlight that two git subcommands spawned from Gitaly are prone to large, long-lived memory allocations:
- `git pack-objects` (the child process underlying `git upload-pack`)
- `git archive`

During the incident itself, `git upload-pack` processes (and implicitly their `git pack-objects` children) dominated the host's memory usage. The post-incident memory usage profile summarized above is a lower-concurrency form of the profile observed during the incident (a rough sampling approach is sketched at the end of this summary). Improving this memory usage pattern should help reduce memory pressure in both the general case and the extreme case that led to this incident.
As a corrective action, we should explore one or both of the following:
- Reduce memory usage by `git pack-objects` and `git archive` processes.
- Decouple the lifespan of those processes from the end-user's client, so that low-bandwidth clients do not implicitly exacerbate server-side memory pressure.
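For reference, the kind of follow-up profiling described above can be approximated with ad-hoc sampling of process age and resident memory. This is a minimal sketch, not the exact tooling we used; the thresholds and interval are arbitrary assumptions.

```shell
# Every 10 seconds, list git processes that are both long-lived (>30 s)
# and holding a large resident set (>~200 MB). rss is reported in KiB.
while true; do
  date -u +%FT%TZ
  ps -eo pid,ppid,etimes,rss,args --no-headers \
    | awk '$5 ~ /git/ && $3 > 30 && $4 > 200000 { print }'
  sleep 10
done
```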
Timeline
All times UTC.
2020-07-27
- 08:24 - Gitaly node `file-42` abruptly saturates its CPU and memory.
- 08:29 - Gitaly node `file-42` returns to normal CPU and memory usage.
In addition to the CPU and memory usage mentioned above, the following symptoms occurred during the incident:
- The kernel out-of-memory killer was triggered 224 times over a 70-second timespan.
- The host's softirq rate dropped to practically nil. This typically corresponds to a drop in interrupt-driven work such as network packet handling.
- The host's physical disk throughput on `sdb` (where the git repos are stored) spiked to 10x the normal read ops/second rate (but no such spike occurred in read bytes/second). Extra disk I/O is a predictable side effect of the filesystem cache having been eroded (in this case by the memory-intensive git processes).
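To make the cache-erosion effect above concrete: the page cache size and the read activity on `sdb` can be sampled together. This is a generic illustration (assuming the `sysstat` package provides `iostat`), not the dashboard queries referenced below.

```shell
# Watch the page cache shrink while read activity on sdb climbs.
# The second iostat report covers a 10-second interval, so the loop paces itself.
while true; do
  printf '%s  page cache: %s kB\n' "$(date -u +%T)" "$(awk '/^Cached:/ {print $2}' /proc/meminfo)"
  iostat -dx 10 2 sdb | grep '^sdb' | tail -1
done
```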
Metrics
WARNING: Some of the log events presented in Kibana appear to have timestamps up to a few minutes offset from the raw kernel logs. This makes correlating events by timestamp more challenging.
Raw kernel logs: file-42-stor-gprd.20200727.kern.log.gz
Dashboard links:
- Grafana Host Stats dashboard - Shows host-level resource usage for `file-42`, zoomed in to the relevant timespan.
- Memory usage summary by git subcommand for this Gitaly node - Shows that during the memory starvation, `git upload-pack` processes (and their children, `git pack-objects`) collectively held over 80 GB of memory. This agrees reasonably well with the OOMK snapshot of per-process memory usage, and unlike the OOMK process listing, this view tallies by git subcommand rather than just showing each process as an unspecified git command.
- Kibana kernel log events for OOMK on Gitaly nodes - Shows which process was killed each time the OOMK was invoked, including how much virtual and physical memory that process had allocated at its time of death.
- Kibana log events for Gitaly errors near the time of the OOMK events - These may or may not be directly related to the memory and CPU saturation.
- Kibana log events for Workhorse - Shows incoming HTTP POST requests to `[repo_path]/git-upload-pack` whose duration is over 60 seconds. This typically correlates with large response payloads (field `json.written_bytes`).
CPU saturation
Memory saturation
Note: The memory cgroup for Gitaly-related processes is constrained to slightly less than the host's total physical memory capacity.
Memory usage was mostly anonymous (non-file-backed) allocations
This suggests the processes were mallocing a lot of private memory.
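For context, the anonymous vs. file-backed split charged to the Gitaly memory cgroup can be read from the cgroup's `memory.stat`. The path below assumes a cgroup v1 hierarchy and uses a placeholder cgroup name; the actual path on these hosts may differ.

```shell
# Resident anonymous (rss) vs. page-cache memory charged to the Gitaly cgroup.
# "/gitaly" is a placeholder for the real cgroup path on the host.
CG=/sys/fs/cgroup/memory/gitaly
grep -E '^(rss|cache|mapped_file) ' "$CG/memory.stat"
cat "$CG/memory.limit_in_bytes"   # the limit mentioned in the note above
```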
Since the OOMK events show that both `ruby` and `git` processes were holding a lot of resident anonymous memory, this suggests that some gRPC calls spawned `git` processes that ended up requiring a lot of memory. The kernel's log events do not include the specific details of the `git` and `ruby` command-line args.
We later determined that the killed `git` processes with a large memory footprint were `git pack-objects` commands spawned by `git upload-pack` commands, which in turn were created by `gitaly` to handle `PostUploadPack` or `SSHUploadPack` gRPC calls. The killed `ruby` processes were Gitaly's pool of ruby helpers -- they may have been unrelated to the increased memory pressure.
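Because the OOMK log lines only record the short command name, the parent/child relationship had to be established separately. A live snapshot along these lines (illustrative, not the exact command used) makes the ancestry and memory footprint visible:

```shell
# Show upload-pack/pack-objects processes as a tree with resident memory (KiB) and age.
ps -eo pid,ppid,rss,etimes,args --forest \
  | grep -E 'PID|git.*(upload-pack|pack-objects)'
```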
Disk throughput (ops/sec and bytes/sec)
During the memory saturation, the rate of physical read ops/second followed the same shape as the memory usage. However, the read bytes/second rate did not, suggesting that the burst of additional read ops were smaller than the typical op size (perhaps many small single-block reads rather than larger sequential reads of large files). This is presumably a side effect of the filesystem cache shrinking dramatically and thus having a much lower cache hit rate.
Also, a few minutes before the incident began, the write rate (both ops/sec and bytes/sec) spiked significantly. This may be completely unrelated to the incident, or it might have been a lead-up event (e.g. a large push to a repo that many clients subsequently fetched concurrently).
Incoming connections dropped
Gitaly normally receives roughly 1000 new connections per minute. During this event, roughly 170 incoming connection attempts were dropped, probably because Gitaly was unable to accept
new connections as quickly as they arrived while the host's CPU and memory resources were saturated.
Timestamps of the OOMK events as recorded in the raw kernel logs
This shows accurate timestamps for when memory pressure became severe enough that the kernel's out-of-memory killer had to activate and start choosing processes to kill. We can compare these timestamps with logs and metrics from other sources.
msmiley@file-42-stor-gprd.c.gitlab-production.internal:~$ sudo zcat /tmp/file-42-stor-gprd.20200727.kern.log.gz | grep 'Killed process' | wc -l
224
msmiley@file-42-stor-gprd.c.gitlab-production.internal:~$ sudo cat /var/log/kern.log | grep 'Killed process' | awk '{ print $1, $2, $3, $10 }' | uniq -c
6 Jul 27 08:24:52 (ruby)
10 Jul 27 08:24:53 (ruby)
4 Jul 27 08:24:54 (ruby)
5 Jul 27 08:24:54 (pre-receive)
7 Jul 27 08:24:54 (bundle)
2 Jul 27 08:24:54 (git)
1 Jul 27 08:24:55 (git)
1 Jul 27 08:24:55 (bundle)
8 Jul 27 08:24:55 (git)
10 Jul 27 08:24:56 (git)
8 Jul 27 08:24:57 (git)
9 Jul 27 08:24:58 (git)
6 Jul 27 08:24:59 (git)
1 Jul 27 08:24:59 (bundle)
1 Jul 27 08:25:00 (git)
1 Jul 27 08:25:00 (bundle)
7 Jul 27 08:25:00 (git)
7 Jul 27 08:25:01 (git)
1 Jul 27 08:25:01 (bundle)
1 Jul 27 08:25:02 (git)
1 Jul 27 08:25:02 (bundle)
3 Jul 27 08:25:02 (git)
1 Jul 27 08:25:02 (bundle)
9 Jul 27 08:25:03 (git)
19 Jul 27 08:25:04 (git)
26 Jul 27 08:25:05 (git)
26 Jul 27 08:25:06 (git)
5 Jul 27 08:25:07 (git)
4 Jul 27 08:25:07 (ps)
8 Jul 27 08:25:07 (git)
1 Jul 27 08:25:07 (gitaly)
1 Jul 27 08:25:08 (git)
1 Jul 27 08:25:11 (git)
1 Jul 27 08:25:12 (git)
2 Jul 27 08:25:13 (git)
1 Jul 27 08:25:14 (git)
3 Jul 27 08:25:21 (git)
1 Jul 27 08:25:25 (git)
2 Jul 27 08:25:26 (git)
2 Jul 27 08:25:28 (git)
1 Jul 27 08:25:29 (git)
1 Jul 27 08:25:35 (git)
2 Jul 27 08:25:37 (git)
2 Jul 27 08:25:39 (git)
2 Jul 27 08:25:41 (git)
2 Jul 27 08:25:43 (git)
1 Jul 27 08:26:08 (git)
Contrast with Kibana's timestamps for the same 224 events:
Tally memory usage by git-subcommand
The processes running `git upload-pack` were by far the dominant memory consumers in Gitaly's memory cgroup at the time of the OOMK incident. At peak, they held 85 GB out of the cgroup's 90 GB limit.
For reference, `git upload-pack` spawns a `git pack-objects` child process, which is what accumulates and holds a large memory footprint. The Prometheus process_exporter metric is presumably attributing the child process's memory usage to the parent process.
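This tally can be reproduced with a query against the process_exporter data in Prometheus. The metric and label names below follow the common process-exporter convention and are assumptions about our exact scrape configuration; the Prometheus URL and instance label are placeholders.

```shell
# Sum resident memory by process group (e.g. git upload-pack) for this host.
PROM_URL=http://prometheus.example.internal:9090
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum by (groupname) (namedprocess_namegroup_memory_bytes{memtype="resident", instance="file-42-stor-gprd:9256"})' \
  | jq '.data.result[] | {group: .metric.groupname, bytes: .value[1]}'
```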
Incident Review
Summary
See above.
- Service(s) affected: Gitaly
- Team attribution: Gitaly
- Minutes downtime or degradation: 5 minutes
Customer Impact
- Who was impacted by this incident?
  - Customers (both external and internal) whose git repos are stored on the affected Gitaly node (`file-42`).
- What was the customer experience during the incident?
  - Any kind of interaction with git repos on this node had a high likelihood of failure. Retrying after the incident should have succeeded.
- How many customers were affected?
  - Unknown.
- If a precise customer impact number is unknown, what is the estimated potential impact?
  - At most 2% of customers have repos on this Gitaly node, but in practice only a small percentage of those customers would have been active during the 5-minute incident timespan.
  - Most calls to this Gitaly node would have failed due to timeouts, and a small percentage would have had other errors.
  - This specific Gitaly node's apdex dropped to 20% during the incident and recovered immediately afterwards.
Incident Response Analysis
- How was the event detected?
  - Manual detection. Discovered while searching all hosts' kernel logs for OOMK events, while analyzing an unrelated problem involving memory saturation on a different host: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10886#note_386134858
- How could detection time be improved?
  - Alert whenever the kernel logs an OOMK event. This could take the form of an `mtail` Prometheus metric with an associated alerting rule; a rough stopgap sketch is shown after this list.
- How did we reach the point where we knew how to mitigate the impact?
  - The incident self-resolved, thanks to the kernel's OOMK.
  - For a terse summary of how we discovered what happened, what triggered it, and how to better mitigate it in the future, see the next section: "Post Incident Analysis", question 1.
- How could time to mitigation be improved?
  - Generally the kernel's OOM killer deals with memory starvation more quickly than a human admin could, but if we needed to deal with sustained saturation, we could try tactics such as imposing a timeout, blocking abusive clients, migrating targeted repos to an isolated quarantine Gitaly node (to reduce impact on other customers), caching git pack files, etc.
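As a rough illustration of the detection idea above (not a tested `mtail` program), a cron job could export a cumulative OOMK count through node_exporter's textfile collector until a proper mtail rule exists. The textfile directory and metric name are assumptions:

```shell
#!/usr/bin/env bash
# Count OOM-kill events in the kernel log and expose the count as a Prometheus metric.
# The textfile-collector directory below is a placeholder for the host's real path.
set -euo pipefail
count=$(grep -c 'Killed process' /var/log/kern.log || true)
cat > /var/lib/node_exporter/textfile_collector/oom_kills.prom <<EOF
# HELP node_oom_kills_logged Number of 'Killed process' lines currently in kern.log.
# TYPE node_oom_kills_logged gauge
node_oom_kills_logged ${count}
EOF
```

An alerting rule would then fire on any increase of this metric.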
Post Incident Analysis
- How was the root cause diagnosed?
  - The kernel logs for the OOMK events indicate that a large group of `git` processes were collectively responsible for saturating memory. Details: #2457 (comment 386385034)
  - The process_exporter metrics identified which git subcommand was associated with those memory-intensive git processes. Details: #2457 (comment 387363243)
  - Follow-up analysis of memory usage patterns under normal conditions for the affected host revealed a chronic problem: clients with slow Internet connections trying to git-fetch a large set of git objects can lead to an accumulation of these memory-hungry processes, which collectively leads to memory saturation as occurred during this incident. Details: #2457 (comment 387892195) and #2457 (comment 388444226)
  - Concluded that `git pack-objects` and `git archive` implicitly behave like an in-memory buffer, and that to reduce the timespan of their memory usage, we could let them spool to a local temp file rather than stream to a potentially very slow pipe. Details: #2457 (comment 388447144)
- How could time to diagnosis be improved?
  - No clear big improvements come to mind. The diagnosis felt fairly efficient this time -- most of the time was spent forming a richer mental model of the relevant behaviors through observation and testing. Once it became clear that the lifespan of the bloated process was indirectly controlled by the end-user's network bandwidth, the problem and solution became clear.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
  - No, not to my knowledge.
  - The problem is conceptually similar to other recent discussions of the trade-offs of buffering HTTP request/response payloads. The bloated git process is not really buffering, but the effect is similar because it does not bother to conserve memory by deallocating or recycling its existing allocations.
- Was this incident triggered by a change?
  - No, it was triggered by normal functionality. This incident exposes a drawback of a design choice in Gitaly that we may want to make configurable.
5 Whys
- Gitaly node `file-42` abruptly grew its memory usage to the point of saturation. Why?
  - Many clients tried to `git fetch` a large amount of data from a single git repo.
  - Each of those fetches spawned a pair of processes: `git upload-pack` and `git pack-objects`.
  - Each `git pack-objects` process holds a lot of memory for a long time.
- Why did the `git pack-objects` processes allocate a lot of memory?
  - The repo includes a 4 GB pack file, and the clients needed that pack file to satisfy their fetch requests.
  - Including that pack file in the response to the client implicitly requires reading it into memory, hence the large memory allocation. (A quick way to spot such oversized packs is sketched after this list.)
- Why did the `git pack-objects` processes hold that memory for so long?
  - The `git pack-objects` process streams its output to its parent process (`git upload-pack`), which streams it to Gitaly, which streams it up the call chain (through workhorse, nginx, haproxy, and cloudflare). None of those layers buffers the entire response payload, so the origin of the response data -- the `git pack-objects` process -- must pause until its downstream can accept more data. While it sleeps, its memory allocation remains in place, contributing to the overall memory pressure.
- Why do none of the downstream layers buffer the complete response?
  - Doing so would bloat their memory footprint, making them susceptible to the same problem we see here.
  - Cleaning up bloat from in-memory buffering is harder in long-lived processes, which have to deal with a menagerie of garbage-collection challenges. Spooling to temp files on disk is in many cases a cheaper and more scalable alternative.
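As a quick, illustrative way to spot repos whose packs are large enough to trigger this pattern (paths are placeholders, following the `[repo_dir]` convention used elsewhere in this issue):

```shell
# Largest pack files in a repo's object store.
du -h [repo_dir].git/objects/pack/*.pack | sort -h | tail -5
# Or summarize with git itself (size-pack shows the total pack size):
git -C [repo_dir].git count-objects -v -H
```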
Lessons Learned
- `git upload-pack` and its child `git pack-objects` can cause severe memory pressure under certain conditions. This incident improved our understanding of those conditions, and we have a few ideas for mitigations.
- Other git subcommands (e.g. `git archive`) may behave similarly, where the size and duration of the memory footprint largely depend on uncontrolled factors such as the client's network bandwidth and the size of the git objects being transported.
Corrective Actions
To solve the problem, we need to do one or both of the following:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11019 - Help git use less memory when composing a pack file.
  - Consider tuning pack.windowMemory to constrain the per-thread memory limit of each `git pack-objects` process. We will need to test this on a repo with pre-existing large pack files, assuming that pack.allowPackReuse is enabled (default: true). A tuning sketch follows this list.
  - Consider ensuring that all saved pack files in `[repo_dir].git/objects/pack/` have an associated bitmap index. If pack.packSizeLimit is set to constrain the max size of server-side pack files, a bitmap will not be generated if the pack must be split across multiple files. See also pack.useBitmaps.
- gitlab-org/gitaly#3003 (closed) - Decouple the lifespan of the `git pack-objects` process from the speed of the client's network connection by buffering the output somewhere in the pipeline.
  - Consider spooling the output of `git upload-pack` to a file that Gitaly reads, so that the git processes exit as quickly as possible. This carries trade-offs briefly described here and may not be appropriate for all GitLab deployments. Even for GitLab.com, we may want to configure a separate filesystem for storing these temp files. A rough illustration follows this list.
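A rough sketch of the pack-tuning experiment from the first corrective action. The values are illustrative starting points rather than recommendations, and would need testing against a repo with pre-existing large pack files:

```shell
# Per-repo knobs (or per-host via --global) that bound pack-objects memory usage.
git config pack.windowMemory 100m     # cap delta-search memory per thread
git config pack.threads 4             # fewer threads => lower aggregate cap
git config repack.writeBitmaps true   # keep bitmap indexes so pack reuse stays cheap
# Rebuild packs with a bitmap index (-b) so upload-pack can reuse them cheaply:
git repack -a -d -b
```

And a minimal illustration of the spooling idea from the second corrective action: let the git processes write to a temp file and exit, then drain that file at whatever pace the client can sustain. This is a conceptual sketch, not Gitaly's implementation; `request.bin` and `slow_client_writer` are placeholders.

```shell
# Streaming (status quo): pack-objects keeps its memory until the slow reader finishes.
git upload-pack --stateless-rpc [repo_dir].git < request.bin | slow_client_writer

# Spooling (proposed): git exits as soon as the pack is written; the temp file is
# then streamed to the client with negligible server-side memory cost.
tmp=$(mktemp /var/tmp/upload-pack.XXXXXX)
git upload-pack --stateless-rpc [repo_dir].git < request.bin > "$tmp"
slow_client_writer < "$tmp"
rm -f "$tmp"
```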
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)