Incident Review for intermittent CPU saturation on gitaly node file-45 on 2020-03-16

Incident being reviewed: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1774

Related incident with the same root cause: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1731

These incidents are confidential, to protect the privacy of the projects involved.

Summary

  • Service(s) affected : Gitaly
  • Minutes downtime or degradation : Several 10-25 minute periods of slowness, summing to 195 minutes spread across 11 days

One of the Gitaly shards (file-45-stor-gprd) reached CPU saturation, leading to an increased slowness and rate of errors (timeouts). Other Gitaly shards were not affected.

The CPU saturation was triggered by many clients concurrently running git fetch or git clone on a repo that has a large graph of commit objects to traverse. To generate the inventory of objects that the client needs, Gitaly runs git pack-objects, which for this repo requires significant CPU time and memory. Enough of these helper processes were running that the host starved for CPU time, resulting in the observed increased latency and error rate (timeouts) from this particular Gitaly shard.

For reference, this incident ties back to other similar regressions on this particular Gitaly shard starting on 2020-03-05. Each regression event caused roughly 10-25 minutes of degraded performance on 1 of the 52 Gitaly shards (file-45). Now that we understand the pathology, we can link it back to the following regression events:

  • 2020-03-16: 3 events = 65 minutes
  • 2020-03-13: 1 event = 20 minutes
  • 2020-03-12: 3 events = 60 minutes
  • 2020-03-09: 1 event, 10 minutes
  • 2020-03-05: 2 events, 40 minutes

Mitigations

The owner of the repo whose distributed builds triggered this regression has kindly offered to adjust their build to reduce its impact. Reusing fresher git-clones on the build servers may help, and reducing concurrency would definitely help.

On the GitLab server side, Gitaly needs to avoid saturation by limiting resource usage by these helper processes. Both CPU time and memory are relevant to this problem.

One options is to limit the concurrency of the gRPC PostUploadPack calls, which spawn these git pack-objects helper processes. (For context, the git client's fetch, pull, clone, etc. subcommand sends an HTTP POST request to the [repo-url]/git-upload-pack endpoint, which causes Rails to send Gitaly a PostUploadPack gRPC call.) Using an instance-wide limit in combination with a smaller scoper limit per-namespace or per-project would allow the other projects to continue functioning while the busy peer is rate-limited.

Another possibility is using cgroups to impose a hard limit on resource usage by particular types of helper processes. However, under certain workload patterns, that could result in starving all tenants of a Gitaly shard. This may be better suited as a second line of defense, with the application-based concurrency limits acting as the primary defense with more graceful degradation behavior.

Impact & Metrics

  • What was the impact of the incident? Slowness and intermittent errors for operations involving git repos stored on Gitaly shard file-45.
  • Who was impacted by this incident? All customers having git repos stored on Gitaly shard file-45 (roughly 2% of the repos on GitLab.com).
  • How did the incident impact customers? Git operations would have been slow enough to sometimes timeout.
  • How many attempts were made to access the impacted service/feature? Routine traffic over a 20 minute timespan
  • How many customers were affected? Approximately 2% of repos were slow to access (sometimes up to timing out). Unknown how many customers actively experienced this slowness.
  • How many customers tried to access the impacted service/feature? See above.

Dashboard "gitaly: Overview"

Latency apdex for the Gitaly service as a whole (i.e. not just the degraded shard) shows a drop of at most 2% during the regressions on 2020-03-16. Corresponding spikes in Gitaly error rate (due to timeouts) jumped as high as 0.6%. A similar pattern was found on the earlier dates that we later backtracked this pathology to.

Latency

https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&from=1584378000000&to=1584396000000&fullscreen&panelId=3&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2

Screenshot_from_2020-03-18_13-21-00

Error rate

https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&from=1584378000000&to=1584396000000&fullscreen&panelId=4&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2

Screenshot_from_2020-03-18_13-26-30

Dashboard "Host Stats"

The host-level resource usage metrics for the affected Gitaly node (file-45) clearly show the CPU usage saturation and memory usage increase.

Network egress throughput spiked as a result of sending the clients the git objects they needed, but the network throughput was nowhere near the saturation point. Disk I/O (not shown) was not significantly affected. Disk read throughput moderately increased but most disk I/O was satisfied by the filesystem cache, with physical disk I/O only increasing as a result of the cache being eroded by the anonymous memory allocations by the numerous git pack-objects processes.

CPU usage

https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-environment=gprd&var-node=file-45-stor-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd&from=1584378000000&to=1584396000000&fullscreen&panelId=8

Screenshot_from_2020-03-18_13-29-29

Memory usage

https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-environment=gprd&var-node=file-45-stor-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd&from=1584378000000&to=1584396000000&fullscreen&panelId=39

Screenshot_from_2020-03-18_13-31-17

Network usage

https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-environment=gprd&var-node=file-45-stor-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd&from=1584378000000&to=1584396000000&fullscreen&panelId=12

Screenshot_from_2020-03-18_13-32-21

Command line list of processes

top shows the CPU usage was predominantly from the git pack-objects processes spawned by Gitaly, rather than by threads within Gitaly itself. Note that each of these processes holds a significant amount of resident memory beyond its shared memory.

Also, each of these git pack-objects processes has accumulated many seconds of CPU time (some have several minutes of CPU time). Follow-up analysis showed that these processes tended to have 40-60% of wall clock time spent on-CPU, at least during the regression; we suspect these processes are entirely CPU-bound and that their off-CPU time may have just been waiting for their next time slice, due to CPU starvation.

These git pack-objects processes were mostly operating on the same git repo directory (not shown), which is one of the ways we traced the regression back to its triggering conditions.

Top processes using CPU time

top - 20:03:16 up 67 days, 23:14,  2 users,  load average: 82.07, 71.93, 43.87
Tasks: 972 total,  75 running, 700 sleeping,   0 stopped,  30 zombie
%Cpu(s): 94.9 us,  4.6 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 12376827+total, 18716604 free, 71279952 used, 33771716 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 49609072 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                 
30400 git       20   0 12.944g 2.407g  18684 S  87.3  2.0   3104:23 /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml                                                                      
24383 git       20   0 3666708 0.985g 724520 R  55.9  0.8   0:11.10 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
18792 git       20   0 4010080 1.421g 910388 R  52.3  1.2   0:44.75 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
15964 git       20   0 4048068 1.616g 995520 R  52.0  1.4   1:06.52 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
17808 git       20   0 4014652 1.480g 945304 R  51.3  1.3   0:50.92 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
19394 git       20   0 3912712 1.389g 891808 R  50.3  1.2   0:41.27 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
10993 git       20   0 4409636 3.195g 2.279g R  50.0  2.7   1:36.05 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
16897 git       20   0 4531220 2.765g 1.732g R  47.4  2.3   3:41.96 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
13408 git       20   0 4305772 1.813g 0.988g R  47.1  1.5   1:24.43 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
14087 git       20   0 4264240 1.769g 0.983g R  47.1  1.5   1:19.99 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
14833 git       20   0 4264240 1.758g 0.979g R  46.7  1.5   1:18.16 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
14516 git       20   0 4531220 2.773g 1.739g R  46.1  2.3   3:51.42 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
19094 git       20   0 4010080 1.418g 909592 R  46.1  1.2   0:43.87 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
 4721 git       20   0 4431444 3.173g 2.237g R  45.8  2.7   1:43.69 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
14081 git       20   0 4531220 2.923g 1.888g R  45.4  2.5   3:57.74 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
16557 git       20   0 4035388 1.556g 971636 R  45.1  1.3   0:59.52 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
21825 git       20   0 3852644 1.219g 824620 R  45.1  1.0   0:26.51 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
12431 git       20   0 4406380 3.212g 2.300g R  44.4  2.7   1:32.37 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
23211 git       20   0 3769068 1.121g 776424 R  44.4  0.9   0:18.12 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
16331 git       20   0 4036872 1.577g 980652 R  44.1  1.3   1:01.84 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
24308 git       20   0 4495184 2.531g 1.533g R  44.1  2.1   3:13.28 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
15668 git       20   0 4049636 1.626g 976.6m R  43.8  1.4   1:07.32 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
17750 git       20   0 4021540 1.504g 953560 R  43.8  1.3   0:52.94 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
 5924 root      20   0  775092 211420  12276 S  43.1  0.2   1190:39 /opt/td-agent/embedded/bin/ruby -Eascii-8bit:ascii-8bit /opt/td-agent/embedded/bin/fluentd --log /var/log/td-agent/td-agent.log --daem+ 
12601 git       20   0 4531220 2.895g 1.860g R  42.8  2.5   3:58.59 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
18466 git       20   0 4011404 1.438g 919280 R  42.5  1.2   0:46.23 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
19942 git       20   0 3906556 1.356g 876976 R  42.5  1.1   0:37.52 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
17034 git       20   0 4035388 1.551g 969276 R  42.2  1.3   0:58.61 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
13404 git       20   0 4305772 1.803g 0.985g R  41.8  1.5   1:22.21 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
15915 git       20   0 4047368 1.610g 991744 R  41.5  1.4   1:05.33 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
19079 git       20   0 3912712 1.386g 890100 R  41.2  1.2   0:42.30 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
 2105 git       20   0 3624628 3.441g   3600 R  39.2  2.9   2:59.31 /opt/gitlab/embedded/libexec/git-core/git unpack-objects --pack_header=2,33 --strict --max-input-size=10485760000                       
18049 git       20   0 4018152 1.483g 943724 R  39.2  1.3   0:50.88 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
22826 git       20   0 4495184 2.525g 1.528g R  38.9  2.1   3:18.47 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
16607 git       20   0 4031684 1.536g 963384 R  38.6  1.3   0:57.23 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
21679 git       20   0 3854772 1.221g 823376 R  38.6  1.0   0:28.12 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
23295 git       20   0 4495184 2.544g 1.547g R  38.6  2.2   3:17.52 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
16630 git       20   0 4035388 1.569g 978100 R  38.2  1.3   1:00.71 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
19127 git       20   0 3913500 1.391g 890976 R  37.9  1.2   0:42.71 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
12430 git       20   0 4414044 3.219g 2.300g R  37.6  2.7   1:32.96 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
17812 git       20   0 4013020 1.475g 944612 R  37.6  1.2   0:50.06 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
17967 git       20   0 4016844 1.482g 943656 R  37.6  1.3   0:50.65 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
13384 git       20   0 4305772 1.812g 0.987g R  37.3  1.5   1:23.82 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
14866 git       20   0 4264240 1.716g 985.4m R  37.3  1.5   1:11.54 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
15103 git       20   0 4264240 1.722g 989.3m R  37.3  1.5   1:12.10 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
 7408 git       20   0 2235824 870708 578048 R  36.9  0.7   1:44.94 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --progress --delta-base-offset --include-tag              
 7871 git       20   0 4409636 3.151g 2.235g R  36.9  2.7   1:37.31 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
22605 git       20   0 4495184 2.534g 1.536g R  36.9  2.1   3:21.44 /opt/gitlab/embedded/libexec/git-core/git pack-objects --revs --thin --stdout --delta-base-offset                                       
...

CPU profile of Gitaly host file-45 during the regression

A CPU profile of all on-CPU processes shows that git pack-objects dominated CPU time and that they collectively spent most of their time calculating the deltas between what git objects the client request says it already has versus what the server-side bare repo has available to offer. This suggests that the pathology is more likely to affect git repos with a large complex graph of object trees to compare.

perf profile of on-CPU processes' stacks during the regression

msmiley@file-45-stor-gprd.c.gitlab-production.internal:~$ sudo perf record --freq 99 -a -g -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 5.118 MB perf.data (31680 samples) ]

msmiley@file-45-stor-gprd.c.gitlab-production.internal:~$ sudo perf script --header > perf-script.during_cpu_saturation_by_many_git_pack_objects_pids.out

msmiley@saoirse:~/src/git/public/FlameGraph [master|✔] $ cat /tmp/perf-script.during_cpu_saturation_by_many_git_pack_objects_pids.out | ./stackcollapse-perf.pl | ./flamegraph.pl > /tmp/file-45.during_git_pack_objects_cpu_saturation.svg

Complete SVG of flame graph: file-45.during_git_pack_objects_cpu_saturation.svg

flame-graph.git-pack-objects-saturating-cpu-on-gitaly-node-file-45

Detection & Response

Start with the following:

  • How was the incident detected? PagerDuty alerts such as here and here: "Gitaly error rate is too high"
  • Did alarming work as expected? Yes.
  • How long did it take from the start of the incident to its detection? 7 minutes.
  • How long did it take from detection to remediation? The regressions self-resolved roughly 10 minutes after the alert triggered. Analysis of the regression continued after incidents resolved, and once we discovered enough about the pathology and triggering conditions, we reached out to the owner of the triggering repo. We temporarily blocked the behavior for that repo to protect other customers from the side-effects, and worked with the customer to understand their use-case and mitigate the impact.
  • Were there any issues with the response to the incident? No. We had all the necessary access to people and technology.

Root Cause Analysis

A Gitaly node saturated its CPUs and significantly changed its memory usage profile.

  1. Why? - More CPU-bound memory-hungry processes were spawned than there are CPUs available on the host.
  2. Why? - Gitaly received numerous PostUploadPack gRPC calls from clients running git fetch (or similar), and Gitaly processed too many of them concurrently.
  3. Why? - Most repos are not prone to this pathology, so it has not yet been a focus of attention for Gitaly tuning.
  4. Why? - To trigger the pathology, the repo being accessed must have a very large object graph to traverse when composing a response to the client's git fetch (more specifically the git client's HTTP POST to [repo_path]/git-upload-pack). Even for repos that meet this condition, to cause a regression on the Gitaly node, over 30 client requests to this repo must be concurrently running.
  5. Why? - The Gitaly server does not have enough CPU and memory capacity to handle workloads beyond the above specifications. Once reaching the saturation point, all other Gitaly operations begin to slow down due to CPU starvation, with the degree of effect being proportional to the number of CPU-bound processes running.

What went well

  • Cross-team collaboration was fantastic!
  • IMOC and CMOC did a wonderful job (as usual!) of facilitating the investigation, getting help where needed, doing regular pulse checks, and guiding the balance of what info was ok to release publicly (as we were unsure at first whether this was a precursor to a malicious attack or an accidental outcome of a legitimate use-case).
  • Alerting worked appropriately.
  • Gitaly's and Rails' event logging to Elasticsearch gave us excellent observability into the app behavior and client request profile.
  • The CPU profiling tools (perf and BCC) gave us a clearer picture of what the CPU time was being spent on and which of the numerous git repos were associated with the bulk of the activity.

What can be improved

  • To prevent this from happening again, we need to prevent Gitaly from spawning enough git processes to saturate machine resources such as CPU and memory. Gitaly already supports some forms of concurrency limiting, and tuning those may be sufficient. Additional layers of protection (e.g. cgroups) could also be implemented, although we should consider the effects of saturation in each case and choose the most desirable failure mode for the health of the service as a whole when saturation is reached. (For example, we prefer: slowness over downtime; downtime over possible corruption; automatic over manual recovery; etc.)
  • Each of the incidents in this 11-day series were low enough severity (S3) that we did not take the time to do root cause analysis until we noticed they were part of a recurring pattern. Our on-call engineers field enough alerts, questions, and interruptions that we rarely have time to analyze self-recovering regressions like this. This may be worth discussing as an iterative process improvement for incident triage. Or maybe it's an acceptable cost of our existing prioritization scheme.

Corrective actions

Guidelines

Edited by Matt Smiley