2022-03-27: ssh frontend service apdex and error rate breaching SLO
Incident DRI
Current Status
The user-facing apdex and error rate regression was caused by exceeding the capacity of the GitLab Shell service in 1 of 3 GCP zones; pod autoscaling did not compensate sufficiently. We can therefore classify this as saturation behavior, but open questions remain about the range of potential triggering conditions and the bias toward affecting only some GKE nodes in one zone at a time.
More details are covered in the Findings summary section below, but here is a quick overview of discoveries:
- The outcome of gitlab-shell pods starving for memory may be generic, induced by any of several potential triggers.
- Generally, anything causing the concurrent connection count within a pod to rise can lead to this outcome.
- Concurrency can rise either because a small percentage of requests becomes remarkably slow or because the mean response rate is slightly lower than the request rate.
- It might be self-reinforcing once the pressure reaches a critical point. This is a testable open question.
- Growing the pod's memory budget would likely just delay the regression, not prevent it.
- Adjusting the HPA policy to consider memory utilization in addition to CPU utilization may allow autoscaling to add capacity, taking some pressure off of saturated nodes by reducing their incoming connection rate.
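As a sketch of that HPA adjustment (object names and thresholds here are illustrative assumptions, not the production configuration), an `autoscaling/v2` policy can target both CPU and memory utilization; the HPA then scales on whichever metric demands the most replicas:

```yaml
# Illustrative only: names, replica bounds, and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-shell
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-shell
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

With multiple metrics, the HPA computes a desired replica count per metric and takes the maximum, so memory pressure alone would be enough to trigger a scale-out.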
Summary for CMOC notice / Exec summary:
- Customer Impact: Intermittent slowness and timeout errors when using git-over-ssh
- Service Impact: GitLab Shell
- Impact Duration: start time UTC - end time UTC ( duration in minutes )
- Root cause: Saturation
Findings summary
gitlab-shell error rate and apdex spikes were triggered by an unknown condition that caused memory starvation (leading to OOM kills) and general backpressure on accepting new ssh connection requests.
Only some of the GKE nodes within the zone appear to be affected (roughly 9 out of 35 at one point). This asymmetry may imply a localized regression at the VM, hypervisor, or network level.
Affected GKE nodes consistently exhibit some distinctive features: frequent OOM-kill kernel log events and high load average. The memory pressure is caused by too many sshd processes accumulating within a pod. Many factors can affect the lifespan of those processes, and anything that lengthens it can lead to this memory pressure outcome.
Regardless of the triggering condition, once a pod reaches memory starvation, the kernel is forced to (1) attempt page reclaim, (2) often fail to find enough reclaimable pages, and (3) choose a victim process to kill. None of the individual processes have a large private memory footprint, so this pattern repeats again shortly after killing a process to reap its memory.
One of the distinctive features of this pathology is that some hosts (GKE nodes) are affected while others are not -- suggesting that local conditions play a role. What is this local condition? Some ideas (untested):
- This condition could be something mundane like packet loss causing sshd processes to live longer (holding their memory for longer), while new such processes continue to be spawned at a constant rate, driving memory pressure towards saturation wherever the ceiling is.
- Alternatively, the memory reclaim attempts themselves might cost enough overhead to actively participate in an amplification cycle. If so, once a host reaches severe enough memory starvation, it might undergo a phase transition where memory saturation becomes somewhat self-reinforcing: spending time attempting memory reclaim may extend the lifespan of sshd processes, again allowing the rate of incoming connection requests to outpace the rate of completing and closing connections. (This idea is highly speculative and could probably be tested or ruled out by host-wide CPU sampling and polling the kernel stacks of processes in a runnable or uninterruptible-sleep state.)
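The "poll process states" part of that test can be sketched in Python, assuming a Linux host with procfs; function names are illustrative. Counting processes in the R (runnable) or D (uninterruptible sleep) state over time would show whether reclaim pressure correlates with a growing backlog of stuck sshd processes:

```python
# Hypothetical diagnostic sketch: count processes in runnable (R) or
# uninterruptible-sleep (D) state by parsing /proc/<pid>/stat.
# Sampling kernel stacks (/proc/<pid>/stack) additionally requires root
# and is omitted here.
import glob

def parse_stat_state(stat_line: str) -> str:
    # The state letter follows the ")" that closes the comm field;
    # comm itself may contain spaces or parentheses, hence rpartition.
    return stat_line.rpartition(")")[2].split()[0]

def count_busy_processes() -> int:
    busy = 0
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                state = parse_stat_state(f.read())
        except OSError:
            continue  # process exited while we were iterating
        if state in ("R", "D"):
            busy += 1
    return busy
```

Sampling this counter once per second alongside OOM-kill events would be one cheap way to test the self-reinforcement hypothesis.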
A recurring pattern
For context, this incident is an example of a recurring pattern. I have seen it several times over the past month, and it has occurred again since this one was closed. I think it is worth doing some follow-up analysis to guide preventative work.
About alerting
From an alerting perspective, it sometimes manifests as an alert at the HAProxy layer (as in this incident) and sometimes at the gitlab-shell service directly. These two alerts capture the same regression event from two perspectives with different alerting thresholds, so we can treat them as equivalent during incident analysis.
About the causal chain
Saturating memory in the gitlab-shell pods may be a generic side-effect of backpressure from elsewhere in the inter-service call graph.
Any time the incoming request rate exceeds the response rate, the number of concurrently open ssh connections grows. Each one costs a few KB of memory, and a pod cannot support more than about 1000 such processes given the current memory limit. Growing the memory limit would just delay the same outcome, so that is unfortunately a fairly weak mitigation option.
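As a back-of-envelope check, the ceiling is just the pod's memory budget divided by the per-connection cost. The numbers below are assumptions, not measurements; they are chosen only to reproduce a ceiling of roughly 1000:

```python
# Back-of-envelope connection ceiling for a gitlab-shell pod.
# Both inputs are assumed values, not measured from production.
def connection_ceiling(memory_budget_bytes: int, per_connection_bytes: int) -> int:
    return memory_budget_bytes // per_connection_bytes

MiB = 2**20
GiB = 2**30

# Assumed: a 2 GiB usable budget and ~2 MiB of memory attributable to
# each connection (process RSS plus kernel buffers).
print(connection_ceiling(2 * GiB, 2 * MiB))  # 1024
```

The useful point is that the ceiling scales linearly with the memory limit, which is why raising the limit only delays, rather than prevents, saturation under a sustained imbalance.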
The triggering condition could be anything that slows the response rate. As illustrative examples, some ssh connections could become longer-lived due to things like:
- Gitaly applying backpressure via concurrency limits, resource contention, or any other cause of slowness
- slow reading of response payloads by clients (slowloris)
- network bandwidth or packet loss on any of the internal or external network paths
To induce memory pressure, a triggering condition does not need to affect all connections. It only needs to affect enough connections to drive up the pod's concurrent connection count to its saturation point (roughly 1000 connections). This is a case where long-lived connections can have a disproportionate impact on resource usage, entirely due to their age, apart from all other factors.
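The relationship described above between request rate, connection lifetime, and concurrency is Little's law (L = λ · W): mean concurrency equals arrival rate times mean lifetime. A minimal sketch with assumed, illustrative numbers shows how a small slow tail alone can push a pod toward the ~1000-connection ceiling:

```python
def mean_concurrency(arrival_rate_per_s: float, mean_lifetime_s: float) -> float:
    """Little's law: L = lambda * W."""
    return arrival_rate_per_s * mean_lifetime_s

# Assumed numbers for illustration only.
rate = 100.0             # new ssh connections per second per pod
fast, slow = 0.5, 120.0  # typical vs pathological connection lifetime (s)

baseline = mean_concurrency(rate, fast)
# If just 5% of connections become slow (e.g. slowloris clients or
# Gitaly backpressure), the mean lifetime jumps:
degraded = mean_concurrency(rate, 0.95 * fast + 0.05 * slow)
print(baseline, degraded)  # 50.0 vs 647.5 concurrent connections
```

With these assumptions, slowing only 1 in 20 connections raises steady-state concurrency about 13x, which is why long-lived connections dominate resource usage purely by virtue of their age.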
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- Gitlab.com Latest Updates
All times UTC.
2022-03-27
- 15:24 - @msmiley declares incident in Slack.
Create related issues
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Support contact request
- Corrective action
- Investigation followup
- Confidential issue
- QA investigation
- Infradev
Takeaways
- ...
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- ...
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline, or other information laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
- What were the root causes?
  - ...
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
- What went well?
  - ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)