2021-01-06 Degraded service

Summary

A networking issue in GCP left a small number of nodes (2 x web, 3 x API, and 1 x database replica) unable to talk to some other nodes. This manifested as timeouts for a small number of requests; it was initially assumed to be a larger issue, but in the end it was a fairly minor degradation/disruption rather than a major outage.

It was the same basic network issue as #3281 (closed), but affecting more than one node.

Timeline

All times UTC.

2021-01-06

  • 06:23:05 - first 'slow request' logged
  • 06:25 - first alert received: Increased Error Rate Across Fleet
  • 06:30 - Alert: Increased Serve Response Errors
  • 06:31-06:33 - A number of additional alerts received, including Puma Saturation and Postgres Replication Lag
  • 06:33 - cmiskell declares incident in Slack
  • 06:35 - Issue identified as probably the same as #3281 (closed), but on more machines.
  • 06:39 - web-16 + web-19 restarted
  • 06:41 - api-19 restarted
  • 06:42 - some alerts start clearing, likely because the two broken web nodes were restored to service; slow-request log rates return to baseline levels.
  • 06:45 - api-21 restarted
  • 06:47 - most service restored; load graphs are looking reasonable again.
  • 06:50 - api-24 restarted
  • 06:53 - api-cny-03 restarted
  • 06:57 - patroni-04 restarted
  • 07:00 - Most alerts are cleared
  • 07:27 - final alert (workhorse_auth_api SLI of the git service) clears

Corrective Actions

  1. While the IMOC can take care of CMOC duties for short incidents, we might want to allow some variation there (particularly for the small spots in the timezones where IMOC coverage comes from someone who should be asleep, or close to it). https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12289 - Completed
  2. Document (just as a pointer) https://console.cloud.google.com/net-intelligence/ in our runbooks, and share that more directly with the SRE team, perhaps in a DNA meeting: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12322, DRI @cmiskell, ETA: 2021-01-20


Incident Review

----

Summary

An odd networking event in GCP resulted in a sudden, random inability of some client machines (typically web + api, but some others too) to make connections to servers (redis and gitaly specifically noticed). This caused Puma (Rails worker) threads to stall for up to 60 seconds before they timed out; requests that hit this state received an HTTP 500. Once all the threads on the affected nodes were busy/stuck, the healthchecks began failing and those nodes dropped from the load balancer. Patroni-04, a DB replica, lost its ability to communicate with the primary DB, stopped replicating, and was dropped from usage.
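
As an illustration of that failure mode, the toy sketch below (not GitLab's actual code; the address, port, and thread counts are invented for the example, and the timeout is shortened from the 60 seconds seen in the incident) shows how a fixed-size worker pool saturates when outbound connections silently stall: each worker blocks on a connect that cannot complete until its timeout expires, so a few bad backends can pin every thread and starve even the healthcheck.

```python
# Toy demonstration only: a fixed-size thread pool whose workers block on
# connects to an unreachable backend. 192.0.2.1 (TEST-NET-1) is a reserved
# address used here as a stand-in for a backend whose replies are being
# dropped; the 5s timeout stands in for the 60s request timeout in production.
import socket
from concurrent.futures import ThreadPoolExecutor

WORKER_THREADS = 4                          # stand-in for a Puma thread pool
STALLED_BACKEND = ("192.0.2.1", 6379)       # hypothetical unreachable redis

def handle_request(i: int) -> str:
    try:
        with socket.create_connection(STALLED_BACKEND, timeout=5):
            return f"request {i}: ok"
    except OSError as exc:                  # socket.timeout is an OSError
        return f"request {i}: failed after stalling ({exc})"

with ThreadPoolExecutor(max_workers=WORKER_THREADS) as pool:
    # Once WORKER_THREADS requests are stuck, later requests (including a
    # naive healthcheck that needs a free thread) just queue behind them.
    for line in pool.map(handle_request, range(8)):
        print(line)
```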

This behavior had been observed on a single node (web-27) earlier in the day, so it was fairly easily identified. The fix was to perform a full shutdown (from the OS) and then restart (via GCP console or CLI); a simple reboot (which likely leaves the VM running on the same hypervisor, subject to the broken state) was insufficient. Once the web nodes had been through the stop/start cycle, most of the symptoms subsided. The API nodes seemed to be less affected, for reasons I'll discuss later. The loss of patroni-04 from the replica rotation (recovered after reboot) had minimal effect on the system.
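
For reference, a minimal sketch of that stop/start cycle via the gcloud CLI, wrapped in Python. It assumes an installed and authenticated gcloud; the project, zone, and node name are placeholders, and it uses `gcloud compute instances stop` in place of shutting down from the OS.

```python
# Minimal sketch of the mitigation: a full stop followed by a start, which
# lets GCP place the VM on a (possibly different) host, rather than an
# in-place reboot that can leave it on the same (broken) hypervisor.
# Assumes the gcloud CLI is installed and authenticated; names are placeholders.
import subprocess

PROJECT = "example-project"   # placeholder, not the real GCP project
ZONE = "us-east1-c"           # placeholder zone
NODE = "web-16"               # one of the affected nodes

def gcloud_instances(action: str, node: str) -> None:
    subprocess.run(
        ["gcloud", "compute", "instances", action, node,
         f"--project={PROJECT}", f"--zone={ZONE}"],
        check=True,
    )

gcloud_instances("stop", NODE)   # full shutdown, not just a reboot
gcloud_instances("start", NODE)  # boots again, potentially on new hardware
```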

  1. Service(s) affected: Web + API
  2. Team attribution: ~"team::Core-Infra"
  3. Time to detection: ~2 minutes
  4. Minutes downtime or degradation:
  • From the Rails logs (request timeouts):
    • Substantial degradation: 5 minutes
    • Including low grade degradation: 18 minutes
  • From the platform metrics dashboard:
    • Latency apdex
      • API: 23 minutes
      • Web: 18 minutes
      • frontend (not sure what this is, perhaps haproxy?): 35 minutes
      • git: 35 minutes
    • Error ratios:
      • API: 26 minutes
      • web: 18 minutes
      • frontend: 28 minutes
      • git: 23 minutes

Metrics

Requests that timed out in Rails

[graph omitted]

500s from Workhorse

[graph omitted]

Platform metrics:

[graphs omitted]

From https://dashboards.gitlab.net/d/general-triage/general-platform-triage?orgId=1&from=1609912800000&to=1609917300000

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Some (random) subset of gitlab.com users at the time.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. Some (random) requests would timeout at 60 seconds with an HTTP 500.
  3. How many customers were affected?
    1. On the order of 18,000 unique IPs (actual user count unknown), out of a total of about 147,000 unique IPs seen (by workhorse) during the period.
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. Approximately 12% of active IP addresses saw errors, but only 0.6% of all requests received an error, i.e. it was fairly sporadic/random for the users who saw it.
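
A quick check of the arithmetic behind those estimates (the inputs are simply the approximate counts quoted above, not fresh measurements):

```python
# Rough check of the impact estimate using the approximate unique-IP figures
# quoted in this review.
affected_ips = 18_000
total_ips = 147_000
print(f"{affected_ips / total_ips:.1%} of active IPs saw at least one error")
# -> ~12.2%, consistent with the ~12% above; the per-request error ratio
#    (~0.6%) is much lower because failures were sporadic for any given user.
```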

What were the root causes?

"5 Whys" Not sure we can go very deep here:

  1. Why did requests time out? A GCP networking failure meant TCP reply packets were dropped somewhere. The actual details are probably out of our control, although we await a response from GCP to confirm this.
  2. Why did users get errors? The healthcheck on the VMs just checks whether the process is responding, so as long as there is a free thread the healthcheck is highly likely to pass, even if an actual request that depends on other resources will go nowhere. Note that this is mostly ok, as an expensive 'check everything' healthcheck has its own problems (flakier, slower, more expensive). It might be possible to make the healthcheck report some periodically obtained status (see the sketch below), and I believe in kubernetes the readiness vs liveness checks disambiguate this a bit more.
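
To make that concrete, here is a minimal sketch (not GitLab's healthcheck; the backend address, port, and intervals are invented for the example) of the liveness/readiness split: liveness only proves the process can answer, while readiness serves a dependency status that a background thread refreshes periodically, so the check itself stays cheap.

```python
# Minimal sketch of a cheap healthcheck that reports periodically obtained
# dependency status. Liveness: "the process answers". Readiness: "a recent
# background probe of a dependency succeeded". Names/ports are placeholders.
import socket
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

DEPENDENCY = ("redis.internal.example", 6379)   # hypothetical backend
dependency_ok = True                            # cached status, refreshed below

def probe_dependency_forever() -> None:
    global dependency_ok
    while True:
        try:
            with socket.create_connection(DEPENDENCY, timeout=2):
                dependency_ok = True
        except OSError:
            dependency_ok = False
        time.sleep(10)                          # refresh the cached status

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/liveness":
            self.send_response(200)             # process is up and has a thread
        elif self.path == "/readiness":
            # Serve the cached result; a 503 drops the node from the load
            # balancer without every check paying for a dependency round-trip.
            self.send_response(200 if dependency_ok else 503)
        else:
            self.send_response(404)
        self.end_headers()

threading.Thread(target=probe_dependency_forever, daemon=True).start()
HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```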

Incident Response Analysis

  1. How was the incident detected?
    1. Error rates and SLI monitoring
  2. How could detection time be improved?
    1. Unlikely that it could; 2 minutes from onset to first alert is pretty good
  3. How was the root cause diagnosed?
    1. A bit of luck from having seen it in a less impactful way (on a single node) a few hours earlier. For that incident, the diagnosis came from looking at the symptoms and surmising that they looked like network timeouts, then finding the supporting evidence in the output of netstat.
  4. How could time to diagnosis be improved?
    1. This is a pretty rare sort of random event. It shows up fairly strongly as a rise in the node_netstat_TcpExt_TCPSynRetrans metric, which we could specifically alert on with links to this case and the known fixes (see the query sketch after this list), although that feels very specific to something that might never recur in this fashion.
  5. How did we reach the point where we knew how to mitigate the impact?
    1. By correlating the outage behavior with earlier observations; in the previous incident, by experimentation (restarting puma, rebooting the box, then a full stop/start cycle).
  6. How could time to mitigation be improved?
    1. Not sure it could; this was a pretty obscure failure mode. If the major incident had been the first time we saw it, I think it would have taken quite some time to figure out (given the added pressure).
  7. What went well?
    1. Nodes did drop out of the load balancer eventually, keeping the problem fairly contained.
    2. Databases were fine; patroni-04 dropped out when it got too far behind, and a simple reboot brought it back (patroni/postgres handled things nicely). Load rose on the others, including the primary, but the system handled the loss of a single node as designed/expected.
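
Referenced from item 4 above: a minimal sketch of checking that SYN-retransmit signal against Prometheus. The Prometheus URL is a placeholder and the threshold is an illustrative guess, not a tuned alerting value.

```python
# Sketch: list nodes with an elevated TCP SYN-retransmit rate, the signal that
# stood out during this incident. PROM_URL and the threshold are placeholders.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.internal.example:9090"         # placeholder
QUERY = "rate(node_netstat_TcpExt_TCPSynRetrans[5m]) > 5"    # guessed threshold

url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)["data"]["result"]

for s in series:
    instance = s["metric"].get("instance", "unknown")
    rate = float(s["value"][1])
    print(f"{instance}: {rate:.1f} SYN retransmits/s - candidate for a stop/start")
```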

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. #3281 (closed) earlier in the day (by a few hours)
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. No.
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. No.

Lessons Learned

  • Cloud networking is just someone else's networking gear, and it can do weird things.
  • The IMOC and CMOC roles are different; assuming an S1, we likely needed comms rather than incident management, and I should have paged the CMOC (or maybe the IMOC; see Corrective Actions).
  • Sometimes just turning it off and on again is the simplest and most correct answer:

[image: off-and-on-again]

Guidelines

Resources

  1. If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)

Incident Review Stakeholders
