Gitaly node file-32 lost network connectivity
Summary
Status: Service is restored.
One of our Gitaly nodes (file-32-stor-gprd) was effectively down due to severe network connectivity loss. Git repositories stored on that node were temporarily inaccessible; they became available again once network connectivity was restored.
The network outage affected both application ports (e.g. 9999) and system service ports (e.g. 22). ICMP pings failed, and the GCP console for the VM failed to load. Stackdriver logs remained available, and they showed that processes running locally on the VM also detected the network disruption.
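For illustration, here is a minimal Python sketch of this kind of port-level connectivity check. The hostname file-32-stor-gprd and ports 9999 and 22 come from this report, but the script itself is hypothetical and is not the tooling used during the incident (ICMP is omitted because raw sockets require elevated privileges):

```python
# Sketch of a basic TCP reachability check against the affected node.
# Hostname and ports are taken from this report; the script is illustrative.
import socket

HOST = "file-32-stor-gprd"   # assumed internal hostname
PORTS = [9999, 22]           # Gitaly application port and sshd

def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in PORTS:
    state = "open" if probe(HOST, port) else "unreachable"
    print(f"{HOST}:{port} -> {state}")
```

During this incident, a check like this would have reported every port as unreachable, consistent with the total loss of connectivity described above.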
Hard reboot did not resolve the problem.
Packet tracing on client hosts showed 0 packets from file-32, even on the handful of connections where the client TCP state was still listed as ESTABLISHED.
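For reference, the lingering ESTABLISHED state can be inspected on a client host straight from the kernel's connection table. Below is a sketch assuming Linux and IPv4, with a placeholder peer address standing in for file-32; it is not the tracing tool actually used:

```python
# Sketch: list client-side TCP connections to a given peer that the kernel
# still considers ESTABLISHED, by reading /proc/net/tcp (Linux, IPv4 only).
# The peer IP below is a placeholder, not file-32's real address.
import socket
import struct

PEER_IP = "10.0.0.32"  # hypothetical address of file-32-stor-gprd
ESTABLISHED = "01"     # TCP state code used in /proc/net/tcp

def decode(hex_addr: str) -> tuple[str, int]:
    """Convert a /proc/net/tcp address like '0100007F:1F90' to ('127.0.0.1', 8080)."""
    ip_hex, port_hex = hex_addr.split(":")
    # The IPv4 address is stored as little-endian hex.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)

with open("/proc/net/tcp") as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.split()
        local, remote, state = fields[1], fields[2], fields[3]
        rip, rport = decode(remote)
        if rip == PEER_IP and state == ESTABLISHED:
            lip, lport = decode(local)
            print(f"{lip}:{lport} -> {rip}:{rport} ESTABLISHED")
```

Connections stay in this state because ESTABLISHED is only cleared by traffic from the peer (FIN or RST) or by a local timeout, which is exactly why the blackholed node left stale entries behind.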
GCP Support determined that the hypervisor was blackholing egress packets sent by the guest VM. This is consistent with Stackdriver logs showing that processes running on file-32 were receiving UDP messages while peers never acknowledged receiving its outgoing UDP messages. It also explains why TCP clients still listed connections as ESTABLISHED after we rebooted the VM: the clients never received TCP RST responses from file-32's kernel because the hypervisor dropped those response packets.
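To make the asymmetry concrete, here is a hedged sketch of a one-way UDP probe: the datagram leaves the client and (per file-32's Stackdriver logs) arrives, but the reply is dropped on file-32's egress path, so the client times out. Host, port, and payload are illustrative, not the incident's real traffic:

```python
# Sketch of the observed asymmetry: a UDP datagram reaches the peer,
# but no reply ever comes back when the peer's egress path is blackholed.
import socket

PEER = ("file-32-stor-gprd", 9999)  # hypothetical UDP endpoint

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(3.0)
sock.sendto(b"ping", PEER)  # sending succeeds locally even if egress is dropped
try:
    reply, addr = sock.recvfrom(1024)
    print(f"reply from {addr}: {reply!r}")
except socket.timeout:
    # Matches the symptom: file-32 received our datagrams (per its
    # Stackdriver logs) but its responses never reached us.
    print("no reply: peer egress may be blackholed")
finally:
    sock.close()
```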
Support ticket with Rackspace: https://portal.rackspace.com/1173105/tickets/details/191015-ord-0001000
Service(s) affected: Gitaly
Team attribution: Infrastructure
Minutes downtime or degradation: 6h 22m (382 minutes, from 2019-10-15 18:31 to 2019-10-16 00:53 UTC)
Cause: GCP blackholed egress network traffic from this VM. Root cause TBD.
Timeline
2019-10-15
- 18:31 UTC - Inferred start of the disruption: the rate of HTTP 5xx responses from frontends began rising, and the number of established healthy TCP connections to the file-32 Gitaly node began declining.
- 19:08 UTC - First PagerDuty alert: "High Rails Error Rate on Front End", triggered by the HTTP 5xx response rate from frontend services briefly exceeding its SLO. The spike quickly subsided and the alert auto-cleared, but this event started our investigation.
- 19:26 UTC - PagerDuty alert that file-32 was down. We quickly determined this was likely the cause of the HTTP 5xx responses we had been investigating. Repeated attempts to connect to the VM via ssh and console all failed. Stackdriver logs remained available and confirmed that the VM's running processes were also reporting network disruption.
- 20:07 UTC - A hard reboot of file-32 still did not restore network connectivity.
- 21:19 UTC - GCP Support confirmed that the hypervisor running our file-32 VM was dropping egress traffic from that VM, breaking both new and established connections. Support began working on a fix.
- 22:26 UTC - GCP Support escalated to their Engineering team for assistance. Results so far: "By looking at the status of the case, and the internal case opened with the SRE team, it seems the return packets from "file-32-stor-gprd" to the client VM(s) is blackhole-ed."
2019-10-16
- 00:53 UTC - Bidirectional network connectivity was restored to the file-32 VM. Gitaly and related services on that VM immediately recovered and returned to service. No news yet from GCP Support on why this VM was blackholed (and why others were not) or what was done to clear that state.
- 01:45 UTC - Rebooted the file-32 VM to ensure there were no residual runtime effects.
- 01:46 UTC - file-32 was back online and in service. Downtime for the reboot was approximately 56 seconds (01:45:50 to 01:46:46 UTC).