2020-06-29: Connectivity loss with nodes in `us-east1-c` during deployment.
Summary
2020-06-29: Google Cloud Platform Virtual Machine Network Connectivity Issues
Following a deploy, we noticed that we'd lost a number of nodes in our web fleet. We'd initially believed this to be a result of the release, but after investigating further we came to the conclusion that VMs in `us-east1-c`, specifically of type C, were unreachable. Google Cloud Platform has since acknowledged an issue with networking and VM instance creation in `us-east1`.
Our strategy for mitigating the issue was to scale up the fleet across other Availability Zones (AZs), but it's possible that we'll be stymied by the inability to instantiate new VMs, and by the fact that we're not hosted in any other GCP regions.
Timeline
All times UTC.
2020-06-29
Status.io incident
- 14:52 - nodes go offline
- 14:57 - PagerDuty Alert
- 15:07 - cmcfarland declares an incident in Slack using the `/incident declare` command.
- 15:18 - status.io update
- 15:19 - @AnthonySandoval reached out to Rackspace support in a dedicated Slack channel.
- 15:23 - An incident ticket in the Rackspace portal was created.
- 15:31 - status.io update
- 15:41 - @cmcfarland opens an MR to increase the number of nodes across all Availability Zones (AZs) https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1869
- 15:45 - The issue is updated to an ~S1 incident.
- 15:46 - Redis persistent KV reported that we lost a secondary.
- 15:47 - status.io update indicating CI queuing.
- 15:48 - Google Cloud updates their status page with an incident - https://status.cloud.google.com/incident/cloud-networking/20005
- 15:56 - Google Cloud updates their status page with a second incident - https://status.cloud.google.com/incident/compute/20004
- 16:05 - status.io update
- 16:10 - Our Web Apdex score fully recovers as overall traffic decreases.
- 16:20 - status.io update
- 16:27 - Google Cloud updates that the issue is likely to be resolved in 30 minutes for `us-east1-d`.
- 16:39 - status.io update
- 16:50 - New capacity from the Terraform changes went into effect and Web fleet saturation dropped below normal rates.
- 16:57 - The queue of CI jobs is down to zero — pipelines are processing normally again.
- 16:59 - status.io update
- 17:00 - Google Cloud updates their status page indicating full restoration of service to `us-east1-d`. Indicates there is no ETA for service restoration in `us-east1-c`.
- 17:30, 18:00, 18:45 - Google Cloud updates their status page. `us-east1-c` is restored except for issues with Persistent Disk. Persistent Disks are network devices in GCP. We continue to experience issues ssh'ing to VMs in `us-east1-c` that have persistent disks.
- 20:06 - Google Cloud Platform issues the all clear.
- 20:09 - `web-01` came back online, took all the traffic from the `fe-` HAProxy load balancers, and became oversaturated.
- 20:30 - The remaining `web-*` VMs rebooted successfully and have alleviated the single-node saturation of `web-01`.
Incident Review
Summary
A Google Cloud outage removed approximately 33% of our fleet: web, load balancers, sidekiq, api, git, etc.
- Service(s) affected: Almost all fleets of servers affected
- Team attribution:
- Minutes downtime or degradation: Roughly 1.5 hours of poor performance, though the event lasted longer
Metrics
Customer Impact
- Who was impacted by this incident? All customers.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Slow responses, poor performance, and slow CI jobs.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? PagerDuty alerts were the first indications
- How could detection time be improved? 5 minutes to first page. Not sure we can do much better.
- How did we reach the point where we knew how to mitigate the impact? Determining that this was not a service failure and that we could not access the nodes helped us establish this as a Google outage (see the diagnostic sketch after this list). At that point, we discussed mitigating by creating new nodes in other availability zones.
- How could time to mitigation be improved? Autoscaling capabilities in any form would have possibly mitigated quicker.
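As a rough illustration of the kind of check that helped establish this as a provider outage, the sketch below (a hypothetical diagnostic, not the tooling we actually used) uses the `google-cloud-compute` Python client to summarize VM statuses per zone; the project ID and zone list are placeholders.

```python
# Hypothetical diagnostic sketch: summarize VM status per zone to help
# distinguish a zonal provider outage from an application-level failure.
# Assumes the google-cloud-compute client; project/zones are placeholders.
from collections import Counter

from google.cloud import compute_v1

PROJECT = "example-project"  # placeholder, not our real project ID
ZONES = ["us-east1-b", "us-east1-c", "us-east1-d"]


def status_by_zone(project: str, zones: list[str]) -> dict[str, Counter]:
    """Return a count of instance statuses (RUNNING, TERMINATED, ...) per zone."""
    client = compute_v1.InstancesClient()
    summary: dict[str, Counter] = {}
    for zone in zones:
        counts: Counter = Counter()
        for instance in client.list(project=project, zone=zone):
            counts[instance.status] += 1
        summary[zone] = counts
    return summary


if __name__ == "__main__":
    for zone, counts in status_by_zone(PROJECT, ZONES).items():
        print(zone, dict(counts))
```

Note that during a zonal networking outage the API may still report instances as `RUNNING`; instances that report `RUNNING` but cannot be reached over SSH or HTTP are themselves a signal that the problem is below the application layer.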
Post Incident Analysis
- How was the root cause diagnosed? Examining the Google Web console clearly told us that `us-east1-c` was down.
- How could time to diagnosis be improved? Unknown
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? I think that any autoscaling system (like Kubernetes) could have mitigated or reduced the impact of this type of outage.
- Was this incident triggered by a change (deployment of code or change to infrastructure; if yes, have you linked the issue which represents the change)? No
5 Whys
- Alerts were generated noting that we had lost redundancy in web nodes, that some Redis services had stopped, etc.
- Why did we lose redundancy and services?
- Due to a Google Cloud Platform outage, at least one availability zone had stopped functioning, specifically with a set of processors we relied on for several types of nodes. This caused saturation of our remaining web and git nodes.
- Why did this cause saturation of our remaining web and git nodes?
- The number of git and web nodes in 2/3 (or even 1/3) of our availability zones was not enough to handle the requests being generated.
- Why were the number of nodes provisioned in each availability zone (or two of them) not enough to handle the traffic?
- The methods of sizing our fleet may not take into account an availability zone outage.
- Why don't the methods of sizing our fleet take into account an availability zone outage?
- There are two possible responses to this:
- After updating many nodes to a more efficient processor type, we downsized to save money. I don't know if this process involved any allowances for extra fleet overhead for an outage. That could be a corrective item, especially in the short term (see the sizing sketch after this list). It is expensive to keep idle resources around to prevent an outage, but it might be a requirement until suggestion 2 below is implemented.
- We don't use dynamic sizing for our fleets based on demand. Whether by a Kubernetes migration or by moving to auto-scaling groups for VMs, either approach would probably have scaled up in the functioning availability zones to automatically handle the growing traffic.
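As a minimal sketch of the sizing allowance discussed above (assuming an even spread across zones and made-up node counts), the per-zone capacity needed to tolerate a single-zone failure can be estimated like this:

```python
# Hypothetical back-of-the-envelope sizing: with the fleet spread evenly across
# N zones, the remaining N-1 zones must still cover peak demand on their own.
import math


def nodes_per_zone(peak_nodes_needed: int, zones: int, headroom: float = 1.0) -> int:
    """Nodes to run in each zone so that losing any single zone still leaves
    enough capacity for peak demand (scaled by an optional headroom factor)."""
    surviving_zones = zones - 1
    return math.ceil(peak_nodes_needed * headroom / surviving_zones)


# Made-up example: if peak traffic needs 30 web nodes spread across 3 zones,
# sizing without an outage allowance gives 10 nodes per zone (30 total), while
# tolerating the loss of any one zone requires 15 per zone (45 total).
print(nodes_per_zone(peak_nodes_needed=30, zones=3))  # -> 15
```

The same arithmetic applies whether the extra capacity is held as idle headroom (option 1 above) or provided on demand by an autoscaler (option 2); autoscaling avoids paying for the extra nodes until a zone is actually lost, at the cost of depending on the provider having capacity available during the incident.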
Lessons Learned
Corrective Actions
- Scale up web nodes - July 10 2020 - @cmcfarland
- Re-size fleets to allow for single availability zone failure - August 31 2020 - TBD
- GitLab.com on Kubernetes - This is an epic of work in progress to move GitLab.com onto Kubernetes
- Audit zonal redundancies
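For the zonal redundancy audit, one possible starting point (a sketch only, assuming the `google-cloud-compute` client; the project ID and name-prefix convention are placeholders) is to list which zones each fleet runs in and flag fleets that exist in only one zone:

```python
# Hypothetical audit sketch: group VMs by fleet prefix (e.g. "web", "git") and
# flag fleets running in fewer than two zones. Project ID is a placeholder.
from collections import defaultdict

from google.cloud import compute_v1

PROJECT = "example-project"  # placeholder, not our real project ID


def zones_by_fleet(project: str) -> dict[str, set[str]]:
    """Map a fleet prefix (text before the first '-' in the VM name) to the
    set of zones it runs in, using the aggregated instance listing."""
    client = compute_v1.InstancesClient()
    fleets: dict[str, set[str]] = defaultdict(set)
    for zone_path, scoped in client.aggregated_list(project=project):
        for instance in scoped.instances:
            fleet = instance.name.split("-")[0]
            fleets[fleet].add(zone_path.rsplit("/", 1)[-1])
    return fleets


if __name__ == "__main__":
    for fleet, zones in sorted(zones_by_fleet(PROJECT).items()):
        if len(zones) < 2:
            print(f"{fleet}: only in {sorted(zones)} (no zonal redundancy)")
```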