2020-06-29: Connectivity loss with nodes in `us-east1-c` during deployment.
Summary
2020-06-29: Google Cloud Platform Virtual Machine Network Connectivity Issues
Following a deploy, we noticed that we'd lost a number of nodes in our web fleet. We'd initially believed this to be a result of the release, but after investigating further we came to the conclusion that VMs in `us-east1-c`, specifically of type C, were unreachable. Google Cloud Platform has since acknowledged an issue with networking and VM instance creation in `us-east1`.
Our strategy for mitigating the issue was to scale up the fleet across other Availability Zones (AZs), but it's possible that we'll be stymied by the inability to instantiate new VMs, and by the fact that we're not hosted in any other GCP regions.
Timeline
All times UTC.
2020-06-29
Status.io incident
- 14:52 - nodes go offline
- 14:57 - PagerDuty Alert
- 15:07 - cmcfarland declares an incident in Slack using the `/incident declare` command.
- 15:18 - status.io update
- 15:19 - @AnthonySandoval reached out to Rackspace support in a dedicated Slack channel.
- 15:23 - An incident ticket in the Rackspace portal was created.
- 15:31 - status.io update
- 15:41 - @cmcfarland opens an MR to increase the number of nodes across all Availability Zones (AZs) https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1869
- 15:45 - The issue is updated to an ~S1 incident.
- 15:46 - Redis persistent KV reported that we lost a secondary.
- 15:47 - status.io update indicating CI queuing.
- 15:48 - Google Cloud updates their status page with an incident - https://status.cloud.google.com/incident/cloud-networking/20005
- 15:56 - Google Cloud updates their status page with a second incident - https://status.cloud.google.com/incident/compute/20004
- 16:05 - status.io update
- 16:10 - Our Web Apdex score fully recovers as overall traffic decreases.
- 16:20 - status.io update
- 16:27 - Google Cloud updates that the issue is likely to be resolved in 30 minutes for `us-east1-d`.
- 16:39 - status.io update
- 16:50 - New capacity from the Terraform changes went into effect and Web fleet saturation dropped below normal rates.
- 16:57 - The queue of CI jobs is down to zero — pipelines are processing normally again.
- 16:59 - status.io update
- 17:00 - Google Cloud updates their status page indicating full restoration of service to `us-east1-d`. Indicates there is no ETA for service restoration in `us-east1-c`.
- 17:30, 18:00, 18:45 - Google Cloud updates their status page. `us-east1-c` is restored except for issues with Persistent Disk. Persistent Disks are network devices in GCP. We continue to experience issues ssh'ing to VMs in `us-east1-c` that have persistent disks.
- 20:06 - Google Cloud Platform issues the all clear.
- 20:09 - `web-01` came back online, took all the traffic from the `fe-` HAProxy load balancers, and became oversaturated.
- 20:30 - The remaining `web-*` VMs rebooted successfully and have alleviated the single-node saturation of `web-01`.
Incident Review
Summary
A Google Cloud outage removed approximately 33% of our fleet: web, load balancers, sidekiq, api, git, etc.
- Service(s) affected: Almost all fleets of servers affected
- Team attribution:
- Minutes downtime or degradation: Roughly 1.5 hours of poor performance, though the event lasted longer
Metrics
Customer Impact
- Who was impacted by this incident? All customers.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Slow responses, poor performance, and slow CI jobs.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? PagerDuty alerts were the first indications
- How could detection time be improved? 5 minutes to first page. Not sure we can do much better.
- How did we reach the point where we knew how to mitigate the impact? Determining that this was not a service failure and that we could not access the nodes helped us establish this as a Google outage (see the diagnostic sketch after this list). At that point, we discussed mitigating by creating new nodes in other availability zones.
- How could time to mitigation be improved? Autoscaling capabilities in any form would have possibly mitigated quicker.
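As a rough illustration of the kind of check that helped establish this as a provider outage, the sketch below (a hypothetical diagnostic, not the tooling we actually used) uses the `google-cloud-compute` Python client to summarize VM statuses per zone; the project ID and zone list are placeholders.

```python
# Hypothetical diagnostic sketch: summarize VM status per zone to help
# distinguish a zonal provider outage from an application-level failure.
# Assumes the google-cloud-compute client; project/zones are placeholders.
from collections import Counter

from google.cloud import compute_v1

PROJECT = "example-project"  # placeholder, not our real project ID
ZONES = ["us-east1-b", "us-east1-c", "us-east1-d"]


def status_by_zone(project: str, zones: list[str]) -> dict[str, Counter]:
    """Return a count of instance statuses (RUNNING, TERMINATED, ...) per zone."""
    client = compute_v1.InstancesClient()
    summary: dict[str, Counter] = {}
    for zone in zones:
        counts: Counter = Counter()
        for instance in client.list(project=project, zone=zone):
            counts[instance.status] += 1
        summary[zone] = counts
    return summary


if __name__ == "__main__":
    for zone, counts in status_by_zone(PROJECT, ZONES).items():
        print(zone, dict(counts))
```

Note that during a zonal networking outage the API may still report instances as `RUNNING`; instances that report `RUNNING` but cannot be reached over SSH or HTTP are themselves a signal that the problem is below the application layer.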
Post Incident Analysis
- How was the root cause diagnosed? Examining the Google Web console clearly told us that `us-east1-c` was down.
- How could time to diagnosis be improved? Unknown
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? I think that any autoscaling system (like Kubernetes) could have mitigated or reduced the impact of this type of outage.
- Was this incident triggered by a change (deployment of code or change to infrastructure; if yes, have you linked the issue which represents the change)? No
5 Whys
- Alerts were generated noting that we had lost redundancy in web nodes, that some Redis services had stopped, etc.
- Why did we lose redundancy and services?
- Due to a Google Cloud Platform outage, at least one availability zone had stopped functioning, specifically with a set of processors we relied on for several types of nodes. This caused saturation of our remaining web and git nodes.
- Why did this cause saturation of our remaining web and git nodes?
- The number of git and web nodes in 2/3 (or even 1/3) of our availability zones was not enough to handle the requests being generated.
- Why were the number of nodes provisioned in each availability zone (or two of them) not enough to handle the traffic?
- The methods of sizing our fleet may not take into account an availability zone outage.
- Why don't the methods of sizing our fleet take into account an availability zone outage?
- There are two possible responses to this:
- After updating many nodes to a more efficient processor type, we downsized to save money. I don't know if this process involved any allowances for extra fleet overhead for an outage. That could be a corrective item, especially in the short term (see the sizing sketch after this list). It is expensive to keep idle resources around to prevent an outage, but it might be a requirement until suggestion 2 below is implemented.
- We don't use dynamic sizing for our fleets based on demand. Whether by a Kubernetes migration or by moving to auto-scaling groups for VMs, either approach would probably have scaled up in the functioning availability zones to automatically handle the growing traffic.
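As a minimal sketch of the sizing allowance discussed above (assuming an even spread across zones and made-up node counts), the per-zone capacity needed to tolerate a single-zone failure can be estimated like this:

```python
# Hypothetical back-of-the-envelope sizing: with the fleet spread evenly across
# N zones, the remaining N-1 zones must still cover peak demand on their own.
import math


def nodes_per_zone(peak_nodes_needed: int, zones: int, headroom: float = 1.0) -> int:
    """Nodes to run in each zone so that losing any single zone still leaves
    enough capacity for peak demand (scaled by an optional headroom factor)."""
    surviving_zones = zones - 1
    return math.ceil(peak_nodes_needed * headroom / surviving_zones)


# Made-up example: if peak traffic needs 30 web nodes spread across 3 zones,
# sizing without an outage allowance gives 10 nodes per zone (30 total), while
# tolerating the loss of any one zone requires 15 per zone (45 total).
print(nodes_per_zone(peak_nodes_needed=30, zones=3))  # -> 15
```

The same arithmetic applies whether the extra capacity is held as idle headroom (option 1 above) or provided on demand by an autoscaler (option 2); autoscaling avoids paying for the extra nodes until a zone is actually lost, at the cost of depending on the provider having capacity available during the incident.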
Lessons Learned
Corrective Actions
- Scale up web nodes - July 10 2020 - @cmcfarland
- Re-size fleets to allow for single availability zone failure - August 31 2020 - TBD
- GitLab.com on Kubernetes - This is an epic of work in progress to move GitLab.com onto Kubernetes
- Audit zonal redundancies
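For the zonal redundancy audit, one possible starting point (a sketch only, assuming the `google-cloud-compute` client; the project ID and name-prefix convention are placeholders) is to list which zones each fleet runs in and flag fleets that exist in only one zone:

```python
# Hypothetical audit sketch: group VMs by fleet prefix (e.g. "web", "git") and
# flag fleets running in fewer than two zones. Project ID is a placeholder.
from collections import defaultdict

from google.cloud import compute_v1

PROJECT = "example-project"  # placeholder, not our real project ID


def zones_by_fleet(project: str) -> dict[str, set[str]]:
    """Map a fleet prefix (text before the first '-' in the VM name) to the
    set of zones it runs in, using the aggregated instance listing."""
    client = compute_v1.InstancesClient()
    fleets: dict[str, set[str]] = defaultdict(set)
    for zone_path, scoped in client.aggregated_list(project=project):
        for instance in scoped.instances:
            fleet = instance.name.split("-")[0]
            fleets[fleet].add(zone_path.rsplit("/", 1)[-1])
    return fleets


if __name__ == "__main__":
    for fleet, zones in sorted(zones_by_fleet(PROJECT).items()):
        if len(zones) < 2:
            print(f"{fleet}: only in {sorted(zones)} (no zonal redundancy)")
```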