2024-05-16: Incident Review: GCP Networking Outage
Incident Review
There is not much to say about this one. Google had a global outage related to VPC networking, and we experienced some side effects of it. There wasn't anything we could have done differently; we just had to wait until they fixed the issue. They knew about it immediately, and we diagnosed it fairly quickly. Once the issue was resolved on their end, there was nothing more we needed to do to restore service. It was all in Google's hands.
The DRI for the incident review is the issue assignee.
-
Announce the incident review in the incident channel on Slack.
:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.
-
If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
-
If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
-
Fill out relevant sections below or link to the meeting review notes that cover these topics.
-
If there is a need to schedule a synchronous review, complete the following steps:
-
In this issue, @mention the EOC, IMOC, and other parties who were involved to let them know that we would like to schedule a sync review discussion of this issue.
-
Schedule a meeting that works best for those involved, and put a link to this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue, and any questions that arise from it.
-
Ensure that the meeting is recorded. When complete, upload the recording to GitLab Unfiltered.
-
Customer Impact
-
Who was impacted by this incident? (i.e. external customers, internal customers)
- Internal and external customers who were either working on projects hosted on one of the two affected Gitaly nodes,
- or who were waiting on delayed CI jobs
-
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- Customers on the affected Gitaly nodes received 503 errors
- Customers who ran CI jobs during the network issues saw those jobs delayed by up to 20 minutes
-
How many customers were affected?
- ...
-
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- ...
What were the root causes?
- GCP VPC issues as described here: #18023 (comment 1909749200)
Incident Response Analysis
-
How was the incident detected?
- The first indication was alerts about two unavailable Gitaly nodes
- We also got alerts regarding CI delays
- Customers opened tickets regarding CI delays
- Customers opened tickets regarding 503 errors
-
How could detection time be improved?
- Detection was fast enough, but once we started to suspect GCP networking, we lacked a fast way to confirm that it was the cause.
-
How was the root cause diagnosed?
- Checking the GCP status page
-
How could time to diagnosis be improved?
- We didn't think to check the GCP status page until the symptoms started looking like networking issues. An alert about networking issues would have been a good cue to check it sooner.
-
How did we reach the point where we knew how to mitigate the impact?
- We had already suspected network issues based on the behavior of the two Gitaly nodes (running but not reachable, and logging outgoing as well as incoming connection issues). Checking the status page confirmed it.
-
How could time to mitigation be improved?
- Alerts around network issues
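
As a rough illustration of the kind of network alert discussed above, the sketch below probes a list of nodes over TCP from a separate vantage point and flags the situation when several are unreachable at once. This is a minimal sketch, not our actual alerting: the function names, the threshold, and the host names and port in the usage note are all hypothetical, and a real alert would more likely live in the existing monitoring stack.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_targets(targets, timeout: float = 2.0):
    """Return the (host, port) pairs that could not be reached."""
    return [(h, p) for h, p in targets if not tcp_reachable(h, p, timeout)]


def looks_like_network_issue(targets, threshold: int = 2, timeout: float = 2.0) -> bool:
    """Hypothetical heuristic: several nodes unreachable at the same time
    suggests a network-level problem rather than a single failed node."""
    return len(unreachable_targets(targets, timeout)) >= threshold
```

For example, `looks_like_network_issue([("gitaly-a.example.internal", 9999), ("gitaly-b.example.internal", 9999)])` (hypothetical hosts and port) would return `True` if both probes fail. A positive result would only be a cue to check the GCP status page, not a diagnosis on its own.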
Post Incident Analysis
-
Did we have other events in the past with the same root cause?
- ...
-
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- ...
-
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
- No
What went well?
- A representative of every team that we needed help from showed up on the incident call without prompting. We didn't have to go looking for anyone.
Guidelines
Edited by Devin Sylva