2022-01-31 gitlab.com Site-wide outage
Incident DRI
Current Status
This incident is resolved. The root cause was determined to have been a power issue in a Google data center. Full details can be found here: https://gitlab.com/gitlab-com/gl-infra/production/uploads/36132a9050597f8ce711e214772c9b26/GCP_PD_Unhealthy_Devices_RCA_-31_Jan_2022-_omg-47701-is.pdf
From 15:12 UTC to 15:39 UTC, all users of GitLab.com experienced errors. The root cause was a GCP power outage that affected a server floor in the datacenter where two Patroni replicas and three Gitaly nodes were located.
We investigated the impact in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15095
Three Patroni replica nodes as well as three Gitaly nodes experienced increased load caused by high iowait values. All affected nodes were in the us-east-1-c AZ. Two of these nodes, patroni-v12-0{4,7}, were showing increased error counts as well as PostgreSQL processes hanging due to iowait, so we decided to set them to maintenance mode and drain their traffic, which helped recover the site.
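Setting a replica to maintenance mode here means taking it out of service discovery so clients stop routing traffic to it. A minimal sketch of that kind of drain, assuming Consul's agent HTTP API and an illustrative service ID (not the exact procedure used during the incident):

```python
# Hypothetical sketch: drain traffic from a replica by enabling Consul
# service maintenance mode, so service discovery stops returning it.
# The agent address and service ID are illustrative, not from the incident.
import requests

CONSUL_AGENT = "http://127.0.0.1:8500"

def set_maintenance(service_id: str, enable: bool, reason: str = "") -> None:
    """Enable or disable maintenance mode for a locally registered service."""
    resp = requests.put(
        f"{CONSUL_AGENT}/v1/agent/service/maintenance/{service_id}",
        params={"enable": str(enable).lower(), "reason": reason},
    )
    resp.raise_for_status()

if __name__ == "__main__":
    # Drain an unhealthy replica (illustrative service ID).
    set_maintenance("patroni-replica", True, reason="high iowait, draining traffic")
```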
Summary for CMOC notice / Exec summary:
- Customer Impact: All web users of GitLab.com saw 503 errors. Git operations failed.
- Service Impact: ~"Service::GCP" ~"Service::API" ~"Service::Patroni" ~"Service::Web" ~"Service::Gitaly" ~"Service::Websockets" ~"Service::Git"
- Impact Duration: 15:12 UTC - 15:39 UTC (27 minutes)
- Root cause: Power outage in the GCP datacenter where multiple Patroni and Gitaly servers were located
GCP Support Case Status
We are concerned at this time about potential data loss and have asked GCP to clarify whether it was possible that disk writes were acknowledged by the kernel but not written to the block device.
- 2022-02-01 05:33: GCP is investigating and will provide an update on 2022-02-01
- 2022-02-01 15:35: GCP is still investigating
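For context on the question we raised with GCP: a write acknowledged by the kernel has only reached the page cache; it is durable on the block device only after fsync()/fdatasync() completes. A minimal illustration of that distinction (unrelated to the affected nodes' actual workload):

```python
# Illustrative only: a successful os.write() means the kernel accepted the
# data into the page cache, not that it reached the block device. If the
# device or host loses power before fsync() returns, the write can be lost
# even though the application saw it "acknowledged".
import os

fd = os.open("/tmp/example.dat", os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"acknowledged by the kernel, not yet durable\n")
    # Only after fsync() returns is the data expected to be on stable storage.
    os.fsync(fd)
finally:
    os.close(fd)
```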
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2022-01-31
- 15:08 - Received an alert that GitLab.com was down; internal users also reported 503 errors on GitLab.com
- 15:20 - We identified two Patroni nodes that were under high load and placed them in maintenance mode
- 15:22 - Status page updated
- 15:30 - Site recovered after removing these two Patroni replicas from the rotation
- 15:32 - Support ticket opened with GCP
- 15:51 - @igorwwwwwwwwwwwwwwwwwwww re-declares the incident in Slack
- 16:00 - patroni-v12-04 was powered off and powered back on
- 16:12 - patroni-v12-04 was taken out of maintenance mode and started receiving traffic
- 17:05 - Our cloud provider identified an issue on their end and has mitigated it
- 18:11 - patroni-v12-07 was taken out of maintenance mode and started receiving traffic
- 19:00 - Prevented Patroni from choosing any of the 3 affected nodes (patroni-v12-01, -04, -07) as failover candidates. This is a precaution until we determine whether there is block-level data loss on those nodes. Chef is temporarily disabled on those nodes.
2022-02-01
- 06:00 - We have decided to mark this incident as resolved while we wait for additional information from GCP; if we believe there is any user impact from data loss, a new incident will be opened
- 07:37 - Status page set to resolved
Takeaways
- ...
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Drop replicas from the load balancing pool when they are unhealthy &689 (see the sketch after this list)
- Consider synchronously engaging the GCP TAM when symptoms seem related to a GCP outage. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15131
- Give all incident managers access to GCP ticket system. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15132
- Investigate the Consul service discovery of replicas when replicas have failed (rejection of bad replicas). https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15098
- Consider investigating different node behaviour with the same symptoms (01 vs 07). https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15095
- Investigate automated responses to node lockup behaviour. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15098
- Create an easy way to create an initial incident on the Status Page. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15116
- Async communication during S1 incidents https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15256
- dashboards.gitlab.net (Grafana) unresponsive during S1 incidents https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15257
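For the first corrective action above, one common pattern is an HTTP health probe the load balancer can poll so that unhealthy or unresponsive replicas drop out of the pool automatically. A minimal sketch, assuming Patroni's REST API on its default port; the hostnames and wiring are illustrative, not the production setup:

```python
# Hypothetical sketch: a health probe a load balancer could call to decide
# whether a replica should stay in the pool. Assumes Patroni's REST API is
# reachable on each node; hostnames and port are illustrative defaults.
import requests

def replica_is_healthy(host: str, port: int = 8008, timeout: float = 2.0) -> bool:
    """Return True if the node reports itself as a running replica."""
    try:
        # Patroni's /replica endpoint answers 200 only when the node is a
        # running replica; any error status or a timeout counts as unhealthy,
        # which also covers a node hung on iowait.
        resp = requests.get(f"http://{host}:{port}/replica", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    for node in ["patroni-v12-04.example.internal", "patroni-v12-07.example.internal"]:
        print(node, "healthy" if replica_is_healthy(node) else "drop from pool")
```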
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Both internal and external customers of GitLab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - GitLab.com web, api, https_git, ssh, and websockets components were unavailable and responded with 5XX errors to client requests.
- How many customers were affected?
  - All GitLab.com customers were affected by the outage.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
| Affected Backend | Availability during outage |
|---|---|
| api | 67.2% |
| canary_api | 40.8% |
| canary_https_git | 43.4% |
| canary_web | 30.6% |
| https_git | 50.4% |
| main_api | 64.3% |
| main_web | 59.4% |
| ssh | 85.4% |
| web | 61.3% |
| websockets | 46.9% |
- What were the root causes?
  - Power outage in the GCP datacenter where multiple Patroni and Gitaly servers were located
Incident Response Analysis
- How was the incident detected?
  - Monitoring/Alerts.
  - Internal user reports.
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - The team identified that high iowait was affecting several instances across different workloads within the same AZ on GCP and opened an incident with the cloud provider.
  - The cloud provider confirmed there was a related outage on their end and that it had been mitigated.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - We identified the affected Patroni nodes and set them to maintenance mode to drain their traffic.
- How could time to mitigation be improved?
  - Improved monitoring and automation for removing failed Patroni instances from receiving traffic.
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
- What went well?
  - ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)