2022-01-31 gitlab.com Site-wide outage
Incident DRI
Current Status
This incident is resolved. The root cause was determined to have been a power issue in a Google data center. Full details can be found here: https://gitlab.com/gitlab-com/gl-infra/production/uploads/36132a9050597f8ce711e214772c9b26/GCP_PD_Unhealthy_Devices_RCA_-31_Jan_2022-_omg-47701-is.pdf
From 15:12 UTC to 15:39 UTC, all users of GitLab.com experienced errors. The root cause was a GCP power outage that affected a server floor in the datacenter where two Patroni replicas and three Gitaly nodes were located.
We investigated the impact in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15095
Three Patroni replica nodes as well as three Gitaly nodes experienced increased load caused by high iowait values. All affected nodes were in the us-east-1-c AZ. Two of these nodes, patroni-v12-0{4,7}, were showing increased error counts as well as PostgreSQL processes hanging due to iowait, so we decided to set them to maintenance mode and drain their traffic, which helped recover the site.
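Setting a replica to maintenance mode here means taking it out of service discovery so clients stop routing traffic to it. A minimal sketch of that kind of drain, assuming Consul's agent HTTP API and an illustrative service ID (not the exact procedure used during the incident):

```python
# Hypothetical sketch: drain traffic from a replica by enabling Consul
# service maintenance mode, so service discovery stops returning it.
# The agent address and service ID are illustrative, not from the incident.
import requests

CONSUL_AGENT = "http://127.0.0.1:8500"

def set_maintenance(service_id: str, enable: bool, reason: str = "") -> None:
    """Enable or disable maintenance mode for a locally registered service."""
    resp = requests.put(
        f"{CONSUL_AGENT}/v1/agent/service/maintenance/{service_id}",
        params={"enable": str(enable).lower(), "reason": reason},
    )
    resp.raise_for_status()

if __name__ == "__main__":
    # Drain an unhealthy replica (illustrative service ID).
    set_maintenance("patroni-replica", True, reason="high iowait, draining traffic")
```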
Summary for CMOC notice / Exec summary:
- Customer Impact: All web users of GitLab.com saw 503 errors. Git operations failed.
- Service Impact: ~"Service::GCP" ~"Service::API" ~"Service::Patroni" ~"Service::Web" ~"Service::Gitaly" ~"Service::Websockets" ~"Service::Git"
- Impact Duration: 15:12 UTC - 15:39 UTC (27 minutes)
- Root cause: Power outage in the GCP datacenter where multiple Patroni and Gitaly servers were located
GCP Support Case Status
We are concerned at this time about potential data loss and have asked GCP to clarify whether it was possible that disk writes were acknowledged by the kernel but not written to the block device.
- 2022-02-01 05:33: GCP is investigating and will provide an update on 2022-02-01
- 2022-02-01 15:35: GCP is still investigating
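For context on the question we raised with GCP: a write acknowledged by the kernel has only reached the page cache; it is durable on the block device only after fsync()/fdatasync() completes. A minimal illustration of that distinction (unrelated to the affected nodes' actual workload):

```python
# Illustrative only: a successful os.write() means the kernel accepted the
# data into the page cache, not that it reached the block device. If the
# device or host loses power before fsync() returns, the write can be lost
# even though the application saw it "acknowledged".
import os

fd = os.open("/tmp/example.dat", os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"acknowledged by the kernel, not yet durable\n")
    # Only after fsync() returns is the data expected to be on stable storage.
    os.fsync(fd)
finally:
    os.close(fd)
```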
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2022-01-31
- 15:08 - Received an alert that GitLab.com was down; internal users also reported 503 errors on GitLab.com
- 15:20 - We identified two Patroni nodes that were under high load and placed them in maintenance mode
- 15:22 - Status page updated
- 15:30 - Site recovered after removing these two Patroni replicas from the rotation
- 15:32 - Support ticket opened with GCP
- 15:51 - @igorwwwwwwwwwwwwwwwwwwww re-declares the incident in Slack
- 16:00 - patroni-v12-04 was powered off and powered back on
- 16:12 - patroni-v12-04 was taken out of maintenance mode and started receiving traffic
- 17:05 - Our cloud provider identified an issue on their end and has mitigated it
- 18:11 - patroni-v12-07 was taken out of maintenance mode and started receiving traffic
- 19:00 - Prevented Patroni from choosing any of the 3 affected nodes (patroni-v12-01, -04, -07) as failover candidates. This is a precaution until we determine whether there is block-level data loss on those nodes. Chef is temporarily disabled on those nodes.
2022-02-01
- 06:00 - We have decided to mark this incident as resolved while we wait for additional information from GCP; if we believe there is any user impact from data loss, a new incident will be opened
- 07:37 - Status page set to resolved
Takeaways
- ...
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Drop replicas from the load balancing pool when they are unhealthy &689 (see the sketch after this list)
- Consider synchronously engaging the GCP TAM when symptoms seem related to a GCP outage. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15131
- Give all incident managers access to GCP ticket system. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15132
- Investigate the Consul service discovery of replicas when replicas have failed (rejection of bad replicas). https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15098
- Consider investigating different node behaviour with the same symptoms (01 vs 07). https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15095
- Investigate automated responses to node lockup behaviour. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15098
- Create an easy way to create an initial incident on the Status Page. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15116
- Async communication during S1 incidents https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15256
- dashboards.gitlab.net (Grafana) unresponsive during S1 incidents https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15257
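For the first corrective action above, one common pattern is an HTTP health probe the load balancer can poll so that unhealthy or unresponsive replicas drop out of the pool automatically. A minimal sketch, assuming Patroni's REST API on its default port; the hostnames and wiring are illustrative, not the production setup:

```python
# Hypothetical sketch: a health probe a load balancer could call to decide
# whether a replica should stay in the pool. Assumes Patroni's REST API is
# reachable on each node; hostnames and port are illustrative defaults.
import requests

def replica_is_healthy(host: str, port: int = 8008, timeout: float = 2.0) -> bool:
    """Return True if the node reports itself as a running replica."""
    try:
        # Patroni's /replica endpoint answers 200 only when the node is a
        # running replica; any error status or a timeout counts as unhealthy,
        # which also covers a node hung on iowait.
        resp = requests.get(f"http://{host}:{port}/replica", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    for node in ["patroni-v12-04.example.internal", "patroni-v12-07.example.internal"]:
        print(node, "healthy" if replica_is_healthy(node) else "drop from pool")
```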
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Both internal and external customers of GitLab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - GitLab.com web, api, https_git, ssh, and websockets components were unavailable and responded with 5XX errors to client requests.
- How many customers were affected?
  - All GitLab.com customers were affected by the outage.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
| Affected Backend | Availability during outage |
|---|---|
| api | 67.2% |
| canary_api | 40.8% |
| canary_https_git | 43.4% |
| canary_web | 30.6% |
| https_git | 50.4% |
| main_api | 64.3% |
| main_web | 59.4% |
| ssh | 85.4% |
| web | 61.3% |
| websockets | 46.9% |
- What were the root causes?
  - Power outage in the GCP datacenter where multiple Patroni and Gitaly servers were located
Incident Response Analysis
- How was the incident detected?
  - Monitoring/Alerts.
  - Internal user reports.
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - The team identified that high iowait was affecting several instances across different workloads within the same AZ on GCP and opened an incident with the cloud provider.
  - The cloud provider confirmed there was a related outage on their end and that it had been mitigated.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - We identified the affected Patroni nodes and set them to maintenance mode to drain their traffic.
- How could time to mitigation be improved?
  - Improved monitoring and automation for removing failed Patroni instances from receiving traffic.
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
- What went well?
  - ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)