Google Cloud Platform Issue - IAM Failures
Summary
On 08 APR 2020, Google Cloud Platform (GCP) experienced an incident with Cloud IAM that impacted multiple platform services. We were first alerted to the issue by an SLO violation alert in the #alerts-general Slack channel for our Registry service, which operates primarily via Google Cloud Storage.
We observed a drop in Application Performance Index (Apdex) scores across multiple services.
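For reference, Apdex gives full credit to satisfied requests and half credit to tolerating requests against the total, while the error ratio is simply the fraction of failed requests. The following is a minimal sketch of both calculations; the function names and sample counts are illustrative and not taken from our monitoring stack.

```python
def apdex(satisfied: int, tolerating: int, total: int) -> float:
    """Standard Apdex: satisfied requests count fully, tolerating count half."""
    return (satisfied + tolerating / 2) / total if total else 0.0

def error_ratio(errors: int, total: int) -> float:
    """Fraction of requests that returned errors (e.g. 5xx responses)."""
    return errors / total if total else 0.0

# Illustrative numbers only -- not actual incident data.
print(apdex(satisfied=9_000, tolerating=600, total=10_000))  # 0.93
print(error_ratio(errors=450, total=10_000))                 # 0.045
```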
Timeline
All times UTC.
2020-04-08
- 14:01 - Slackline alert fires: The registry service (main stage) has an error ratio exceeding SLO.
- 14:02 - Slackline alert fires: The pgbouncer service (main stage), pgbouncer_async_pool component, has saturation exceeding SLO and is close to its capacity limit.
- 14:03 - Slackline alert fires: The sidekiq service (main stage) has an Apdex score (latency) below SLO.
Many more Slackline alerts fire ...
- 14:10 - Incident declared from Slack
- 14:11 - Incident posted to Status.io
- 14:20 - Rackspace pinged in shared Slack to escalate to GCP support.
- 14:27 - @jarv found 503 errors from Google Storage API in artifact service logs. We are investigating whether this may be a quota issue.
- 14:30 - Status.io updated with the proper degraded/disrupted status for each individual component.
- 14:37 - Rackspace issue opened.
- 14:41 - Phone call placed to Rackspace.
- 14:42 - Recoveries are observed.
- 14:47 - status.cloud.google.com has been updated, reflecting an Infrastructure issue having started at 14:35: https://status.cloud.google.com/incident/zall/20005.
- 14:49 - Rackspace confirms the incident on the phone call, indicates they are aware it is ongoing, and is escalating "Infrastructure API issues" to their GCP contacts.
- 14:54 - @bjk-gitlab, @brendan, and @cmcfarland confirm that this is an IAM issue affecting API interactions with GCP. We discovered Terraform pipelines failing with error response codes similar to those we observed with GCS bucket interactions.
- 14:59 - Our Rackspace Account Executive replied in Slack and is updating our open issue with all available information.
- 15:03 - @AnthonySandoval engages the Support team CMOC via PagerDuty, asking for additional communications around the incident and pending recovery. (https://gitlab.pagerduty.com/incidents/P4BO5RM)
- 15:05 - @tristan joins the Zoom call and is brought up to speed.
- 15:05 - Rackspace TAM indicates that other customers are observing recoveries, too.
- 15:36 - Google status page shows the incident as resolved.
Details
For metrics, we use the time range 13:45 to 15:00 UTC, which adds roughly 25 minutes of pre- and post-incident data.
We recorded the following average availability for primary services over this period:
- git 100%
- api 100%
- web 73.33%
- sidekiq 46.43%
- registry 42.78%
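The report does not spell out how these per-service figures are derived; as a rough sketch, assuming availability is averaged from per-minute samples over the 75-minute window, the calculation would look like the following. The sample values and outage shape are hypothetical, not the actual incident data.

```python
# Hypothetical per-minute availability samples (1.0 = fully available) over the
# 75-minute window 13:45-15:00 UTC; the real figures come from our monitoring
# dashboards, not from this script.
samples = {
    "git":      [1.0] * 75,
    "registry": [1.0] * 16 + [0.0] * 41 + [1.0] * 18,  # illustrative outage shape
}

for service, minutes in samples.items():
    availability = sum(minutes) / len(minutes)
    print(f"{service}: {availability:.2%}")
```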
Source
Incident declared by marin in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was used, the recording will be uploaded automatically to the Incident room Google Drive folder (private).