Google Cloud Platform Issue - IAM Failures
Summary
On 08 APR 2020, Google Cloud Platform (GCP) experienced an incident with Cloud IAM that impacted multiple platform services. We were first alerted to the issue by an SLO violation alert in the #alerts-general Slack channel for our Registry service, which operates primarily via Google Cloud Storage.
We observed a drop in Application Performance Index (Apdex) scores across multiple services.
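For reference, Apdex gives full credit to satisfied requests and half credit to tolerating requests against the total, while the error ratio is simply the fraction of failed requests. The following is a minimal sketch of both calculations; the function names and sample counts are illustrative and not taken from our monitoring stack.

```python
def apdex(satisfied: int, tolerating: int, total: int) -> float:
    """Standard Apdex: satisfied requests count fully, tolerating count half."""
    return (satisfied + tolerating / 2) / total if total else 0.0

def error_ratio(errors: int, total: int) -> float:
    """Fraction of requests that returned errors (e.g. 5xx responses)."""
    return errors / total if total else 0.0

# Illustrative numbers only -- not actual incident data.
print(apdex(satisfied=9_000, tolerating=600, total=10_000))  # 0.93
print(error_ratio(errors=450, total=10_000))                 # 0.045
```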
Timeline
All times UTC.
2020-04-08
- 14:01 - Slackline alert fires: The registry service (main stage) has an error ratio exceeding SLO.
- 14:02 - Slackline alert fires: The pgbouncer service (main stage), pgbouncer_async_pool component, has saturation exceeding SLO and is close to its capacity limit.
- 14:03 - Slackline alert fires: The sidekiq service (main stage) has an Apdex score (latency) below SLO.
Many more Slackline alerts fire ...
- 14:10 - Incident declared from Slack
- 14:11 - Incident posted to Status.io
- 14:20 - Rackspace pinged in shared Slack to escalate to GCP support.
- 14:27 - @jarv found 503 errors from Google Storage API in artifact service logs. We are investigating whether this may be a quota issue.
- 14:30 - Status.io updated with the proper degraded/disrupted status for each individual component.
- 14:37 - Rackspace issue opened.
- 14:41 - Phone call placed to Rackspace.
- 14:42 - Recoveries are observed.
- 14:47 - status.cloud.google.com has been updated, reflecting an Infrastructure issue having started at 14:35: https://status.cloud.google.com/incident/zall/20005.
- 14:49 - Rackspace confirms the incident on the phone call, indicates they are aware it is ongoing, and is escalating "Infrastructure API issues" to their GCP contacts.
- 14:54 - @bjk-gitlab, @brendan, and @cmcfarland confirm that this is an IAM issue affecting API interactions with GCP. We discovered Terraform pipelines failing with error response codes similar to those we observed with GCS bucket interactions.
- 14:59 - Our Rackspace Account Executive replied in Slack and is updating our open issue with all available information.
- 15:03 - @AnthonySandoval engages the Support team CMOC via PagerDuty, asking for additional communications around the incident and pending recovery. (https://gitlab.pagerduty.com/incidents/P4BO5RM)
- 15:05 - @tristan joins the Zoom call and is brought up to speed.
- 15:05 - Rackspace TAM indicates that other customers are observing recoveries, too.
- 15:36 - Google status page shows the incident as resolved.
Details
For metrics, we use the time range 13:45 to 15:00 UTC, which adds roughly 25 minutes of pre- and post-incident data.
We recorded the following average availability for primary services over this period:
- git 100%
- api 100%
- web 73.33%
- sidekiq 46.43%
- registry 42.78%
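The report does not spell out how these per-service figures are derived; as a rough sketch, assuming availability is averaged from per-minute samples over the 75-minute window, the calculation would look like the following. The sample values and outage shape are hypothetical, not the actual incident data.

```python
# Hypothetical per-minute availability samples (1.0 = fully available) over the
# 75-minute window 13:45-15:00 UTC; the real figures come from our monitoring
# dashboards, not from this script.
samples = {
    "git":      [1.0] * 75,
    "registry": [1.0] * 16 + [0.0] * 41 + [1.0] * 18,  # illustrative outage shape
}

for service, minutes in samples.items():
    availability = sum(minutes) / len(minutes)
    print(f"{service}: {availability:.2%}")
```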
Source
Incident declared by marin in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was used, the recording will be uploaded automatically to the Incident room Google Drive folder (private).