2020-12-14 Google services are intermittently available
Summary
Due to widespread issues reported by our cloud provider, multiple GitLab services are significantly degraded. These include (but are not limited to) the following services:
- CI job logs and pipeline processing
- Mail processing
- GitLab Container Registry
- GitLab Issue attachments
RCA from Google: https://status.cloud.google.com/incident/zall/20013
The most relevant quote seems to be this:
Google Cloud Storage
Approximately 15% of requests to Google Cloud Storage (GCS) were impacted during the outage, specifically those using OAuth, HMAC or email authentication. After 2020-12-14 04:31 US/Pacific, the majority of impact was resolved, however, there was lingering impact, for <1% of clients that attempted to finalize resumable uploads that started during the window. These uploads were left in a non-resumable state; the error code GCS returned was retryable, but subsequent retries were unable to make progress, leaving these objects unfinalized.
It is safe to assume the 15% figure refers to the global number of requests to GCS. In almost all, if not all, cases, our usage of GCS relies on signed URLs, which were the requests predominantly affected by the outage. This explains why the impact on us was so severe.
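For illustration only (this is not GitLab's actual implementation), a minimal sketch of generating the kind of V4 signed URL involved, using the google-cloud-storage Python client; the bucket and object names are placeholders:

```python
# Illustrative sketch: generating a V4 signed URL for a GCS object. Requests
# made with such URLs are validated by Google's authentication path which,
# per the RCA above, is what failed during this incident.
from datetime import timedelta
from google.cloud import storage

client = storage.Client()
blob = client.bucket("example-artifacts-bucket").blob("ci/job-12345/trace.log")  # placeholders

signed_url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),  # short-lived by design
    method="GET",
)
# The consumer (runner, browser, registry, ...) then fetches signed_url directly,
# so a failure in Google's auth path surfaces as a failed asset download/upload.
print(signed_url)
```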
Timeline (merged with Google's RCA timeline)
All times UTC.
2020-12-14
- 11:43 - [Google] The quota for the account database was reduced, which prevented the Paxos leader from writing
- 12:07 - [GitLab] marin declares incident in Slack.
- 12:08 - [Google] The root cause and a potential fix were identified
- 12:22 - [Google] Quota enforcement was disabled in one datacenter
- 12:27 - [Google] The same mitigation was applied to all datacenters
- 12:31 - [Google] The majority of impact was resolved; however, there was lingering impact
- 12:33 - [Google] Error rates returned to normal levels
- 12:35 - [GitLab] CI traces are being shipped again, indicating the start of recovery on the GitLab side
Corrective Actions
- Edit or add a runbook for handling long-term GCS outages - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12140
Incident Review
Summary
For a period of 52 minutes (between 2020-12-14 11:43 UTC and 2020-12-14 12:35 UTC), GitLab.com was affected downstream by a Google Cloud Platform incident, which caused Google Cloud Platform services that rely on authentication to increasingly fail over time.
- Service(s) affected: Container Registry, CI Runners, GCP, Mailroom, API, GitLab Rails
- Team attribution: Reliability
- Time to detection: 24 minutes (Google's incident start to GitLab incident declaration)
- Minutes downtime or degradation: 52 minutes
Metrics
Registry service
https://dashboards.gitlab.net/dashboard/snapshot/Ie4mlDMywQ6xI876klh8bykGU230DzBP
Shared runners
https://dashboards.gitlab.net/dashboard/snapshot/q2NSm9wzwJ67Ljns4DyR29EFaEJKX8Fy
Sidekiq service
Redis memory consumption also grew because of the queued CI jobs: https://dashboards.gitlab.net/dashboard/snapshot/180D13wFGuHjJeiHkKAgC2JqTSVFKLec
CI runners have fully recovered: https://dashboards.gitlab.net/dashboard/snapshot/q2NSm9wzwJ67Ljns4DyR29EFaEJKX8Fy?orgId=1&from=1607942173396&to=1607952973396
Mailroom recovered: https://dashboards.gitlab.net/d/mailroom-main/mailroom-overview?orgId=1&from=1607936361768&to=1607953259999
Sidekiq recovered: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?viewPanel=80616&orgId=1&from=1607943827860&to=1607953010573&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=All
Customer Impact
1. Who was impacted by this incident? (i.e. external customers, internal customers)
   - External customers
   - Internal customers
2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
   - Any interactions with assets passing through Google Cloud Storage (Container Registry, CI traces, attachments) could fail randomly.
   - This was caused by an internal problem at Google, where they were unable to determine whether a request was authenticated or not.
   - The impact varied between requests. Some requests were successful (especially towards the beginning of the incident), whereas others were not (especially after some time had passed).
3. How many customers were affected?
   - Unknown. We expect every customer to have been hit by the symptoms.
4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
   - Unknown. See 2.
   - Google estimates a request failure rate of 15% to GCS. However, this seems to be a global percentage, including unauthenticated requests. In almost all, if not all, cases, our usage of GCS relies on signed URLs, which were the requests predominantly affected by the outage, so our failure rate will have been much higher.
What were the root causes?
- A Google-internal issue: a reduced quota on their account database prevented writes, which in turn caused requests that rely on authentication (including GCS signed URL requests) to fail. See Google's RCA linked at the top.
Incident Response Analysis
1. How was the incident detected?
   - The site's functionality was impaired, which prompted an incident declaration.
2. How could detection time be improved?
   - Improve monitoring of the availability of managed services.
3. How was the root cause diagnosed?
   - See Google's RCA linked at the top.
4. How could time to diagnosis be improved?
   - This was an external dependency; we were not able to diagnose it ourselves.
5. How did we reach the point where we knew how to mitigate the impact?
   - We could not mitigate, as this was an external dependency without a fallback plan.
6. How could time to mitigation be improved?
   - N/A
7. What went well?
   - During our investigation, we had word-of-mouth confirmation that this was indeed a Google issue.
Post Incident Analysis
1. Did we have other events in the past with the same root cause?
2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
   - No
3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
   - Not on the GitLab side.
Lessons Learned
- We should establish a communication channel to Google that does not rely on Google services in case of an emergency. We were unable to use the GCP ticket system, as authentication was not working.
- We should incorporate sanity checks for managed services. We do have a metric for the Container Registry that maps to GCS almost 1:1, but we do not (to my knowledge) have metrics that monitor our request success rate to the managed services themselves. A minimal canary sketch is shown after this list.
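As an illustration of the second lesson, here is a minimal sketch of a signed-URL canary that exports a GCS request success counter for alerting. It assumes a small dedicated canary object and the google-cloud-storage, requests, and prometheus_client Python libraries; the bucket/object names, port, and metric name are placeholders, not existing GitLab infrastructure.

```python
# Hypothetical canary, not existing GitLab tooling: periodically fetch a small
# object from GCS via a V4 signed URL (the access pattern described above) and
# export success/failure counters so a request success rate can be alerted on.
import time
from datetime import timedelta

import requests
from google.cloud import storage
from prometheus_client import Counter, start_http_server

GCS_CANARY_REQUESTS = Counter(
    "gcs_canary_requests_total",            # placeholder metric name
    "Signed-URL canary requests to GCS, by result.",
    ["result"],
)

def probe(bucket_name: str, object_name: str) -> None:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    url = blob.generate_signed_url(version="v4",
                                   expiration=timedelta(minutes=5),
                                   method="GET")
    try:
        resp = requests.get(url, timeout=10)
        GCS_CANARY_REQUESTS.labels(result="success" if resp.ok else "failure").inc()
    except requests.RequestException:
        GCS_CANARY_REQUESTS.labels(result="failure").inc()

if __name__ == "__main__":
    start_http_server(9102)                            # expose /metrics for Prometheus scraping
    while True:
        probe("example-canary-bucket", "canary.txt")   # placeholder bucket/object
        time.sleep(30)
```

An alert on the failure ratio of such a metric would have flagged the degraded signed-URL path independently of any individual GitLab service.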
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)