2020-12-14 Google services are intermittently available
Summary
Due to widespread issues reported by our cloud provider, multiple GitLab services are significantly degraded. These include (but are not limited to) the following services:
- CI job logs and pipeline processing
- Mail processing
- GitLab Container Registry
- GitLab Issue attachments
RCA from Google: https://status.cloud.google.com/incident/zall/20013
The most relevant quote seems to be this:
Google Cloud Storage
Approximately 15% of requests to Google Cloud Storage (GCS) were impacted during the outage, specifically those using OAuth, HMAC or email authentication. After 2020-12-14 04:31 US/Pacific, the majority of impact was resolved, however, there was lingering impact, for <1% of clients that attempted to finalize resumable uploads that started during the window. These uploads were left in a non-resumable state; the error code GCS returned was retryable, but subsequent retries were unable to make progress, leaving these objects unfinalized.
It is safe to assume the 15% figure refers to the global number of requests to GCS. In almost all, if not all, cases, our usage of GCS relies on signed URLs, which were the requests predominantly affected by the outage. This explains why the impact on us was so severe.
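For illustration only (this is not GitLab's actual implementation), a minimal sketch of generating the kind of V4 signed URL involved, using the google-cloud-storage Python client; the bucket and object names are placeholders:

```python
# Illustrative sketch: generating a V4 signed URL for a GCS object. Requests
# made with such URLs are validated by Google's authentication path which,
# per the RCA above, is what failed during this incident.
from datetime import timedelta
from google.cloud import storage

client = storage.Client()
blob = client.bucket("example-artifacts-bucket").blob("ci/job-12345/trace.log")  # placeholders

signed_url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),  # short-lived by design
    method="GET",
)
# The consumer (runner, browser, registry, ...) then fetches signed_url directly,
# so a failure in Google's auth path surfaces as a failed asset download/upload.
print(signed_url)
```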
Timeline (merged with Google's RCA timeline)
All times UTC.
2020-12-14
- 11:43 - [Google] The quota for the account database was reduced, which prevented the Paxos leader from writing
- 12:07 - [GitLab] marin declares incident in Slack.
- 12:08 - [Google] The root cause and a potential fix were identified
- 12:22 - [Google] Quota enforcement was disabled in one datacenter
- 12:27 - [Google] The same mitigation was applied to all datacenters
- 12:31 - [Google] The majority of impact was resolved; however, there was lingering impact
- 12:33 - [Google] Error rates returned to normal levels
- 12:35 - [GitLab] CI traces are being shipped again, indicating the start of recovery on the GitLab side
Corrective Actions
- Edit or add a runbook for handling long-term GCS outages - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12140
Incident Review
Summary
For a period of 52 minutes (between 2020-12-14 11:43 UTC and 2020-12-14 12:35 UTC), GitLab.com was affected downstream by a Google Cloud Platform incident, which caused Google Cloud Platform services that rely on authentication to increasingly fail over time.
- Service(s) affected: Container Registry, CI Runners, GCP, Mailroom, API, GitLab Rails
- Team attribution: Reliability
- Time to detection: 24 minutes (Google's incident start to GitLab incident declaration)
- Minutes downtime or degradation: 52 minutes
Metrics
Registry service
https://dashboards.gitlab.net/dashboard/snapshot/Ie4mlDMywQ6xI876klh8bykGU230DzBP
Shared runners
https://dashboards.gitlab.net/dashboard/snapshot/q2NSm9wzwJ67Ljns4DyR29EFaEJKX8Fy
Sidekiq service
Redis memory consumption also grew because of the queued CI jobs: https://dashboards.gitlab.net/dashboard/snapshot/180D13wFGuHjJeiHkKAgC2JqTSVFKLec
CI runners have fully recovered: https://dashboards.gitlab.net/dashboard/snapshot/q2NSm9wzwJ67Ljns4DyR29EFaEJKX8Fy?orgId=1&from=1607942173396&to=1607952973396
Mailroom recovered: https://dashboards.gitlab.net/d/mailroom-main/mailroom-overview?orgId=1&from=1607936361768&to=1607953259999
Sidekiq recovered: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?viewPanel=80616&orgId=1&from=1607943827860&to=1607953010573&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=All
Customer Impact
1. Who was impacted by this incident? (i.e. external customers, internal customers)
   - External customers
   - Internal customers
2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
   - Any interactions with assets passing through Google Cloud Storage (Container Registry, CI traces, attachments) could fail randomly.
   - This was caused by an internal problem at Google, where they were unable to determine whether a request was authenticated or not.
   - The impact varied between requests. Some requests were successful (especially towards the beginning of the incident), whereas others were not (especially after some time had passed).
3. How many customers were affected?
   - Unknown. We expect every customer to have been hit by the symptoms.
4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
   - Unknown. See 2.
   - Google estimates a request failure rate of 15% to GCS. However, this seems to be a global percentage, including unauthenticated requests. In almost all, if not all, cases, our usage of GCS relies on signed URLs, which were the requests predominantly affected by the outage, so our failure rate will have been much higher.
What were the root causes?
- A Google-internal issue: a reduced quota on their account database prevented writes, which in turn caused requests that rely on authentication (including GCS signed URL requests) to fail. See Google's RCA linked at the top.
Incident Response Analysis
1. How was the incident detected?
   - The site's functionality was impaired, which prompted an incident declaration.
2. How could detection time be improved?
   - Improve monitoring of the availability of managed services.
3. How was the root cause diagnosed?
   - See Google's RCA linked at the top.
4. How could time to diagnosis be improved?
   - This was an external dependency; we were not able to diagnose it ourselves.
5. How did we reach the point where we knew how to mitigate the impact?
   - We could not mitigate, as this was an external dependency without a fallback plan.
6. How could time to mitigation be improved?
   - N/A
7. What went well?
   - During our investigation, we had word-of-mouth confirmation that this was indeed a Google issue.
Post Incident Analysis
1. Did we have other events in the past with the same root cause?
2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
   - No
3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
   - Not on the GitLab side.
Lessons Learned
- We should establish a communication channel to Google that does not rely on Google services in case of an emergency. We were unable to use the GCP ticket system, as authentication was not working.
- We should incorporate sanity checks for managed services. We do have a metric for the Container Registry that maps to GCS almost 1:1, but we do not (to my knowledge) have metrics that monitor our request success rate to the managed services themselves. A minimal canary sketch is shown after this list.
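As an illustration of the second lesson, here is a minimal sketch of a signed-URL canary that exports a GCS request success counter for alerting. It assumes a small dedicated canary object and the google-cloud-storage, requests, and prometheus_client Python libraries; the bucket/object names, port, and metric name are placeholders, not existing GitLab infrastructure.

```python
# Hypothetical canary, not existing GitLab tooling: periodically fetch a small
# object from GCS via a V4 signed URL (the access pattern described above) and
# export success/failure counters so a request success rate can be alerted on.
import time
from datetime import timedelta

import requests
from google.cloud import storage
from prometheus_client import Counter, start_http_server

GCS_CANARY_REQUESTS = Counter(
    "gcs_canary_requests_total",            # placeholder metric name
    "Signed-URL canary requests to GCS, by result.",
    ["result"],
)

def probe(bucket_name: str, object_name: str) -> None:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    url = blob.generate_signed_url(version="v4",
                                   expiration=timedelta(minutes=5),
                                   method="GET")
    try:
        resp = requests.get(url, timeout=10)
        GCS_CANARY_REQUESTS.labels(result="success" if resp.ok else "failure").inc()
    except requests.RequestException:
        GCS_CANARY_REQUESTS.labels(result="failure").inc()

if __name__ == "__main__":
    start_http_server(9102)                            # expose /metrics for Prometheus scraping
    while True:
        probe("example-canary-bucket", "canary.txt")   # placeholder bucket/object
        time.sleep(30)
```

An alert on the failure ratio of such a metric would have flagged the degraded signed-URL path independently of any individual GitLab service.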
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)