Incident Review: Google Cloud outage causing multiple issues
INC-1799: Google Cloud outage causing multiple issues
Generated by Kam Kyrala on 12 Jun 2025 22:55. All timestamps are in UTC (Etc/UTC).
Key Information
Metric | Value |
---|---|
Customers Affected | All users of Registry, CI/CD runners, Google Cloud Storage dependent features, GitLab Pages, and Web services |
Requests Affected | Multiple services experienced complete outages or severe degradation |
Incident Severity | Severity 1 |
Impact Start Time | Thu, 12 Jun 2025 17:50:00 UTC |
Impact End Time | Thu, 12 Jun 2025 20:46:00 UTC |
Total Duration | 2 hours, 56 minutes |
Link to Incident Issue | https://app.incident.io/gitlab/incidents/1799 |
Link to GCP Incident Issue | https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW |
Summary
Problem: An incorrect change to Google Cloud Platform's API endpoints triggered a crash loop across their global infrastructure, causing a global outage for any customers with dependencies on the affected services.
Impact: Critical GitLab services became unavailable or severely degraded, including:
- Git operations (elevated error rates)
- GitLab Pages (complete outage)
- Container Registry (high error rates, CDN failures)
- CI/CD runners (job processing delays, backlogs)
- Google Cloud Storage dependent features (LFS, artifacts, traces, monitoring services dependent on storage)
- Web services (image scaling issues)
Root Cause: Google Cloud experienced a global outage stemming from an incorrect change to their API endpoints, which caused a crash loop and affected their global infrastructure across all services.
Detection: Multiple monitoring alerts fired simultaneously at 18:06 UTC, with the EOC (Engineer On Call) receiving 17+ pages within minutes.
Resolution: Google Cloud engineers identified and mitigated the issue. Services began recovering around 19:55 UTC, with full recovery achieved by 20:46 UTC.
What went well?
- Rapid incident detection and classification: Our monitoring systems immediately detected the widespread impact across multiple services, allowing quick escalation to Severity 1.
- Effective incident command structure: The incident response team quickly assembled, with clear roles (Sarah Walker as Incident Lead, Cameron McFarland providing technical analysis, Alex Ives managing escalations, and Caleb Williamson on communications).
- Quick root cause identification: The team rapidly identified the issue as a Google Cloud problem by correlating failures across multiple services that shared Google Cloud Storage as a dependency (see the correlation sketch after this list).
- Proactive communication: Despite the handbook being down, the team found alternative ways to communicate, including using the incident.io platform and Slack channels.
- Cross-team collaboration: Multiple teams (SRE, Infrastructure, Registry, Runners) joined quickly to assess the impact on their services.
- Monitoring proved effective: Our comprehensive monitoring caught all affected services and provided visibility into the scope of the impact.
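The correlation step in the third item above can also be made mechanical. The sketch below is illustrative only and not our actual tooling: the service names, upstream names, and dependency map are placeholders, and in practice the map would come from a service catalogue rather than a hard-coded dict.

```python
from collections import Counter

# Hypothetical dependency map: which upstream providers each service relies on.
# In reality this would come from a service catalogue, not a hard-coded dict.
DEPENDENCIES = {
    "pages": {"gcs", "cloud-dns"},
    "registry": {"gcs", "cloud-cdn"},
    "ci-runners": {"gce", "gcs"},
    "git-lfs": {"gcs"},
    "web-image-scaler": {"gcs"},
}

def common_upstreams(firing_services):
    """Count how many of the firing services share each upstream dependency.

    One upstream appearing in most or all firing services is a strong hint
    that the incident is a provider problem rather than several unrelated
    service regressions.
    """
    counts = Counter()
    for service in firing_services:
        counts.update(DEPENDENCIES.get(service, set()))
    return counts.most_common()

if __name__ == "__main__":
    # Illustrative subset of the services that paged at 18:06 UTC.
    firing = ["pages", "registry", "ci-runners", "git-lfs", "web-image-scaler"]
    for upstream, hits in common_upstreams(firing):
        print(f"{upstream:16s} shared by {hits}/{len(firing)} firing services")
```

Run against the set of services that paged in this incident, a check like this would surface Google Cloud Storage as the dependency shared by every firing service.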
What was difficult?
- Unable to engage Google Cloud support normally: The Google Cloud Console was non-functional, preventing the team from opening a support case through standard channels. The team had to rely on Slack channels and email.
- Delayed vendor acknowledgment: It took approximately 20 minutes for Google to acknowledge the issue after initial contact, and their public status page remained green for an extended period despite widespread issues.
- Cascading failures complicated troubleshooting: With Google Cloud DNS, IAM, and Storage all affected, it was challenging to determine the full scope and which services to prioritize.
- Incomplete metrics during the incident: Our own metrics became spotty because we store metrics data in Google Cloud Storage, making it harder to assess the full impact in real time (see the out-of-band probe sketch after this list).
- Lack of alternative escalation paths: When the handbook was inaccessible, the team struggled to remember the exact escalation procedures and contact information (e.g., the organization ID for Google).
- Limited mitigation options: Since all three zones were equally affected and the issue was global, the team could not implement typical DR strategies such as traffic steering.
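One mitigation for the metrics gap described above is a basic availability probe that runs entirely outside GCP (for example from a laptop or a third-party host), so coarse impact data survives even when the in-cloud metrics pipeline does not. This is a minimal sketch under assumptions: the endpoint list is illustrative, not an authoritative map of our public surface.

```python
import time
import urllib.error
import urllib.request

# Illustrative endpoints mapping to the services discussed in this review;
# replace with the real public URLs before relying on this.
ENDPOINTS = {
    "web": "https://gitlab.com/-/health",
    "pages": "https://docs.gitlab.com/",
    "registry": "https://registry.gitlab.com/v2/",
    "gcs": "https://storage.googleapis.com/",
}

def probe(name, url, timeout=5.0):
    """Return a one-line status string for a single endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # a 401/404 still proves the edge is serving
    except Exception as exc:
        return f"{name:10s} DOWN ({exc})"
    elapsed_ms = (time.monotonic() - start) * 1000
    return f"{name:10s} HTTP {status}  {elapsed_ms:.0f} ms"

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(probe(name, url))
```

Because it uses only the standard library, a script like this can be dropped onto any machine with Python during an incident.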
Investigation Details
Timeline
18:06 - Incident opened by Sarah Walker as Severity 2
- Multiple alerts firing across services
- Initial investigation began
18:09 - CMOC/IMOC thread established
18:16 - Escalated to Severity 1
- Recognized widespread impact across multiple critical services
- All three operational zones equally affected
18:25 - Cameron McFarland identified Google Cloud incident page
- Difficulty accessing GCP console due to the outage itself
18:32 - Alex Ives reached out via ext-google-cloud Slack channel
- Standard support channels unavailable
18:35 - Google Cloud DNS identified as having widespread issues
- Sarah unable to load GCP support page
18:44 - First signs of recovery: GitLab Pages sites returned
18:47 - First comprehensive status update posted
- Confirmed impact across all three zones
- Listed affected services
- Noted attempts to reach Google support
18:49 - Google Cloud status page finally acknowledged the incident
18:55 - Services beginning to show signs of recovery
- Handbook and Pages accessible again
19:05 - Cloudflare also reported outages (no impact to GitLab observed)
19:11 - Cloudflare traffic verified as normal via dashboards
20:00 - Major update: Root cause identified by Google
- Recovery in all locations except us-central1
- CI job backlog being processed
- Registry CDN still showing elevated errors
20:07 - 60% confidence in recovery achieved
- Container registry errors not impacting service
20:20 - Google Cloud status update: us-central1 still affected
20:24 - Root cause confirmed by Google TAM
- "Incorrect change to API endpoints caused crash loop"
- Affected global infrastructure
20:47 - Most services recovered
- CI runners improved
- Registry errors dropped below SLA
20:48 - Incident moved to Monitoring status
20:49 - Continued monitoring of AI Gateway and runner metrics
22:42 - Incident resolved
- All services confirmed recovered
- Post-incident review process initiated
Technical Details
Google Cloud confirmed the root cause as "an incorrect change to our API endpoints, which caused a crash loop and affected our global infrastructure, impacting all services."
Affected Google Cloud Services
The incident affected multiple GCP services across all regions:
- Google Cloud DNS
- Google Cloud Storage
- Google Compute Engine
- Identity and Access Management (IAM)
- Cloud CDN
- Load Balancers
- 40+ other services (full list documented in incident notes)
GitLab Service Impact Analysis
- GitLab Pages: Complete outage due to dependency on Google Cloud Storage buckets
- Container Registry: CDN errors and elevated error rates when accessing GCS buckets
- CI/CD: Job processing delays, artifact upload failures, runner provisioning issues
- Git operations: LFS operations impacted due to object storage dependency
- Web services: Image scaler failures due to inability to access stored images (a bucket-level check sketch follows this list)
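Since every impact above routes through the same object-storage dependency, one quick provider-side check is to request bucket metadata from the GCS JSON API and watch for timeouts or 5xx responses; an unauthenticated 401/403 still shows the API itself is reachable. The sketch below is illustrative, and the bucket names are placeholders rather than our real bucket names.

```python
import urllib.error
import urllib.request

# Placeholder feature -> bucket mapping; substitute the real bucket names.
FEATURE_BUCKETS = {
    "pages": "example-gitlab-pages",
    "registry": "example-gitlab-registry",
    "ci-artifacts": "example-gitlab-artifacts",
    "lfs": "example-gitlab-lfs",
}

GCS_BUCKET_API = "https://storage.googleapis.com/storage/v1/b/{bucket}"

def check_bucket(bucket, timeout=5.0):
    """Classify a GCS bucket-metadata request as reachable, denied, or failing."""
    url = GCS_BUCKET_API.format(bucket=bucket)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status} (reachable)"
    except urllib.error.HTTPError as exc:
        if exc.code in (401, 403, 404):
            return f"HTTP {exc.code} (API reachable; auth or name issue)"
        return f"HTTP {exc.code} (provider-side error)"
    except Exception as exc:
        return f"unreachable ({exc})"

if __name__ == "__main__":
    for feature, bucket in FEATURE_BUCKETS.items():
        print(f"{feature:14s} {check_bucket(bucket)}")
```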
Lessons Learned
- Cloud provider dependencies: This incident highlighted our deep dependency on Google Cloud Platform. While multi-zone deployment provides resilience against zonal failures, global cloud provider issues can still cause widespread impact.
- Importance of offline documentation: Critical escalation procedures and contact information should be available offline or in multiple locations to ensure accessibility during outages (a mirroring sketch follows this list).
- Vendor communication channels: We need established emergency communication channels with cloud providers that do not depend on the provider's own infrastructure.
- Monitoring dependencies: Our monitoring infrastructure depends on the same cloud provider we are monitoring, which can create blind spots during major outages.
- Blameless culture in action: The team's response demonstrated excellent blameless practices, focusing on understanding the problem and coordinating the response rather than on attribution.
- Proactive internal communication: Establishing clear criteria for when an outage should trigger company-wide internal communications would reduce the operational overhead on the response team.
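One concrete follow-up to the offline-documentation lesson is a scheduled job that mirrors the escalation-critical handbook pages to a location that does not share the handbook's hosting. The sketch below is an illustration only: the page URLs and target directory are placeholders, and the real job would need to run regularly on independent infrastructure.

```python
import pathlib
import urllib.request

# Placeholder list of escalation-critical pages; replace with the real set.
CRITICAL_PAGES = [
    "https://handbook.gitlab.com/handbook/engineering/infrastructure/incident-management/",
    "https://handbook.gitlab.com/handbook/engineering/infrastructure/",
]

MIRROR_DIR = pathlib.Path("offline-runbooks")

def mirror(url, timeout=10.0):
    """Download one page and write it under MIRROR_DIR, returning the file path."""
    MIRROR_DIR.mkdir(exist_ok=True)
    # Derive a stable filename from the final URL path segment.
    name = url.rstrip("/").rsplit("/", 1)[-1] + ".html"
    target = MIRROR_DIR / name
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        target.write_bytes(resp.read())
    return target

if __name__ == "__main__":
    for page in CRITICAL_PAGES:
        print("mirrored", mirror(page))
```

Pairing the mirror with a locally stored contact sheet, including the Google organization ID mentioned earlier, would cover the escalation gap described in the "What was difficult?" section.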
Review Guidelines
This review has been completed following our blameless postmortem culture, focusing on improving our systems and processes rather than individual actions. The incident revealed several areas for infrastructure and process improvements that will help us respond more effectively to similar incidents in the future.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields.
- Create a few short sentences in the Summary section summarizing what happened (TL;DR).
- Link any corrective actions and describe any other actions or outcomes from the incident.
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section.
- If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.
- Close the review before the due date.
- Go back to the incident channel or page and close out the remaining post-incident tasks.