Incident Review: Google Cloud outage causing multiple issues
INC-1799: Google Cloud outage causing multiple issues
Generated by Kam Kyrala on 12 Jun 2025 22:55. All timestamps are in UTC (Etc/UTC).
Key Information
Metric | Value |
---|---|
Customers Affected | All users of Registry, CI/CD runners, Google Cloud Storage dependent features, GitLab Pages, and Web services |
Requests Affected | Multiple services experienced complete outages or severe degradation |
Incident Severity | Severity 1 |
Impact Start Time | Thu, 12 Jun 2025 17:50:00 UTC |
Impact End Time | Thu, 12 Jun 2025 20:46:00 UTC |
Total Duration | 2 hours, 56 minutes |
Link to Incident Issue | https://app.incident.io/gitlab/incidents/1799 |
Link to GCP Incident Issue | https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW |
Summary
Problem: An incorrect change to Google Cloud Platform's API endpoints triggered a crash loop across their global infrastructure, causing a global outage for any customers with dependencies on the affected services.
Impact: Critical GitLab services became unavailable or severely degraded, including:
- Git operations (elevated error rates)
- GitLab Pages (complete outage)
- Container Registry (high error rates, CDN failures)
- CI/CD runners (job processing delays, backlogs)
- Google Cloud Storage dependent features (LFS, artifacts, traces, monitoring services dependent on storage)
- Web services (image scaling issues)
Root Cause: Google Cloud experienced a global outage stemming from an incorrect change to their API endpoints, which caused a crash loop and affected their global infrastructure across all services.
Detection: Multiple monitoring alerts fired simultaneously at 18:06 UTC, with the EOC (Engineer On Call) receiving 17+ pages within minutes.
Resolution: Google Cloud engineers identified and mitigated the issue. Services began recovering around 19:55 UTC, with full recovery achieved by 20:46 UTC.
What went well?
- Rapid incident detection and classification: Our monitoring systems immediately detected the widespread impact across multiple services, allowing quick escalation to Severity 1.
- Effective incident command structure: The incident response team quickly assembled, with clear roles (Sarah Walker as Incident Lead, Cameron McFarland providing technical analysis, Alex Ives managing escalations, and Caleb Williamson on communications).
- Quick root cause identification: The team rapidly identified the issue as a Google Cloud problem by correlating failures across multiple services that shared Google Cloud Storage as a dependency (see the correlation sketch after this list).
- Proactive communication: Despite the handbook being down, the team found alternative ways to communicate, including using the incident.io platform and Slack channels.
- Cross-team collaboration: Multiple teams (SRE, Infrastructure, Registry, Runners) joined quickly to assess the impact on their services.
- Monitoring proved effective: Our comprehensive monitoring caught all affected services and provided visibility into the scope of the impact.
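The correlation step in the third item above can also be made mechanical. The sketch below is illustrative only and not our actual tooling: the service names, upstream names, and dependency map are placeholders, and in practice the map would come from a service catalogue rather than a hard-coded dict.

```python
from collections import Counter

# Hypothetical dependency map: which upstream providers each service relies on.
# In reality this would come from a service catalogue, not a hard-coded dict.
DEPENDENCIES = {
    "pages": {"gcs", "cloud-dns"},
    "registry": {"gcs", "cloud-cdn"},
    "ci-runners": {"gce", "gcs"},
    "git-lfs": {"gcs"},
    "web-image-scaler": {"gcs"},
}

def common_upstreams(firing_services):
    """Count how many of the firing services share each upstream dependency.

    One upstream appearing in most or all firing services is a strong hint
    that the incident is a provider problem rather than several unrelated
    service regressions.
    """
    counts = Counter()
    for service in firing_services:
        counts.update(DEPENDENCIES.get(service, set()))
    return counts.most_common()

if __name__ == "__main__":
    # Illustrative subset of the services that paged at 18:06 UTC.
    firing = ["pages", "registry", "ci-runners", "git-lfs", "web-image-scaler"]
    for upstream, hits in common_upstreams(firing):
        print(f"{upstream:16s} shared by {hits}/{len(firing)} firing services")
```

Run against the set of services that paged in this incident, a check like this would surface Google Cloud Storage as the dependency shared by every firing service.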
What was difficult?
- Unable to engage Google Cloud support normally: The Google Cloud Console was non-functional, preventing the team from opening a support case through standard channels. The team had to rely on Slack channels and email.
- Delayed vendor acknowledgment: It took approximately 20 minutes for Google to acknowledge the issue after initial contact, and their public status page remained green for an extended period despite widespread issues.
- Cascading failures complicated troubleshooting: With Google Cloud DNS, IAM, and Storage all affected, it was challenging to determine the full scope and which services to prioritize.
- Incomplete metrics during the incident: Our own metrics became spotty because we store metrics data in Google Cloud Storage, making it harder to assess the full impact in real time (see the out-of-band probe sketch after this list).
- Lack of alternative escalation paths: When the handbook was inaccessible, the team struggled to remember the exact escalation procedures and contact information (e.g., the organization ID for Google).
- Limited mitigation options: Since all three zones were equally affected and the issue was global, the team could not implement typical DR strategies such as traffic steering.
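One mitigation for the metrics gap described above is a basic availability probe that runs entirely outside GCP (for example from a laptop or a third-party host), so coarse impact data survives even when the in-cloud metrics pipeline does not. This is a minimal sketch under assumptions: the endpoint list is illustrative, not an authoritative map of our public surface.

```python
import time
import urllib.error
import urllib.request

# Illustrative endpoints mapping to the services discussed in this review;
# replace with the real public URLs before relying on this.
ENDPOINTS = {
    "web": "https://gitlab.com/-/health",
    "pages": "https://docs.gitlab.com/",
    "registry": "https://registry.gitlab.com/v2/",
    "gcs": "https://storage.googleapis.com/",
}

def probe(name, url, timeout=5.0):
    """Return a one-line status string for a single endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # a 401/404 still proves the edge is serving
    except Exception as exc:
        return f"{name:10s} DOWN ({exc})"
    elapsed_ms = (time.monotonic() - start) * 1000
    return f"{name:10s} HTTP {status}  {elapsed_ms:.0f} ms"

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(probe(name, url))
```

Because it uses only the standard library, a script like this can be dropped onto any machine with Python during an incident.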
Investigation Details
Timeline
18:06 - Incident opened by Sarah Walker as Severity 2
- Multiple alerts firing across services
- Initial investigation began
18:09 - CMOC/IMOC thread established
18:16 - Escalated to Severity 1
- Recognized widespread impact across multiple critical services
- All three operational zones equally affected
18:25 - Cameron McFarland identified Google Cloud incident page
- Difficulty accessing GCP console due to the outage itself
18:32 - Alex Ives reached out via ext-google-cloud Slack channel
- Standard support channels unavailable
18:35 - Google Cloud DNS identified as having widespread issues
- Sarah unable to load GCP support page
18:44 - First signs of recovery: GitLab Pages sites returned
18:47 - First comprehensive status update posted
- Confirmed impact across all three zones
- Listed affected services
- Noted attempts to reach Google support
18:49 - Google Cloud status page finally acknowledged the incident
18:55 - Services beginning to show signs of recovery
- Handbook and Pages accessible again
19:05 - Cloudflare also reported outages (no impact to GitLab observed)
19:11 - Cloudflare traffic verified as normal via dashboards
20:00 - Major update: Root cause identified by Google
- Recovery in all locations except us-central1
- CI job backlog being processed
- Registry CDN still showing elevated errors
20:07 - 60% confidence in recovery achieved
- Container registry errors not impacting service
20:20 - Google Cloud status update: us-central1 still affected
20:24 - Root cause confirmed by Google TAM
- "Incorrect change to API endpoints caused crash loop"
- Affected global infrastructure
20:47 - Most services recovered
- CI runners improved
- Registry errors dropped below SLA
20:48 - Incident moved to Monitoring status
20:49 - Continued monitoring of AI Gateway and runner metrics
22:42 - Incident resolved
- All services confirmed recovered
- Post-incident review process initiated
Technical Details
Google Cloud confirmed the root cause as "an incorrect change to our API endpoints, which caused a crash loop and affected our global infrastructure, impacting all services."
Affected Google Cloud Services
The incident affected multiple GCP services across all regions:
- Google Cloud DNS
- Google Cloud Storage
- Google Compute Engine
- Identity and Access Management (IAM)
- Cloud CDN
- Load Balancers
- 40+ other services (full list documented in incident notes)
GitLab Service Impact Analysis
- GitLab Pages: Complete outage due to dependency on Google Cloud Storage buckets
- Container Registry: CDN errors and elevated error rates when accessing GCS buckets
- CI/CD: Job processing delays, artifact upload failures, runner provisioning issues
- Git operations: LFS operations impacted due to object storage dependency
- Web services: Image scaler failures due to inability to access stored images (a bucket-level check sketch follows this list)
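Since every impact above routes through the same object-storage dependency, one quick provider-side check is to request bucket metadata from the GCS JSON API and watch for timeouts or 5xx responses; an unauthenticated 401/403 still shows the API itself is reachable. The sketch below is illustrative, and the bucket names are placeholders rather than our real bucket names.

```python
import urllib.error
import urllib.request

# Placeholder feature -> bucket mapping; substitute the real bucket names.
FEATURE_BUCKETS = {
    "pages": "example-gitlab-pages",
    "registry": "example-gitlab-registry",
    "ci-artifacts": "example-gitlab-artifacts",
    "lfs": "example-gitlab-lfs",
}

GCS_BUCKET_API = "https://storage.googleapis.com/storage/v1/b/{bucket}"

def check_bucket(bucket, timeout=5.0):
    """Classify a GCS bucket-metadata request as reachable, denied, or failing."""
    url = GCS_BUCKET_API.format(bucket=bucket)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status} (reachable)"
    except urllib.error.HTTPError as exc:
        if exc.code in (401, 403, 404):
            return f"HTTP {exc.code} (API reachable; auth or name issue)"
        return f"HTTP {exc.code} (provider-side error)"
    except Exception as exc:
        return f"unreachable ({exc})"

if __name__ == "__main__":
    for feature, bucket in FEATURE_BUCKETS.items():
        print(f"{feature:14s} {check_bucket(bucket)}")
```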
Lessons Learned
- Cloud provider dependencies: This incident highlighted our deep dependency on Google Cloud Platform. While multi-zone deployment provides resilience against zonal failures, global cloud provider issues can still cause widespread impact.
- Importance of offline documentation: Critical escalation procedures and contact information should be available offline or in multiple locations to ensure accessibility during outages (a mirroring sketch follows this list).
- Vendor communication channels: We need established emergency communication channels with cloud providers that do not depend on the provider's own infrastructure.
- Monitoring dependencies: Our monitoring infrastructure depends on the same cloud provider we are monitoring, which can create blind spots during major outages.
- Blameless culture in action: The team's response demonstrated excellent blameless practices, focusing on understanding the problem and coordinating the response rather than on attribution.
- Proactive internal communication: Establishing clear criteria for when an outage should trigger company-wide internal communications would reduce the operational overhead on the response team.
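One concrete follow-up to the offline-documentation lesson is a scheduled job that mirrors the escalation-critical handbook pages to a location that does not share the handbook's hosting. The sketch below is an illustration only: the page URLs and target directory are placeholders, and the real job would need to run regularly on independent infrastructure.

```python
import pathlib
import urllib.request

# Placeholder list of escalation-critical pages; replace with the real set.
CRITICAL_PAGES = [
    "https://handbook.gitlab.com/handbook/engineering/infrastructure/incident-management/",
    "https://handbook.gitlab.com/handbook/engineering/infrastructure/",
]

MIRROR_DIR = pathlib.Path("offline-runbooks")

def mirror(url, timeout=10.0):
    """Download one page and write it under MIRROR_DIR, returning the file path."""
    MIRROR_DIR.mkdir(exist_ok=True)
    # Derive a stable filename from the final URL path segment.
    name = url.rstrip("/").rsplit("/", 1)[-1] + ".html"
    target = MIRROR_DIR / name
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        target.write_bytes(resp.read())
    return target

if __name__ == "__main__":
    for page in CRITICAL_PAGES:
        print("mirrored", mirror(page))
```

Pairing the mirror with a locally stored contact sheet, including the Google organization ID mentioned earlier, would cover the escalation gap described in the "What was difficult?" section.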
Review Guidelines
This review has been completed following our blameless postmortem culture, focusing on improving our systems and processes rather than individual actions. The incident revealed several areas for infrastructure and process improvements that will help us respond more effectively to similar incidents in the future.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields.
- Create a few short sentences in the Summary section summarizing what happened (TL;DR).
- Link any corrective actions and describe any other actions or outcomes from the incident.
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section.
- If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.
- Close the review before the due date.
- Go back to the incident channel or page and close out the remaining post-incident tasks.