Incident Review: Gitaly is down on gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal

INC-649: Gitaly is down on gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal

Generated by Florian Forster on 2 May 2025 12:07. All timestamps are local to Etc/UTC

Key Information

| Metric | Value |
|----|----|
| Customers Affected | 3385 |
| Requests Affected | 39644 |
| Incident Severity | Severity 2 |
| Impact Start Time | Fri, 02 May 2025 09:22:00 UTC |
| Impact End Time | Fri, 02 May 2025 11:35:00 UTC |
| Total Duration | 2 hours, 13 minutes |
| Link to Incident Issue | #19754 (closed) |
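
The total duration follows directly from the impact window above; a quick check of the arithmetic (a minimal sketch, nothing here is specific to our tooling):

```python
from datetime import datetime, timezone

# Impact window from the Key Information table.
start = datetime(2025, 5, 2, 9, 22, 0, tzinfo=timezone.utc)
end = datetime(2025, 5, 2, 11, 35, 0, tzinfo=timezone.utc)

hours, remainder = divmod(int((end - start).total_seconds()), 3600)
print(f"{hours} hours, {remainder // 60} minutes")  # -> 2 hours, 13 minutes
```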

Requests Affected

[Screenshot SCR-20250506-kmqm. Source - 7-day retention]

Total impacted requests: 39644.

Customers Affected

[Screenshot SCR-20250506-kopj. Source]

Total customers impacted: 3385

Top impacted users and projects (ranked by impacted requests)

| Top impacted users | Top impacted projects |
|----|----|
| [Screenshot SCR-20250506-kpwv] | [Screenshot SCR-20250506-kqjz] |

Support impact

Four support tickets were raised by customers (see the search query in Zendesk).

Summary

Problem: Gitaly is down on gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal.

Impact: Customers using this host cannot access their Git repositories.

Causes: An application-level configuration change was made one day before the incident (see details). This change targeted two nodes in the gprd Gitaly fleet, including the affected node. After the configuration update, the host began experiencing heavy disk activity and gradual memory buildup. By the time the incident started, the Gitaly server had become memory-saturated, turned unresponsive, and stopped accepting requests.
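
The memory buildup was visible in node-level metrics well before saturation. As a minimal sketch of how to confirm this from the Prometheus HTTP API (the endpoint address is a placeholder, and the `fqdn` label is an assumption based on the dashboard links later in this review), one could run:

```python
"""Check memory headroom on the affected Gitaly host via Prometheus.

The endpoint below is a hypothetical placeholder; node_memory_* metrics are
the standard node_exporter names.
"""
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder endpoint
FQDN = "gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal"

# Fraction of memory still available on the node.
query = (
    f'node_memory_MemAvailable_bytes{{fqdn="{FQDN}"}}'
    f' / node_memory_MemTotal_bytes{{fqdn="{FQDN}"}}'
)
url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": query})

with urllib.request.urlopen(url, timeout=10) as resp:
    for series in json.load(resp)["data"]["result"]:
        available = float(series["value"][1])
        print(f"{FQDN}: {available:.1%} memory available")
```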

Response strategy: We worked to identify the root cause. Our initial hypothesis wrongly pointed at a GCP host networking issue, and we spent time examining GCP logs, checking migration events, and trying to contact GCP support. Later, we correctly identified a configuration change as the cause. We then focused our efforts on rolling back the change and reapplying the configuration, which resolved the issue.

What went well?

  • I highly appreciated the coordination of incident responders. The whole team, including the IMOC, the on-call SRE, the Gitaly on-call, and the CMOC, collaborated smoothly and efficiently while investigating the root cause.
  • We developed multiple hypotheses and gathered data to verify them quickly.
  • Eventually, we identified the root cause and resolved the incident successfully.

What was difficult?

  • The rollout issue associated with this incident was incorrectly classified as C3 instead of C2, so it didn't receive as much attention or as many announcements as it should have. During the incident, the IMOC, on-call SREs, and Gitaly team members were unaware of these changes.
  • The timing of the rollout was inconvenient (during a public holiday).
  • Incident responders couldn't access the affected host after our Gitaly nodes were migrated to the new project structure. We spent significant time figuring out how to SSH into these nodes to trigger the startup script, an essential part of the rollback instructions documented in the rollout issue.
  • (Gitaly team only) There were no instructions for capturing performance data or traces from the saturated Gitaly process on that node (see the sketch after this list). We couldn't wait for the DRIs of the rollout to come online; the urgency to roll back was high since customers couldn't access their data.
  • For our first hypothesis, we tried to submit a support ticket to GCP. Unfortunately, the new Gitaly fleet project structure, in which VMs are distributed across multiple GCP projects, didn't give us enough support priority.
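
On capturing data from a saturated process: Gitaly is a Go service, so if its standard Go pprof endpoints are reachable (the listener address below is a hypothetical placeholder; adjust it to wherever the debug/metrics listener is bound), heap and goroutine snapshots can be grabbed quickly before a rollback destroys the evidence. A hedged sketch:

```python
"""Snapshot Go pprof profiles from a (hypothetically exposed) Gitaly listener.

/debug/pprof/heap and /debug/pprof/goroutine are standard Go pprof endpoints;
the host and port are placeholders for this sketch.
"""
import time
import urllib.request

PPROF_BASE = "http://localhost:9236/debug/pprof"  # placeholder address

PROFILES = {
    "heap": f"{PPROF_BASE}/heap",                    # binary heap profile
    "goroutine": f"{PPROF_BASE}/goroutine?debug=2",  # full goroutine dump (text)
}

stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
for name, url in PROFILES.items():
    out = f"gitaly-{name}-{stamp}.dump"
    with urllib.request.urlopen(url, timeout=30) as resp, open(out, "wb") as fh:
        fh.write(resp.read())
    print(f"saved {url} -> {out}")
```

Heap profiles captured this way can be analyzed later with `go tool pprof`, even after the node has been rolled back.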

Investigation Details

This comment covers the details of the root cause, the investigation, and the timeline leading up to the event.

Timeline

Incident Timeline

2025-05-02

09:40:58 Incident reported by Florian Forster

Florian Forster reported the incident

Severity: Severity 2

Status: Investigating

09:47:28 Update shared

Florian Forster shared an update

Shared graphs that seem to indicate that a migration disrupted the service, though the problem may already have been resolved.

09:47:37 Message from Rehab Hassanein

Rehab Hassanein's message was pinned by Florian Forster

Link to graph: https://dashboards.gitlab.net/goto/5EX13ObNg?orgId=1

09:49:03 Image posted by Bob Van Landuyt

Bob Van Landuyt posted an image to the channel

per-host SLIs seem to be recovering: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&from=2025-05-02T08:23:06.452Z&to=2025-05-02T09:44:59.999Z&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-fqdn=gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal

09:49:47 Message from Florian Forster

Florian Forster pinned their own message

The host-down event started at 9:21 UTC and ended at 9:38 UTC, i.e. a 17 minute window.

09:50:31 Update shared

Florian Forster shared an update

Said that this problem coincided with a Gitaly deployment.

09:52:55 Update shared

Florian Forster shared an update

Verifying that service has been restored for gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal.

09:56:06 Incident resolved and entered the post-incident flow

Florian Forster shared an update

Status: Investigating → Documenting

From what we can see, the incident is mitigated. We'll continue investigating what happened.

10:11:03 Incident re-opened

Florian Forster shared an update

Status: Documenting → Investigating

Reopening since the same host is experiencing problems again.

10:14:31 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

Logs from the Gitaly side (screenshot).

10:45:20 Message from Quang-Minh Nguyen

Quang-Minh Nguyen pinned their own message

New finding: this node was added to a group of experimental nodes for rolling out new Gitaly transactions: #19748 (closed). The config was updated at 2025-05-01 15:27 UTC.

10:45:34 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

Right after that, disk write throughput spiked and has remained elevated until now.

10:46:06 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

Disk IOPS also increased significantly to 10k ops/s

10:47:42 Message from Quang-Minh Nguyen

Quang-Minh Nguyen's message was pinned by Florian Forster

Rollback this MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5944/diffs

10:48:59 Message from Quang-Minh Nguyen

Quang-Minh Nguyen pinned their own message

Similarly, the other node in the experiment follows the same increase in disk IOPS: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&from=now-24h&to=now&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-fqdn=gitaly-07-stor-gprd.c.gitlab-gitaly-gprd-0fe1.internal

10:49:47 Message from Florian Forster

Florian Forster pinned their own message

Transactions have recently been enabled for gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal: #19748 (comment 2480277398)

10:51:45 Message from Quang-Minh Nguyen

Quang-Minh Nguyen pinned their own message

Another MR adding two nodes to the runlist: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10935/diffs

10:53:28 Update shared

Florian Forster shared an update

We have identified a recent change that enabled transactions on the problematic host and one other node. We'll roll back this change.

10:59:45 Update shared

Florian Forster shared an update

Decision: we'll revert https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10935 per #19748 (closed) rollback instructions.

11:05:34 Update shared

Florian Forster shared an update

Terraform rollback has been applied via https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10947

11:10:35 Update shared

Florian Forster shared an update

We're running Chef on the affected instance to pick up the changes.

11:19:46 Status changed from Investigating → Monitoring

Florian Forster shared an update

Status: Investigating → Monitoring

We have rolled back a change request that we believe caused these problems on this one Gitaly host. We're monitoring the situation for a little while to make sure the host remains healthy.

11:20:41 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

No more partitions are being started on gitaly-04-stor-gprd-c-gitlab-gitaly-gprd-83fd, indicating that transactions have stopped on this node. Source (https://dashboards.gitlab.net/d/cdqjq90oyrzswb/transactions?orgId=1&from=now-15m&to=now&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-stage=main).

11:23:04 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

Disk read/write I/O is back to normal.

11:24:35 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

All Apdex metrics are back to normal now.

11:49:44 Incident resolved and entered the post-incident flow

Florian Forster shared an update

Status: Monitoring → Documenting

Based on all available data, we believe this incident to be fixed.

Investigation Notes

Any details you may want to add about the investigation can go here.

Actions

| Action | Owner |
|----|----|
| Let's revert this terraform one instead: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10935 | Rehab Hassanein |

Follow-ups

| Follow-up | Owner |
|----|----|
| Improve performance of Gitaly's Transactions system | @jcaigitlab |
| Review severity guidance | Unassigned |
| Revisit how to SSH into Gitaly hosts | Unassigned |
| Improve GCP support priority for the new Gitaly fleet project structure | Unassigned |
| Figure out which customers are affected | @dserafin-gitlab |

Review Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
  • Create a few short sentences in the Summary section summarizing what happened (TL;DR)
  • Link any corrective actions and describe any other actions or outcomes from the incident
  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
  • Once discussion wraps up in the comments, summarize any takeaways in the details section
  • If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.
  • Close the review before the due date
  • Go back to the incident channel or page and close out the remaining post-incident tasks