Incident Review: Gitaly is down on gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal

INC-649: Gitaly is down on gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal

Generated by Florian Forster on 2 May 2025 12:07. All timestamps are local to Etc/UTC

Key Information

| Metric | Value |
|----|----|
| Customers Affected | 3385 |
| Requests Affected | 39644 |
| Incident Severity | Severity 2 |
| Impact Start Time | Fri, 02 May 2025 09:22:00 UTC |
| Impact End Time | Fri, 02 May 2025 11:35:00 UTC |
| Total Duration | 2 hours, 13 minutes |
| Link to Incident Issue | #19754 (closed) |
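
The total duration follows directly from the impact window above; a quick check of the arithmetic (a minimal sketch, nothing here is specific to our tooling):

```python
from datetime import datetime, timezone

# Impact window from the Key Information table.
start = datetime(2025, 5, 2, 9, 22, 0, tzinfo=timezone.utc)
end = datetime(2025, 5, 2, 11, 35, 0, tzinfo=timezone.utc)

hours, remainder = divmod(int((end - start).total_seconds()), 3600)
print(f"{hours} hours, {remainder // 60} minutes")  # -> 2 hours, 13 minutes
```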

Requests Affected

[Screenshot SCR-20250506-kmqm. Source - 7-day retention]

Total impacted requests: 39644.

Customers Affected

[Screenshot SCR-20250506-kopj. Source]

Total customers impacted: 3385

Top impacted users and projects (ranked by impacted requests)

| Top impacted users | Top impacted projects |
|----|----|
| [Screenshot SCR-20250506-kpwv] | [Screenshot SCR-20250506-kqjz] |

Support impact

Four support tickets were raised by customers (see the search query in Zendesk).

Summary

Problem: Gitaly is down on gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal.

Impact: Customers using this host cannot access their Git repositories.

Causes: An application-level configuration change was made one day before the incident (see details). This change targeted two nodes in the gprd Gitaly fleet, including the affected node. After the configuration update, the host began experiencing heavy disk activity and gradual memory buildup. By the time the incident started, the Gitaly server had become memory-saturated, turned unresponsive, and stopped accepting requests.
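
The memory buildup was visible in node-level metrics well before saturation. As a minimal sketch of how to confirm this from the Prometheus HTTP API (the endpoint address is a placeholder, and the `fqdn` label is an assumption based on the dashboard links later in this review), one could run:

```python
"""Check memory headroom on the affected Gitaly host via Prometheus.

The endpoint below is a hypothetical placeholder; node_memory_* metrics are
the standard node_exporter names.
"""
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder endpoint
FQDN = "gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal"

# Fraction of memory still available on the node.
query = (
    f'node_memory_MemAvailable_bytes{{fqdn="{FQDN}"}}'
    f' / node_memory_MemTotal_bytes{{fqdn="{FQDN}"}}'
)
url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": query})

with urllib.request.urlopen(url, timeout=10) as resp:
    for series in json.load(resp)["data"]["result"]:
        available = float(series["value"][1])
        print(f"{FQDN}: {available:.1%} memory available")
```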

Response strategy: We worked to identify the root cause. Our initial hypothesis wrongly pointed at a GCP host networking issue, and we spent time examining GCP logs, checking migration events, and trying to contact GCP support. Later, we correctly identified a configuration change as the cause. We then focused our efforts on rolling back the change and reapplying the configuration, which resolved the issue.

What went well?

  • I highly appreciated the coordination of incident responders. The whole team, including the IMOC, the on-call SRE, the Gitaly on-call, and the CMOC, collaborated smoothly and efficiently while investigating the root cause.
  • We developed multiple hypotheses and gathered data to verify them quickly.
  • Eventually, we identified the root cause and resolved the incident successfully.

What was difficult?

  • The rollout issue associated with this incident was incorrectly classified as C3 instead of C2, so it didn't receive as much attention or as many announcements as it should have. During the incident, the IMOC, on-call SREs, and Gitaly team members were unaware of these changes.
  • The timing of the rollout was inconvenient (during a public holiday).
  • Incident responders couldn't access the affected host after our Gitaly nodes were migrated to the new project structure. We spent significant time figuring out how to SSH into these nodes to trigger the startup script, an essential part of the rollback instructions documented in the rollout issue.
  • (Gitaly team only) There were no instructions for capturing performance data or traces from the saturated Gitaly process on that node (see the sketch after this list). We couldn't wait for the DRIs of the rollout to come online; the urgency to roll back was high since customers couldn't access their data.
  • For our first hypothesis, we tried to submit a support ticket to GCP. Unfortunately, the new Gitaly fleet project structure, in which VMs are distributed across multiple GCP projects, didn't give us enough support priority.
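
On capturing data from a saturated process: Gitaly is a Go service, so if its standard Go pprof endpoints are reachable (the listener address below is a hypothetical placeholder; adjust it to wherever the debug/metrics listener is bound), heap and goroutine snapshots can be grabbed quickly before a rollback destroys the evidence. A hedged sketch:

```python
"""Snapshot Go pprof profiles from a (hypothetically exposed) Gitaly listener.

/debug/pprof/heap and /debug/pprof/goroutine are standard Go pprof endpoints;
the host and port are placeholders for this sketch.
"""
import time
import urllib.request

PPROF_BASE = "http://localhost:9236/debug/pprof"  # placeholder address

PROFILES = {
    "heap": f"{PPROF_BASE}/heap",                    # binary heap profile
    "goroutine": f"{PPROF_BASE}/goroutine?debug=2",  # full goroutine dump (text)
}

stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
for name, url in PROFILES.items():
    out = f"gitaly-{name}-{stamp}.dump"
    with urllib.request.urlopen(url, timeout=30) as resp, open(out, "wb") as fh:
        fh.write(resp.read())
    print(f"saved {url} -> {out}")
```

Heap profiles captured this way can be analyzed later with `go tool pprof`, even after the node has been rolled back.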

Investigation Details

This comment covers the details of the root cause, the investigation, and the timeline leading up to the event.

Timeline

Incident Timeline

2025-05-02

09:40:58 Incident reported by Florian Forster

Florian Forster reported the incident

Severity: Severity 2

Status: Investigating

09:47:28 Update shared

Florian Forster shared an update

Shared graphs that seem to indicate that a migration disrupted the service, though the problem may already have been resolved.

09:47:37 Message from Rehab Hassanein

Rehab Hassanein's message was pinned by Florian Forster

Link to graph: https://dashboards.gitlab.net/goto/5EX13ObNg?orgId=1

09:49:03 Image posted by Bob Van Landuyt

Bob Van Landuyt posted an image to the channel

per-host SLIs seem to be recovering: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&from=2025-05-02T08:23:06.452Z&to=2025-05-02T09:44:59.999Z&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-fqdn=gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal

09:49:47 Message from Florian Forster

Florian Forster pinned their own message

The host-down event started at 9:21 UTC and ended at 9:38 UTC, i.e. a 17 minute window.

09:50:31 Update shared

Florian Forster shared an update

Said that this problem coincided with a Gitaly deployment.

09:52:55 Update shared

Florian Forster shared an update

Verifying that service has been restored for gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal.

09:56:06 Incident resolved and entered the post-incident flow

Florian Forster shared an update

Status: Investigating → Documenting

From what we can see, the incident is mitigated. We'll continue investigating what happened.

10:11:03 Incident re-opened

Florian Forster shared an update

Status: Documenting → Investigating

Reopening since the same host is experiencing problems again.

10:14:31 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

Logs from the Gitaly side (screenshot).

10:45:20 Message from Quang-Minh Nguyen

Quang-Minh Nguyen pinned their own message

New finding: this node was added to a group of experimental nodes for rolling out new Gitaly transactions: #19748 (closed). The config was updated at 2025-05-01 15:27 UTC.

10:45:34 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

Right after that, disk write throughput spiked and has remained elevated until now.

10:46:06 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

Disk IOPS also increased significantly to 10k ops/s

10:47:42 Message from Quang-Minh Nguyen

Quang-Minh Nguyen's message was pinned by Florian Forster

Rollback this MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5944/diffs

10:48:59 Message from Quang-Minh Nguyen

Quang-Minh Nguyen pinned their own message

Similarly, the other node in the experiment follows the same increase in disk IOPS: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&from=now-24h&to=now&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-fqdn=gitaly-07-stor-gprd.c.gitlab-gitaly-gprd-0fe1.internal

10:49:47 Message from Florian Forster

Florian Forster pinned their own message

Transactions have recently been enabled for gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal: #19748 (comment 2480277398)

10:51:45 Message from Quang-Minh Nguyen

Quang-Minh Nguyen pinned their own message

Another MR adding two nodes to the runlist: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10935/diffs

10:53:28 Update shared

Florian Forster shared an update

We have identified a recent change that enabled transactions on the problematic host and one other node. We'll roll back this change.

10:59:45 Update shared

Florian Forster shared an update

Decision: we'll revert https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10935 per #19748 (closed) rollback instructions.

11:05:34 Update shared

Florian Forster shared an update

Terraform rollback has been applied via https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10947

11:10:35 Update shared

Florian Forster shared an update

We're running Chef on the affected instance to pick up the changes.

11:19:46 Status changed from Investigating → Monitoring

Florian Forster shared an update

Status: Investigating → Monitoring

We have rolled back a change request that we believe caused these problems on this one Gitaly host. We're monitoring the situation for a little while to make sure the host remains healthy.

11:20:41 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

No more partitions are being started on gitaly-04-stor-gprd-c-gitlab-gitaly-gprd-83fd, indicating that transactions have stopped on this node. Source (https://dashboards.gitlab.net/d/cdqjq90oyrzswb/transactions?orgId=1&from=now-15m&to=now&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-stage=main).

11:23:04 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

Disk read/write I/O is back to normal.

11:24:35 Image posted by Quang-Minh Nguyen

Quang-Minh Nguyen posted an image to the channel

All Apdex metrics are back to normal now.

11:49:44 Incident resolved and entered the post-incident flow

Florian Forster shared an update

Status: Monitoring → Documenting

Based on all available data, we believe this incident to be fixed.

Investigation Notes

Any details you may want to add about the investigation can go here.

Actions

| Action | Owner |
|----|----|
| Let's revert this terraform one instead: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10935 | Rehab Hassanein |

Follow-ups

| Follow-up | Owner |
|----|----|
| Improve performance of Gitaly's Transactions system | @jcaigitlab |
| Review severity guidance | Unassigned |
| Revisit how to SSH into Gitaly hosts | Unassigned |
| Improve GCP support priority for the new Gitaly fleet project structure | Unassigned |
| Figure out which customers are affected | @dserafin-gitlab |

Review Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
  • Create a few short sentences in the Summary section summarizing what happened (TL;DR)
  • Link any corrective actions and describe any other actions or outcomes from the incident
  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
  • Once discussion wraps up in the comments, summarize any takeaways in the details section
  • If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.
  • Close the review before the due date
  • Go back to the incident channel or page and close out the remaining post-incident tasks