Incident Review: INC-649: Gitaly is down on gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal
Generated by Florian Forster on 2 May 2025 12:07. All timestamps are local to Etc/UTC
Key Information
| Metric | Value |
|---|---|
| Customers Affected | 3385 |
| Requests Affected | 39644 |
| Incident Severity | Severity 2 |
| Impact Start Time | Fri, 02 May 2025 09:22:00 UTC |
| Impact End Time | Fri, 02 May 2025 11:35:00 UTC |
| Total Duration | 2 hours, 13 minutes |
| Link to Incident Issue | #19754 (closed) |
Requests Affected
Total impacted requests: 39644.
Customers Affected
Total customers impacted: 3385
Top impacted users and projects (ranked by impacted requests)
| Top impacted users | Top impacted projects |
|---|---|
| (chart not captured in this export) | (chart not captured in this export) |
Support impact
Four support tickets were raised by customers (see the search query in Zendesk).
Summary
Problem: Gitaly is down on gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal.
Impact: Customers using this host cannot access their Git repositories.
Causes: An application-level configuration change was made one day before the incident (see details). This change targeted two nodes in the gprd Gitaly fleet, including the affected node. After the configuration update, the host began experiencing heavy disk activity and gradual memory buildup. When the incident started, the Gitaly server became memory saturated, stopped responding, and stopped accepting requests.
Response strategy: We worked to identify the root cause. Our initial hypothesis wrongly suggested a GCP host networking issue. We spent time examining GCP logs, checking migration events, and trying to contact GCP support. Later, we correctly identified that a configuration change caused the problem. We then focused our efforts on rolling back and reapplying the configuration, which effectively resolved the issue.
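The saturation pattern described above can be confirmed quickly from the host itself. Below is a minimal triage sketch, assuming a standard Linux Gitaly node; the commands are illustrative and are not taken from the GitLab runbooks.

```shell
# Illustrative host-triage commands for memory and disk saturation.

# Memory headroom: on a saturated host, MemAvailable trends toward zero.
grep -E '^(MemTotal|MemAvailable)' /proc/meminfo

# Cumulative completed reads ($4) and writes ($8) per block device since
# boot; sample twice and diff over the interval to estimate IOPS.
awk '$3 ~ /^(sd|nvme)/ {print $3, "reads:", $4, "writes:", $8}' /proc/diskstats

# Largest resident processes; during this incident the gitaly process
# would be expected to dominate.
ps -eo pid,comm,rss --sort=-rss | head -n 5
```

This is a first-pass check only; the dashboards linked in the timeline below give the same signals with history attached.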
What went well?
- I highly appreciated the coordination of incident responders. The whole team, including IMOC, SRE on-call, Gitaly on-call, and CMOC collaborated smoothly and efficiently while investigating the root cause.
- We developed multiple hypotheses and gathered data to verify them quickly.
- Eventually, we identified the root cause and resolved the incident successfully.
What was difficult?
- The rollout issue associated with this incident was incorrectly classified as C3 instead of C2, so it didn't receive the attention or announcements it should have. During the incident, IMOC, on-call SREs, and Gitaly team members were unaware of these changes.
- The timing of the rollout was inconvenient (during a public holiday).
- Incident responders couldn't access the affected host after migrating our Gitaly node to the new project structure. We spent significant time figuring out how to SSH to these nodes to trigger the startup script, which was an essential part of the rollback instructions documented in the rollout issue.
- (Gitaly team only) There were no instructions for capturing performance data or traces from the saturated Gitaly process on that node. We couldn't wait for the DRIs of the rollout to come online; the urgency to roll back was high since customers couldn't access their data.
- For our first hypothesis, we tried to submit a support ticket to GCP. Unfortunately, the new Gitaly fleet project structure, in which VMs are distributed across multiple GCP projects, didn't give us enough support priority.
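On the missing instructions for capturing performance data: Gitaly is a Go service, so one plausible approach is pulling pprof profiles before rolling back. The sketch below assumes the node exposes Go's debug handlers on a local listener; the address and port are assumptions, not the documented production configuration, and whether these handlers are enabled on production Gitaly nodes would need to be confirmed in the runbooks.

```shell
# Assumed debug address; check the node's Gitaly config for the real listener.
GITALY_DEBUG_ADDR="localhost:9236"

# Heap profile, to attribute the gradual memory buildup.
curl -sf "http://${GITALY_DEBUG_ADDR}/debug/pprof/heap" -o gitaly-heap.pb.gz \
  || echo "heap profile capture failed"

# 30-second CPU profile while the process is saturated.
curl -sf "http://${GITALY_DEBUG_ADDR}/debug/pprof/profile?seconds=30" -o gitaly-cpu.pb.gz \
  || echo "cpu profile capture failed"

# Full goroutine dump, useful when the server stops accepting requests.
curl -sf "http://${GITALY_DEBUG_ADDR}/debug/pprof/goroutine?debug=2" -o gitaly-goroutines.txt \
  || echo "goroutine dump capture failed"
```

With profiles in hand, `go tool pprof gitaly-heap.pb.gz` can break memory usage down by allocation site offline, after the rollback has restored service.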
Investigation Details
This comment captures the details of the root cause, investigation, and timeline leading to the event.
Timeline
Incident Timeline
2025-05-02
09:40:58 Incident reported by Florian Forster
Florian Forster reported the incident
Severity: Severity 2
Status: Investigating
09:47:28 Update shared
Florian Forster shared an update
Has shared graphs that seem to indicate that a migration has disrupted the service, but it may already have been resolved.
09:47:37 Message from Rehab Hassanein
Rehab Hassanein's message was pinned by Florian Forster
Link to graph: https://dashboards.gitlab.net/goto/5EX13ObNg?orgId=1
09:49:03 Image posted by Bob Van Landuyt
Bob Van Landuyt posted an image to the channel
per-host SLIs seem to be recovering: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&from=2025-05-02T08:23:06.452Z&to=2025-05-02T09:44:59.999Z&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-fqdn=gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal
09:49:47 Message from Florian Forster
Florian Forster pinned their own message
The host-down event started at 9:21 UTC and ended at 9:38 UTC, i.e. a 17-minute window.
09:50:31 Update shared
Florian Forster shared an update
said this problem coincided with a Gitaly deployment.
09:52:55 Update shared
Florian Forster shared an update
is verifying that service has been restored for gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal
09:56:06 Incident resolved and entered the post-incident flow
Florian Forster shared an update
Status: Investigating → Documenting
From what we can see, the incident is mitigated. We'll continue investigating what happened.
10:11:03 Incident re-opened
Florian Forster shared an update
Status: Documenting → Investigating
Reopening since the same host is experiencing problems again.
10:14:31 Image posted by Quang-Minh Nguyen
Quang-Minh Nguyen posted an image to the channel
Logs from Gitaly side:
10:45:20 Message from Quang-Minh Nguyen
Quang-Minh Nguyen pinned their own message
New finding: this node was added to a group of experimental nodes for rolling out new Gitaly transactions: #19748 (closed). The config was updated 2025-05-01 15:27 UTC.
10:45:34 Image posted by Quang-Minh Nguyen
Quang-Minh Nguyen posted an image to the channel
Right after that, disk write throughput spiked and has remained elevated until now
10:46:06 Image posted by Quang-Minh Nguyen
Quang-Minh Nguyen posted an image to the channel
Disk IOPS also increased significantly to 10k ops/s
10:47:42 Message from Quang-Minh Nguyen
Quang-Minh Nguyen's message was pinned by Florian Forster
Rollback this MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5944/diffs
10:48:59 Message from Quang-Minh Nguyen
Quang-Minh Nguyen pinned their own message
Similarly, the other node in the experiment shows the same increase in disk IOPS: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&from=now-24h&to=now&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-fqdn=gitaly-07-stor-gprd.c.gitlab-gitaly-gprd-0fe1.internal
10:49:47 Message from Florian Forster
Florian Forster pinned their own message
Transactions have recently been enabled for gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal: #19748 (comment 2480277398)
10:51:45 Message from Quang-Minh Nguyen
Quang-Minh Nguyen pinned their own message
Another MR adding two nodes to the runlist: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10935/diffs
10:53:28 Update shared
Florian Forster shared an update
We have identified a recent change, enabling transactions on the problematic host and one more. We'll roll back this change.
10:59:45 Update shared
Florian Forster shared an update
Decision: we'll revert https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10935 per #19748 (closed) rollback instructions.
11:05:34 Update shared
Florian Forster shared an update
Terraform rollback has been applied via https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10947
11:10:35 Update shared
Florian Forster shared an update
We're running Chef on the affected instance to pick up the changes.
11:19:46 Status changed from Investigating → Monitoring
Florian Forster shared an update
Status: Investigating → Monitoring
We have rolled back a change request that we believe caused these problems on this one Gitaly host. We're monitoring the situation for a little while to make sure the host remains healthy.
11:20:41 Image posted by Quang-Minh Nguyen
Quang-Minh Nguyen posted an image to the channel
No more partitions are started on gitaly-04-stor-gprd-c-gitlab-gitaly-gprd-83fd, indicating that transactions have been stopped on this node. Source (https://dashboards.gitlab.net/d/cdqjq90oyrzswb/transactions?orgId=1&from=now-15m&to=now&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-stage=main).
11:23:04 Image posted by Quang-Minh Nguyen
Quang-Minh Nguyen posted an image to the channel
Disk read/write IO is back to normal
11:24:35 Image posted by Quang-Minh Nguyen
Quang-Minh Nguyen posted an image to the channel
All apdex metrics are back to normal now
11:49:44 Incident resolved and entered the post-incident flow
Florian Forster shared an update
Status: Monitoring → Documenting
Based on all available data, we believe this incident to be fixed.
Investigation Notes
Actions
| Action | Owner |
|---|---|
| | Rehab Hassanein |
Follow-ups
| Action | Owner |
|---|---|
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
- Create a few short sentences in the Summary section summarizing what happened (TL;DR)
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section
- If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.
- Close the review before the due date
- Go back to the incident channel or page and close out the remaining post-incident tasks