2025-09-08: The rails_request SLI of the web service (cny stage) has an error rate violating SLO
The rails_request SLI of the web service (cny stage) has an error rate violating SLO (Severity 3 (Medium))
Problem: The rails_request SLI of the web service (cny stage) has an elevated error rate violating the Service Level Objective (SLO) due to errors primarily associated with the gitlab-org/gitlab project.
Impact: This issue degraded the apdex score and raised the error ratio, affecting the performance and reliability of the web and goserver services.
Causes: The incident was triggered by Gitlab::Git::CommandTimedOut errors on the gitaly-cny-01-stor-gprd.c.gitlab-production.internal instance, concentrated on the gitlab-org/gitlab project. Cache-related issues compounded the problem, as indicated by the errors "iterating objects: context deadline exceeded" and "one or more cache generations are pending transition for the current repository".
Response strategy: The feature flag related to the problem was turned off, which led to a temporary recovery of the web service and an improvement in the goserver SLI apdex. The team responsible for the Gitaly changes was also engaged to investigate whether the errors are related to recent changes.
This ticket was created to track INC-3768, by incident.io