2025-04-19: Gitaly goserver SLI violation impacting cny stage apdex
Gitaly goserver SLI violation impacting cny stage apdex (Severity 2)
Problem: Intermittent high memory pressure and high CPU usage on the Gitaly canary node (gitaly-cny-01) result in increased errors for both web and GitLab in the cny stage, impacting the Gitaly goserver SLI and apdex.
Impact: The increase in errors is negatively affecting the performance and reliability of the repositories hosted on gitaly-cny-01, leading to 500 errors through the UI and errors during git operations.
Causes: Git pack-objects processes with large memory footprints linger for extended periods, leading to concurrency limits being reached and gRPC calls being queued and eventually timing out.
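The queueing effect described in the cause can be illustrated with a small simulation. This is a hedged sketch, not Gitaly's limiter: the function name simulate_queue, the FIFO model, and all numbers are assumptions chosen for illustration. It shows how one long-running call that holds the only concurrency slot makes later calls wait past their queue timeout.

```python
import heapq


def simulate_queue(limit, queue_timeout, requests):
    """Simulate a FIFO concurrency limiter with a queue timeout.

    limit:         number of calls allowed to run at once (hypothetical value)
    queue_timeout: longest time a call may wait for a slot before failing
    requests:      list of (arrival_time, duration), sorted by arrival_time
    Returns one outcome per request: "ok" or "timeout".
    """
    # Min-heap holding the time at which each slot becomes free.
    slot_free_at = [0.0] * limit
    heapq.heapify(slot_free_at)

    outcomes = []
    for arrival, duration in requests:
        # The call can start once the earliest slot is free.
        start = max(arrival, slot_free_at[0])
        if start - arrival > queue_timeout:
            # Waited too long in the queue: the call fails and never
            # occupies a slot.
            outcomes.append("timeout")
        else:
            # The call takes the slot until it finishes.
            heapq.heapreplace(slot_free_at, start + duration)
            outcomes.append("ok")
    return outcomes


# One lingering 60-second call followed by two short calls.
calls = [(0.0, 60.0), (1.0, 1.0), (2.0, 1.0)]
print(simulate_queue(limit=1, queue_timeout=5.0, requests=calls))
print(simulate_queue(limit=2, queue_timeout=5.0, requests=calls))
```

With a single slot the two short calls time out behind the long one; with a second slot they complete. In the incident the lingering calls were the memory-heavy pack-objects processes.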
Response strategy: A mitigation script that killed long-lived, memory-intensive git pack-objects processes kept memory pressure low until a permanent fix was implemented. The option to disable backpressure for pack-objects caching was deployed to production and we then disabled backpressure, but this produced no improvement. Further investigation revealed that we should set min_occurrences to 0. This latest change significantly reduced memory pressure, as intended. The temporary mitigation is no longer necessary and has been disabled.
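The ticket does not include the mitigation script itself. As a rough sketch of the selection logic such a script might use, the following picks pack-objects processes that are both long-lived and memory-intensive. The Proc record, the select_victims name, and both thresholds are hypothetical; gathering the process list and sending signals are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """Snapshot of one process (hypothetical record, e.g. parsed from ps)."""
    pid: int
    cmdline: str
    age_seconds: float
    rss_bytes: int


def select_victims(procs, max_age_seconds=1800, max_rss_bytes=4 * 1024**3):
    """Return PIDs of git pack-objects processes exceeding BOTH thresholds.

    The default thresholds are illustrative assumptions, not values
    taken from the incident.
    """
    victims = []
    for p in procs:
        args = p.cmdline.split()
        is_pack_objects = "pack-objects" in args
        too_old = p.age_seconds > max_age_seconds
        too_big = p.rss_bytes > max_rss_bytes
        if is_pack_objects and too_old and too_big:
            victims.append(p.pid)
    return victims


GIB = 1024**3
snapshot = [
    Proc(101, "git pack-objects --stdout", 7200, 6 * GIB),  # old and large
    Proc(102, "git pack-objects --stdout", 60, 6 * GIB),    # large but young
    Proc(103, "git upload-pack /repo.git", 7200, 6 * GIB),  # not pack-objects
    Proc(104, "git pack-objects --stdout", 7200, 1 * GIB),  # old but small
]
print(select_victims(snapshot))  # [101]
```

Requiring both conditions keeps the sketch conservative: a short-lived process with a large footprint, or a long-running one with a small footprint, is left alone.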
This ticket was created to track INC-518, by incident.io