2025-10-07: The gitlab_sshd git service unavailable (apdex violating SLO)
The gitlab_sshd git service unavailable (apdex violating SLO) (Severity 2 (High))
Problem: Git operations were degraded in both the main and canary environments, as shown by increased error rates and by multiple customer reports of slow responses.
Impact: Customers experienced slow git operations in both main and canary environments between 07:16 UTC and 09:29 UTC. Multiple emergency tickets were raised, with some users, especially in the EU region, reporting delays of up to 60 seconds for git pull operations.
Causes: At the onset of the incident, pods in us-east-1c and us-east-1d were evicted and repeatedly failed their readiness probes with HTTP 503 errors. These pod failures directly correlate with the period of degraded git operations. The exact reason for the readiness probe failures and pod evictions is still under investigation.
Response strategy: We escalated to incident managers, involved relevant teams, and updated the status page. After we identified pod failures in specific zones, service availability recovered. We are continuing to monitor git service health and have reached out to the team that owns GitLab Shell for further assistance.
This ticket was created to track INC-4555, by incident.io