2020-10-14 Gitaly ruby error rate spikes due to auth failures during FetchRemote to 1 remote domain
Summary
Quick reference for on-call engineers
This alert will probably continue to repeat. It has already done so 3 times in the last 2 hours. Use this Kibana query to check if any new alerts correlate with another spike of this pattern:
https://log.gprd.gitlab.net/goto/395a9eee3f6ffe4d54a296aaa5e8d4be
Pathology
Thousands of GitLab projects are trying and failing to run FetchRemote gRPC calls against a single remote domain. (For the specific domain name, see the last filter expression in this Kibana query.)
These FetchRemote gRPC calls fail because authentication to that remote domain fails. Enough projects are affected that Gitaly intermittently exceeds its error rate SLO.
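To confirm this kind of pattern from exported log events, failed FetchRemote calls can be grouped by the host of the remote URL. The following is a minimal sketch, not the query used during this incident. The event field names (`grpc_method`, `grpc_code`, `remote_url`) and the example hosts are hypothetical and would need to be mapped to the real log schema.

```python
from collections import Counter
from urllib.parse import urlsplit


def failures_by_domain(events):
    """Count failed FetchRemote events per remote host.

    `events` is an iterable of dicts. The keys used here (grpc_method,
    grpc_code, remote_url) are hypothetical field names, not the real schema.
    """
    counts = Counter()
    for event in events:
        if event.get("grpc_method") != "FetchRemote":
            continue  # only FetchRemote calls are of interest
        if event.get("grpc_code") == "OK":
            continue  # skip successful calls
        # hostname drops any user:password@ prefix and the port.
        host = urlsplit(event.get("remote_url", "")).hostname or "(unknown)"
        counts[host] += 1
    return counts


# Made-up sample events for illustration only.
sample_events = [
    {"grpc_method": "FetchRemote", "grpc_code": "Unknown",
     "remote_url": "https://user@git.example.com/a.git"},
    {"grpc_method": "FetchRemote", "grpc_code": "Unknown",
     "remote_url": "https://git.example.com/b.git"},
    {"grpc_method": "FetchRemote", "grpc_code": "OK",
     "remote_url": "https://other.example.org/c.git"},
    {"grpc_method": "FindCommit", "grpc_code": "Unknown",
     "remote_url": ""},
]

for host, count in failures_by_domain(sample_events).most_common():
    print(host, count)
```

If one host dominates the counts, as it did here, the failures most likely share a cause on the remote side rather than on any individual Gitaly node.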
PagerDuty alerts (serially triggered, all following the above pattern):
- https://gitlab.pagerduty.com/incidents/PQW09WW
- https://gitlab.pagerduty.com/incidents/PYYRYLL
- https://gitlab.pagerduty.com/incidents/P2ALZQ8
Next steps
Practically speaking, I don't know that we can fix this. And it will continue to intermittently alert as long as the auth failures are occurring.
I'm going to see if a single GitLab user owns these repos, and if so, hopefully Support can contact that user.
In the meantime, this alert may be noisy. I'll mention it in the on-call handover, but I'll avoid silencing it for now.
Timeline
All times UTC.
2020-10-14
- 03:26 - Start of a series of intermittent spikes in the error rate of Gitaly's FetchRemote gRPC endpoint. This eventually alerts when the error rate exceeds the SLO's multi-burn-rate thresholds (specifically the 5-minute and 1-hour burn rates).
- 03:58 - 1st of a series of PagerDuty alerts, all indicating that Gitaly's gitaly ruby component is exceeding its error rate SLO (https://gitlab.pagerduty.com/incidents/PQW09WW). The alert does not identify the affected gRPC call or the common pattern among the affected repos. Because many repos are involved, no one Gitaly node stands out as implicated.
- 04:03 - msmiley declares incident in Slack using the /incident declare command.
- 05:12 - Summarized what we have learned so far: all of the error events are associated with authentication failures to a single remote domain: #2822 (comment 429258321)
Timeline showing the error events so far (03:00 to 05:15 UTC): https://log.gprd.gitlab.net/goto/3ee165d7c5c8f3063180e3264040bad4
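The multi-burn-rate alerting mentioned in the 03:26 entry can be sketched as follows. This is an illustration of the general technique, not the actual alert rule: the SLO target (99.9%), the burn-rate threshold (14.4) and the request counts are all assumed values chosen for the example. The alert fires only when both the short (5-minute) and long (1-hour) windows burn error budget faster than the threshold, which is why short intermittent spikes page only some of the time.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one look-back window."""
    errors: int
    total: int

    @property
    def error_ratio(self) -> float:
        return self.errors / self.total if self.total else 0.0


def burn_rate(window: Window, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return window.error_ratio / error_budget


def should_alert(short: Window, long: Window,
                 slo_target: float, threshold: float) -> bool:
    """Alert only if both windows exceed the burn-rate threshold."""
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)


SLO_TARGET = 0.999   # assumed value, not the real Gitaly SLO
THRESHOLD = 14.4     # assumed value, a commonly cited 1h/5m threshold

# A spike visible in both windows: alert fires.
sustained = should_alert(Window(20, 1_000), Window(1_500, 100_000),
                         SLO_TARGET, THRESHOLD)

# A spike only in the 5-minute window: no alert yet.
brief = should_alert(Window(20, 1_000), Window(500, 100_000),
                     SLO_TARGET, THRESHOLD)

print(sustained, brief)
```

Requiring both windows keeps a single brief spike from paging, while a repeating series of spikes like the one in this incident eventually pushes the 1-hour window over the threshold as well.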
Incident Review
- Service(s) affected: Service::Gitaly
- Minutes downtime or degradation: 0
