For the roles assigned when the incident was declared, see the Timelines tab. For timeline feedback, see the dogfooding issue. To save time when entering timeline events, use the /timeline quick action.
Current Status
While rolling out a feature flag, we noticed increased error rates on the Web, API, and Git fleets. Disabling the feature flag resolved the issue.
More information will be added as we investigate the issue.
If you believe you are affected by this incident, please subscribe to this issue or monitor our status page for further updates.
Summary for CMOC notice / Exec summary:
Customer Impact: About 0.05% of requests would intermittently fail when accessing a crashed Gitaly node.
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, the timeline, or any other pieces of information, as laid out in our handbook page. Any such confidential data will be in a linked issue that is only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
I'm still surprised this FF had a negative impact, looking at the response times:
So the median latency is significantly lower, which is a good thing. I don't know exactly why it only shows up in the median, though. I think it's because of the caching on Gitaly's side. When there have been no changes on the default branch since the last call of that RPC, Gitaly can simply return the cached value. That would be the case if you look at the same project page over and over again. But when another project is visited, the stats need to be calculated from scratch. Since we flipped the FF recently, there won't be any cached stats for most projects, so many calls will result in a full language stat recalculation.
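To make that concrete, here's a minimal Go sketch of the caching behaviour as I understand it. This is illustrative only, not Gitaly's actual code: the `languageCache` type, the `CommitLanguages` helper, and the cache key scheme are assumptions. Stats are cached per repository and default-branch commit, so repeat visits to an unchanged project hit the cache, while the first visit after the flag flip recomputes everything.

```go
// Illustrative sketch: language stats cached per repo + default-branch commit.
package main

import (
	"fmt"
	"sync"
)

// Stats maps language name to byte count (illustrative only).
type Stats map[string]int

type languageCache struct {
	mu    sync.Mutex
	cache map[string]Stats // key: repo path + default-branch commit OID
}

func newLanguageCache() *languageCache {
	return &languageCache{cache: make(map[string]Stats)}
}

// CommitLanguages returns cached stats if the default branch has not moved
// since the last call; otherwise it recalculates from scratch.
func (c *languageCache) CommitLanguages(repo, headOID string, recalc func() Stats) Stats {
	key := repo + "@" + headOID

	c.mu.Lock()
	if s, ok := c.cache[key]; ok {
		c.mu.Unlock()
		return s // cache hit: cheap, which would explain the lower median latency
	}
	c.mu.Unlock()

	// Cache miss: full language stat recalculation (the expensive path that
	// dominates right after the feature flag is enabled for everyone).
	s := recalc()

	c.mu.Lock()
	c.cache[key] = s
	c.mu.Unlock()
	return s
}

func main() {
	c := newLanguageCache()
	expensive := func() Stats { return Stats{"Go": 1234, "Ruby": 5678} }

	// First call recalculates; the second call for the same OID is served from cache.
	fmt.Println(c.CommitLanguages("group/project", "abc123", expensive))
	fmt.Println(c.CommitLanguages("group/project", "abc123", expensive))
}
```

With that kind of keying, a push to the default branch naturally invalidates the entry, because the OID in the key changes.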
So in a way I also think this issue is a ramp-up thing. I might have been too aggressive in bumping up the FF. If we enabled it for e.g. 25% of projects for a few days, we would build up the cache for a subset of projects more gracefully, as in the sketch below.
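For illustration, a gradual rollout can key on a stable hash of the project so the enabled subset stays fixed while the percentage ramps up. This is just a sketch of the general technique, not GitLab's feature-flag implementation; `enabledFor` and the bucketing scheme are my own names.

```go
// Illustrative percentage-based rollout: each project hashes into a stable
// bucket 0-99, so enabling at 25% warms the cache for roughly a quarter of
// projects before going to 100%.
package main

import (
	"fmt"
	"hash/fnv"
)

// enabledFor reports whether the flag is on for a given project at the
// configured rollout percentage. The bucket is stable per project, so the
// same subset stays enabled as the percentage is ramped up.
func enabledFor(projectID string, percentage uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(projectID))
	return h.Sum32()%100 < percentage
}

func main() {
	for _, p := range []string{"group/a", "group/b", "group/c", "group/d"} {
		fmt.Printf("%s enabled at 25%%: %v\n", p, enabledFor(p, 25))
	}
}
```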
@reprazent Keep in mind we also suffer from the thundering herd issue discussed in gitlab-org/gitaly#3759 (comment 1126360625). So if no cache is available, concurrent page loads might race to populate the cache, each of them doing the time-consuming language stats calculation.
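One common mitigation for that race is request coalescing, e.g. with golang.org/x/sync/singleflight, so only one caller per repository does the expensive recalculation while the others wait for its result. A hedged sketch of the technique; the function names and wiring are mine, not a proposed Gitaly patch or the fix discussed in gitlab-org/gitaly#3759:

```go
// Illustrative request coalescing: concurrent callers for the same repo share
// one expensive language-stats calculation instead of racing to populate the cache.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// commitLanguages coalesces concurrent calls for the same repository so that
// only one goroutine performs the expensive recalculation while the others
// wait for its result.
func commitLanguages(repo string) (map[string]int, error) {
	v, err, shared := group.Do(repo, func() (interface{}, error) {
		// The expensive full language stats calculation would happen here.
		return map[string]int{"Go": 42}, nil
	})
	if err != nil {
		return nil, err
	}
	fmt.Printf("repo=%s shared=%v\n", repo, shared)
	return v.(map[string]int), nil
}

func main() {
	var wg sync.WaitGroup
	// Simulate concurrent page loads for the same project.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			commitLanguages("group/project")
		}()
	}
	wg.Wait()
}
```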
Now that doesn't explain to me why a time-consuming RPC can take down a server by making it incapable of handling other RPCs.
@toon The errors we were seeing were, as far as I know, client-side errors. They weren't tied to a particular RPC or a particular host. Since they were client-side errors, they wouldn't reach Gitaly, and thus wouldn't be recorded in Gitaly metrics.
This issue now has the NeedsCorrectiveActions label. This label will be removed automatically once there is at least one related issue labeled with corrective action or ~"infradev". While it is not strictly required, having a related issue with these labels helps ensure a similar incident doesn't happen again.
So the error happens in the DetectRepositoryLanguagesWorker (I forgot repo languages are refreshed in the background). It calls CommitLanguages, and that fails with 14:Socket closed. There isn't much to learn from that, sadly.
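For context, gRPC code 14 is codes.Unavailable, which is the class of error that 14:Socket closed falls under: the connection to the Gitaly node went away mid-call. Below is a hedged Go sketch of how a caller could classify and retry such an error. The real DetectRepositoryLanguagesWorker is a background worker on the Rails side, so everything here, including the function names, is illustrative only.

```go
// Illustrative handling of a gRPC Unavailable (code 14) error with a simple retry.
package main

import (
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callCommitLanguages stands in for the real RPC; here it always fails the
// way the worker saw it, with an Unavailable (code 14) error.
func callCommitLanguages() error {
	return status.Error(codes.Unavailable, "Socket closed")
}

// detectLanguages retries only when the error is Unavailable, i.e. the Gitaly
// node was briefly down and a later attempt is likely to succeed.
func detectLanguages(attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = callCommitLanguages()
		if err == nil {
			return nil
		}
		if s, ok := status.FromError(err); ok && s.Code() == codes.Unavailable {
			fmt.Printf("attempt %d: %v (retrying)\n", i+1, err)
			time.Sleep(100 * time.Millisecond) // simple backoff for the sketch
			continue
		}
		return err // non-retryable error
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	if err := detectLanguages(3); err != nil {
		fmt.Println(err)
	}
}
```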
I also think we might tighten up the SLO for the web/api/git services a bit to alert faster (gitlab-com/runbooks!5049 (merged)). We can't always count on @willmeek to be in the neighbourhood of QA tests.
Thanks Bob. The QA DRI on Pipeline Triage would hopefully notice production failures, though there might be a delay between QA failures, the initial investigation, and declaring an incident, so it's probably useful to look at alerts.
This wasn't limited to a certain set of users or projects: it would happen whenever a Gitaly node hosted a project that needed to have its language statistics refreshed. The job triggered a Gitaly crash, causing the node to become briefly unavailable for all projects on it. The failure would be intermittent, and a retry of whatever operation failed would allow the user to proceed.
@reprazent given the above new information (thank you!), it sounds like this wasn't a severity2. What would you suggest we re-classify it to according to our severity definitions?
@sloyd I was looking for that table to justify my thinking of reclassifying as severity3, thank you for linking.
I think this applies:
Broad impact on GitLab.com and minor inconvenience to typical user's workflow
Because every occurrence of the problem was brief per Gitaly host, the operation was likely to succeed on the next try. I'm lowering the severity; let me know if you think that is inappropriate.