On-Call Handover 2020-06-02 23:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
Summary:
What (if any) time-critical work is being handed over?
N/A
What contextual info may be useful for the next few on-call shifts?
CI Runner jobs are sometimes failing in their attempts to send job trace output to the gitlab.com API via HTTP PATCH requests. When this happens, the job appears to stall.
We initially thought this was related to https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2211, where an API node handles a large (multi-gigabyte) upload, fills its filesystem by buffering the request to disk, and then fails all other concurrent requests until that large request's buffer is deleted. I suspect the deletion happens automatically, which would make this a short-lived, intermittent failure. However, we found no evidence of this happening today in the Prometheus metrics or the nginx logs (the logs go to a different filesystem that does not fill up), so we are looking for other explanations.
These CI jobs send their trace output through Cloudflare to reach gitlab.com/api, and the first example we checked had a surprising Cloudflare log entry. We are currently looking for more examples to see whether that was an anomaly.
Most of this is currently in a Slack thread. Now that we suspect it is not the same as the existing open issue, I'm going to open a new one.
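For whoever picks up the search for more examples, here is a rough sketch of how one might tally failed trace PATCH requests from access-log lines. It is only an illustration under stated assumptions: it assumes the logs are in nginx "combined" format and that Runner trace updates hit a path of the form /api/v4/jobs/<id>/trace; neither detail is confirmed in this handover, and the function name `failed_trace_patches` is hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: lines follow the nginx "combined" log format. The real
# gitlab.com log format may differ, so adjust this pattern as needed.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

# ASSUMPTION: job trace updates are sent to /api/v4/jobs/<id>/trace.
TRACE_RE = re.compile(r"^/api/v4/jobs/(?P<job_id>\d+)/trace(?:\?|$)")


def failed_trace_patches(lines):
    """Count non-2xx PATCH responses on trace paths, keyed by (job_id, status)."""
    counts = Counter()
    for line in lines:
        request = LINE_RE.search(line)
        if request is None or request.group("method") != "PATCH":
            continue
        trace = TRACE_RE.match(request.group("path"))
        if trace is None:
            continue
        status = int(request.group("status"))
        if not 200 <= status < 300:
            counts[(trace.group("job_id"), status)] += 1
    return counts
```

Feeding it an iterable of log lines (for example, an open file object) returns a Counter, so `most_common()` would surface the jobs and status codes that fail most often. Whether the same approach fits the Cloudflare logs depends on their format, which is not described here.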
Ongoing alerts/incidents:
All of these are carry-over from previous shifts.
This is the only one that's really still actionable, and it's the one mentioned above.
The following are really historical incidents that are pending incident reviews, but they are not labeled consistently and our handover automation does not yet know about the new labeling scheme anyway.
- production#2207 (closed) - 2020-05-30: dev.gitlab.org is down
- production#2203 (closed) - 401s on user actions from the MR page
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2201
- production#2191 (closed) - Investigate Atlassian IPs being blocked by Cloudflare
- production#2132 (closed) - Degraded performance on shared CI runners