On-Call Handover 2020-06-01 15:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @ahmadsherif
- EOC ingress: @alejandro
Summary:
Overall a quiet shift. The same 3 API alerts fired, but this time we got to the bottom of them and the issue is now in the hands of the support team. There was a bit of abuse of Gitaly's GetArchive (the requests were downloading the handbook repo), but it ended on its own.
What (if any) time-critical work is being handed over?
Please list any urgent tasks that cannot wait, such as mitigating an active incident or preventing imminent breakage.
What contextual info may be useful for the next few on-call shifts?
Please help the incoming on-callers to efficiently interpret symptoms or alerts. For example:
- Are any system components known to be in an abnormal, degraded, or risky state?
- Do you think any recent problems may recur or have delayed or lingering side effects?
- If a significant incident occurred, can you briefly summarize its current status?
Ongoing alerts/incidents:
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2211
- production#2208 (closed) - 2020-05-30: A spike of requests to a single Pages site
- production#2207 (closed) - 2020-05-30: dev.gitlab.org is down
- production#2203 (closed) - 401s on user actions from the MR page
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2201
- production#2191 (closed) - Investigate Atlassian IPs being blocked by Cloudflare
- production#2132 (closed) - Degraded performance on shared CI runners
Resolved actionable alerts:
- https://gitlab.pagerduty.com/incidents/P5XXGLN - [#21314] Firing 1 - 5% disk space left
- https://gitlab.pagerduty.com/incidents/P9TAKEN - [#21317] Firing 1 - 5% disk space left
- https://gitlab.pagerduty.com/incidents/PFDK2ZN - [#21326] Firing 1 - Gitaly latency on file-praefect-01-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m
- https://gitlab.pagerduty.com/incidents/PX15MTZ - [#21334] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO
- https://gitlab.pagerduty.com/incidents/PIFQR1H - [#21335] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO