On-Call Handover 2021-02-22 07:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @craig
- EOC ingress: @igorwwwwwwwwwwwwwwwwwwww
Summary:
Had some excitement with runners in production#3672 (closed) and production#3674 (closed) today, but both should hopefully be settled now. There will be some follow-up work for both, but the primary impact of each has been mitigated.
Edit: after further testing on production#3674 (closed), we ran into filesystem permission problems after adjusting the mount point for the temp directory that holds the local copy of the repo for builds of GitLab forks.
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
For production#3672 (closed) we updated the docker-machine versions on the runners in https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5033. This change was applied not only to the gitlab-shared-runners but to all runners, so if you see any (more) issues with runners, this should be a prime suspect.
We did see one issue after the above update, tracked in production#3674 (closed) and now mitigated. The fix required updating the temporary path into which repository contents are extracted before the volume is mounted and the container is launched (a build optimization for GitLab forks).
Ongoing alerts/incidents:
- production#3670 (closed) - 2021-02-20: Prometheus has no targets
- production#3669 (closed) - 2021-02-20: Prometheus has no targets
- production#3667 (closed) - 2021-02-20: SSL certificate for https://registry.pre.gitlab.com expires soon
- production#3662 (closed) - 2021-02-19: Thanos compaction has not run in 24 hours.
- production#3655 (closed) - 2021-02-18: Thanos compaction halted
- production#3642 (closed) - 2021-02-17: Elasticsearch indexing falling behind since 2021-02-16 11:00 UTC
- production#3633 (closed) - 2021-02-16: PrometheusUnreachable on prometheus-01-inf-db-benchmarking
Resolved actionable alerts:
- https://gitlab.pagerduty.com/incidents/PNVDNAY - [#37405] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416
- https://gitlab.pagerduty.com/incidents/PA2IZA9 - [#37406] Firing 1 - The Disk Space Utilization per Device per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.