On-Call Handover 2021-02-22 07:00 UTC

Brought to you by the Slack slash command: /sre-oncall handover

EOC egress: @craig
EOC ingress: @igorwwwwwwwwwwwwwwwwwwww

Summary:

Had some excitement with runners in production#3672 (closed) and production#3674 (closed) today, but ~~both~~ the former should hopefully be settled now. There will be some follow-up work for both, but the primary impacts from each have been mitigated.

Edit: after further testing on production#3674 (closed), we ran into filesystem permission problems after adjusting the mount point for the temp directory that stores the local copy of the repo for builds of GitLab forks.

What (if any) time-critical work is being handed over?

What contextual info may be useful for the next few on-call shifts?

For production#3672 (closed) we updated the docker-machine versions on the runners in https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5033. This change was applied to all runners, not only the gitlab-shared-runners, so if you see any (more) issues with runners, this should be a prime suspect.

We did see one issue after the above update, noted in production#3674 (closed) and now mitigated. The fix required updating the temporary path into which repository contents are extracted before the volume is mounted and the container is launched (a build optimization for GitLab forks).

Ongoing alerts/incidents:

production#3670 (closed) - 2021-02-20: Prometheus has no targets
production#3669 (closed) - 2021-02-20: Prometheus has no targets
production#3667 (closed) - 2021-02-20: SSL certificate for https://registry.pre.gitlab.com expires soon
production#3662 (closed) - 2021-02-19: Thanos compaction has not run in 24 hours.
production#3655 (closed) - 2021-02-18: Thanos compaction halted
production#3642 (closed) - 2021-02-17: Elasticsearch indexing falling behind since 2021-02-16 11:00 UTC
production#3633 (closed) - 2021-02-16: PrometheusUnreachable on prometheus-01-inf-db-benchmarking

Resolved actionable alerts:

https://gitlab.pagerduty.com/incidents/PNVDNAY - [#37405] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416
https://gitlab.pagerduty.com/incidents/PA2IZA9 - [#37406] Firing 1 - The Disk Space Utilization per Device per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.

Unactionable alerts:

Resolved production incidents:

Change issues:

Edited Feb 22, 2021 by Craig Barrett