On-Call Handover 2020-10-24 15:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @hphilipps
- EOC ingress: @cmcfarland
Summary:
All alerts below were caused by a single S1 incident: the automatic DB reindexing cron job, which was activated tonight to run on weekends, produced a bad query plan that brought down all of GitLab. We fixed it by running ANALYSE on that index, and we disabled the reindex cron feature flag to prevent future issues.
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#2778 (closed) - 2020-09-30: Some Git Pull Mirroring Jobs not being executed
Resolved actionable alerts:
- https://gitlab.pagerduty.com/incidents/PGMYM66 - [#29792] Firing 3 - IncreasedErrorRateOtherBackends
- https://gitlab.pagerduty.com/incidents/PIPO23D - [#29793] Pingdom check check:https://gitlab.com/projects/new is down
- https://gitlab.pagerduty.com/incidents/P67L3YN - [#29794] Firing 37 - IncreasedServerResponseErrors
- https://gitlab.pagerduty.com/incidents/P444OGV - [#29796] Pingdom check check:https://gitlab.com/ is down
- https://gitlab.pagerduty.com/incidents/PDICCPN - [#29797] Firing 17 - IncreasedBackendConnectionErrors
- https://gitlab.pagerduty.com/incidents/P6KNA9Y - [#29800] Firing 14 - BlackboxProbeFailures
- https://gitlab.pagerduty.com/incidents/PSLUQZY - [#29803] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down
- https://gitlab.pagerduty.com/incidents/PDF8FKI - [#29804] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down
- https://gitlab.pagerduty.com/incidents/P3ERQ0A - [#29809] Firing 1 - The Puma Worker Saturation per Node resource of the git service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- https://gitlab.pagerduty.com/incidents/PFS1OJ4 - [#29810] Firing 1 - The Puma Worker Saturation per Node resource of the web service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- https://gitlab.pagerduty.com/incidents/P32J6LY - [#29811] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- https://gitlab.pagerduty.com/incidents/PEG3R34 - [#29829] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down
- https://gitlab.pagerduty.com/incidents/PCNWA3S - [#29830] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down
- https://gitlab.pagerduty.com/incidents/PWW97GU - [#29831] Pingdom check check:https://gitlab.com/ is down
- https://gitlab.pagerduty.com/incidents/PC2B2DK - [#29832] Firing 1 - Increased Server Response Errors
- https://gitlab.pagerduty.com/incidents/PZTLBWQ - [#29833] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down
Unactionable alerts:
Resolved production incidents:
- production#2886 (closed) - 2020-10-24: Increased backend errors