On-Call Handover 2021-03-18 15:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @igorwwwwwwwwwwwwwwwwwwww
- EOC ingress: @nnelson
Summary:
What (if any) time-critical work is being handed over?
production#4011 (closed) - it's recovering now but needs some attention.
What contextual info may be useful for the next few on-call shifts?
- We’re caught up on WALs
- We are in a mixed state with redis-sidekiq right now
  - production#3913 (closed)
  - 2 nodes are running redis 6
  - the master is running redis 5
  - we will want to failover to redis 6 and upgrade the last node soon.
    - once that other incident has cooled off.
Ongoing alerts/incidents:
- production#3994 (closed) - 2021-03-17: feature flag propagation inconsistent
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3987 - 2021-03-16 rate of repository_update_remote_mirror Sidekiq jobs significantly lower since the database latency incident
- production#3881 (closed) - 2021-03-09 : Postgres Disk I/O issues
- production#3810 (closed) - 2021-02-25: CF 502/520 when importing large projects
Resolved actionable alerts:
- https://gitlab.pagerduty.com/incidents/PP9INP0 - [#39228] Firing 1 - Hot spot tuple fetches on the postgres %(postgresLocation)s in the packages_packages table, packages_packages.
- https://gitlab.pagerduty.com/incidents/POODV5H - [#39230] Firing 1 - Redis Switch Master
- https://gitlab.pagerduty.com/incidents/PYBTBAY - [#39231] Firing 1 - Hot spot tuple fetches on the postgres %(postgresLocation)s in the packages_packages table, packages_packages.
- https://gitlab.pagerduty.com/incidents/PZI24KX - [#39235] Firing 1 - The Redis Primary CPU Utilization per Node resource of the redis-sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- https://gitlab.pagerduty.com/incidents/PYL9ES1 - [#39237] Firing 1 - Hot spot tuple fetches on the postgres primary patroni-03-db-gprd.c.gitlab-production.internal in the namespaces table, namespaces.
- https://gitlab.pagerduty.com/incidents/P7UFBNS - [#39239] Firing 1 - The Puma Worker Saturation per Node resource of the web service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.