2019-08-20 Elevated web latency
Summary
Starting at 2019-08-20 13:15 UTC, we experienced minor degradation of GitLab.com web and API requests, similar to the events of August 14. This is currently an S3 incident: the degradation dropped our Apdex scores below normal, but kept them just above our SLO threshold for degraded service. We started investigating because the event resembled last week's. At 14:00 UTC we engaged the infra-dev escalation process to bring in backend engineers.
As of 2019-08-20 14:27 UTC, backend engineers, infrastructure, and OnGres are discussing the behavior we are seeing on the one of the six read-only databases that is showing issues.
As of 2019-08-20 15:15 UTC, latencies are back at normal levels.
Web Apdex just above the degraded SLO line
https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&from=now-3h&to=now
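For context, an Apdex score is computed as (satisfied + tolerating/2) / total requests against a latency threshold. A minimal sketch of the calculation (the threshold values below are illustrative, not GitLab's actual SLO settings):

```python
def apdex(latencies_ms, satisfied_ms=100, tolerating_ms=400):
    """Apdex = (satisfied + tolerating/2) / total samples."""
    satisfied = sum(1 for l in latencies_ms if l <= satisfied_ms)
    tolerating = sum(1 for l in latencies_ms if satisfied_ms < l <= tolerating_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# Example: 6 fast, 2 tolerable, 2 slow requests
print(apdex([50, 60, 70, 80, 90, 95, 150, 300, 900, 1200]))  # → 0.7
```

A sustained drop in this ratio below the SLO line is what pages the on-call engineer.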

Current primary db: patroni-04-db-gprd.c.gitlab-production.internal
Replicas: patroni-0{2,3,5,6,7,8}
replica.patroni.service.consul. 0 IN A 10.220.16.107
replica.patroni.service.consul. 0 IN A 10.220.16.102
replica.patroni.service.consul. 0 IN A 10.220.16.108
replica.patroni.service.consul. 0 IN A 10.220.16.106
replica.patroni.service.consul. 0 IN A 10.220.16.105
replica.patroni.service.consul. 0 IN A 10.220.16.103
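Consul serves these A records with a TTL of 0, so clients re-resolve on each connection and load is spread across the six replicas. A minimal client-side sketch of that selection (IPs hard-coded from the records above; real clients rely on the resolver's per-query record ordering):

```python
import random

# The six replica addresses returned by replica.patroni.service.consul
REPLICAS = [
    "10.220.16.107", "10.220.16.102", "10.220.16.108",
    "10.220.16.106", "10.220.16.105", "10.220.16.103",
]

def pick_replica(replicas=REPLICAS):
    # Consul shuffles record order per query; a random choice approximates that.
    return random.choice(replicas)

print(pick_replica())
```

This is why a single slow replica degrades roughly one sixth of read traffic rather than all of it.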
Timeline
2019-08-20
- 13:14 UTC - alert fires for elevated web latencies
- 13:30 UTC - Decided to investigate high usage on one of the read-only Postgres hosts.
- 13:40 UTC - Tried taking patroni-08 out of the rotation, and the queued client
- 13:52 UTC - Paged a backend engineer, who joined the investigation.
- 13:55 UTC - Investigating why one read-only replica is building a queue of requests.
- 14:06 UTC - Testing a pool size change on patroni-08 from 100 to 50.
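The pool size referenced here is presumably a connection pooler setting in front of Postgres; assuming pgbouncer sits in front of the Patroni nodes (an assumption for this sketch), the change would be a fragment like:

```ini
; pgbouncer.ini on patroni-08 (hypothetical fragment)
; reduce the per-database server connection pool from 100 to 50
[pgbouncer]
default_pool_size = 50
```

A smaller pool limits concurrent server-side queries on the struggling replica, trading some client queueing in the pooler for less contention on the database itself.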