2019-08-20 Elevated web latency
Summary
Starting at 2019-08-20 13:15 UTC, we experienced minor degradation of GitLab.com web and API requests, similar to the events of August 14. This is currently an S3 incident: the degradation dropped our Apdex scores below normal, but kept them just above our SLO threshold for degraded service. We started investigating because the event resembled last week's. At 14:00 UTC we engaged the infra-dev escalation process to bring in backend engineers.
As of 2019-08-20 14:27 UTC, backend engineers, infrastructure, and OnGres are discussing the behavior we are seeing on the one of the six read-only databases that is showing issues.
As of 2019-08-20 15:15 UTC, latencies are back at normal levels.
Web Apdex just above the degraded SLO line
https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&from=now-3h&to=now
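For context, an Apdex score is computed as (satisfied + tolerating/2) / total requests against a latency threshold. A minimal sketch of the calculation (the threshold values below are illustrative, not GitLab's actual SLO settings):

```python
def apdex(latencies_ms, satisfied_ms=100, tolerating_ms=400):
    """Apdex = (satisfied + tolerating/2) / total samples."""
    satisfied = sum(1 for l in latencies_ms if l <= satisfied_ms)
    tolerating = sum(1 for l in latencies_ms if satisfied_ms < l <= tolerating_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# Example: 6 fast, 2 tolerable, 2 slow requests
print(apdex([50, 60, 70, 80, 90, 95, 150, 300, 900, 1200]))  # → 0.7
```

A sustained drop in this ratio below the SLO line is what pages the on-call engineer.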

Current primary db: patroni-04-db-gprd.c.gitlab-production.internal
Replicas: patroni-0{2,3,5,6,7,8}
replica.patroni.service.consul. 0 IN A 10.220.16.107
replica.patroni.service.consul. 0 IN A 10.220.16.102
replica.patroni.service.consul. 0 IN A 10.220.16.108
replica.patroni.service.consul. 0 IN A 10.220.16.106
replica.patroni.service.consul. 0 IN A 10.220.16.105
replica.patroni.service.consul. 0 IN A 10.220.16.103
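Consul serves these A records with a TTL of 0, so clients re-resolve on each connection and load is spread across the six replicas. A minimal client-side sketch of that selection (IPs hard-coded from the records above; real clients rely on the resolver's per-query record ordering):

```python
import random

# The six replica addresses returned by replica.patroni.service.consul
REPLICAS = [
    "10.220.16.107", "10.220.16.102", "10.220.16.108",
    "10.220.16.106", "10.220.16.105", "10.220.16.103",
]

def pick_replica(replicas=REPLICAS):
    # Consul shuffles record order per query; a random choice approximates that.
    return random.choice(replicas)

print(pick_replica())
```

This is why a single slow replica degrades roughly one sixth of read traffic rather than all of it.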
Timeline
2019-08-20
- 13:14 UTC - alert fires for elevated web latencies
- 13:30 UTC - Decided to investigate high usage on one of the read-only Postgres hosts.
- 13:40 UTC - Tried taking patroni-08 out of the rotation, and the queued client
- 13:52 UTC - Paged a backend engineer, who joined the investigation.
- 13:55 UTC - Investigating why one read-only replica is building a queue of requests.
- 14:06 UTC - Testing a pool size change on patroni-08 from 100 to 50.
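The pool size referenced here is presumably a connection pooler setting in front of Postgres; assuming pgbouncer sits in front of the Patroni nodes (an assumption for this sketch), the change would be a fragment like:

```ini
; pgbouncer.ini on patroni-08 (hypothetical fragment)
; reduce the per-database server connection pool from 100 to 50
[pgbouncer]
default_pool_size = 50
```

A smaller pool limits concurrent server-side queries on the struggling replica, trading some client queueing in the pooler for less contention on the database itself.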