2019-11-27 Increased latency on API fleet

Summary

Immediately following a deploy to production, there was increased latency across the API fleet and an increase in database statement timeouts.
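As a hedged illustration of how the statement timeouts surface, the sketch below counts occurrences of Postgres's standard "canceling statement due to statement timeout" error message in a log excerpt. The sample log lines are fabricated for illustration; on a live host you would read the actual Postgres log instead.

```python
# Sketch: count Postgres statement-timeout errors in a log excerpt.
# "canceling statement due to statement timeout" is the standard Postgres
# error message; the sample lines below are fabricated for illustration.
sample_log = """\
2019-11-27 10:38:01 UTC ERROR:  canceling statement due to statement timeout
2019-11-27 10:38:05 UTC LOG:  checkpoint starting
2019-11-27 10:38:09 UTC ERROR:  canceling statement due to statement timeout
"""

timeouts = [line for line in sample_log.splitlines()
            if "canceling statement due to statement timeout" in line]
print(len(timeouts))  # 2
```

A rising count of these lines per minute is what drives the "increased db errors" panel linked below.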

Slow queries

https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s&fullscreen&panelId=1352

Screen_Shot_2019-11-27_at_4.53.25_PM

Increased db errors

https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&from=now-6h&to=now&refresh=30s&fullscreen&panelId=3

Screen_Shot_2019-11-27_at_4.56.32_PM

Queue backlog

https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s&fullscreen&panelId=5

Screen_Shot_2019-11-27_at_4.55.23_PM

Increased I/O wait on the API fleet

Screen_Shot_2019-11-27_at_4.52.16_PM
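For reference, I/O wait as charted above can be derived from the cumulative counters in Linux's /proc/stat. The sketch below computes the iowait share from a "cpu" line; the sample line is fabricated, and on a live host you would read /proc/stat directly.

```python
# Sketch: compute the iowait share of CPU time from a Linux /proc/stat
# "cpu" line (fields after "cpu": user nice system idle iowait irq softirq ...).
# The sample line is fabricated; on a live host, read open("/proc/stat").
cpu_line = "cpu 100 0 50 800 50 0 0 0 0 0"
fields = [int(v) for v in cpu_line.split()[1:]]
iowait_pct = 100 * fields[4] / sum(fields)  # iowait is the 5th field
print(f"{iowait_pct:.1f}% iowait")  # 5.0% iowait
```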

Increased writes on the shared NFS server:

Screen_Shot_2019-11-27_at_4.51.17_PM

More information will be added as we investigate the issue.

Timeline

All times UTC.

2019-11-27

  • 08:32 - Deployment started 12.6.201911262205-9e0448ab9ef.46555054fa5
  • 10:08 - Various API metrics fell out of line (latency and utilization)
  • 10:19 - Deployment finished
  • 10:30 - API latency SLO alert fired
  • 10:35 - Acknowledged by EOC
  • 10:38 - Postgres transaction timeout alert fired
  • 10:43 - Postgres transaction timeout alert auto-resolved
  • 11:07 - API latency SLO alert auto-resolved
  • 11:13 - API metrics returned to usual levels
  • 11:16 - API metrics fell out of line again
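The gaps between timeline entries can be computed directly from the timestamps above; this sketch does the arithmetic for two of the key intervals (event names are ad-hoc labels, not from any alerting system).

```python
from datetime import datetime

# Timestamps (UTC) taken from the timeline above; labels are illustrative.
events = {
    "metrics_out_of_line": "10:08",
    "latency_alert": "10:30",
    "eoc_ack": "10:35",
}

def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    fmt = "%H:%M"
    delta = datetime.strptime(events[end], fmt) - datetime.strptime(events[start], fmt)
    return int(delta.total_seconds() // 60)

print(minutes_between("metrics_out_of_line", "latency_alert"))  # 22
print(minutes_between("latency_alert", "eoc_ack"))              # 5
```

So roughly 22 minutes passed between the first metric deviation and the SLO alert firing, and 5 more until the EOC acknowledged it.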
Edited Nov 27, 2019 by John Jarvis