2019-11-27 Increased latency on API fleet
Summary
Immediately following the deploy to production there was an increased latency across the api fleet and an increase in db statement timeouts.
Slow queries
https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s&fullscreen&panelId=1352
Increased db errors
Queue backlog
https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s&fullscreen&panelId=5
Increase in IO wait on api
Increased writes on the share nfs server:
More information will be added as we investigate the issue.
Timeline
All times UTC.
2019-11-27
- 08:32 - Deployment started 12.6.201911262205-9e0448ab9ef.46555054fa5
- 10:08 - Various API metrics fall out of line (latency and utilization)
- 10:19 - Deployment finished
- 10:30 - API latency SLO alert fired
- 10:35 - Acknowledged by EOC
- 10:38 - Postgres transaction timeout alert fired
- 10:43 - Postgres transaction timeout alert resolved itself
- 11:07 - API latency SLO alert resolved itself
- 11:13 - API metrics returned to usual levels
- 11:16 - API metrics fall out of line again
Edited by John Jarvis