2019-11-27 Increased latency on API fleet

Summary

Immediately following a deploy to production, there was increased latency across the API fleet and an increase in database statement timeouts.
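As a hedged illustration of how the statement timeouts surface, the sketch below counts occurrences of Postgres's standard "canceling statement due to statement timeout" error message in a log excerpt. The sample log lines are fabricated for illustration; on a live host you would read the actual Postgres log instead.

```python
# Sketch: count Postgres statement-timeout errors in a log excerpt.
# "canceling statement due to statement timeout" is the standard Postgres
# error message; the sample lines below are fabricated for illustration.
sample_log = """\
2019-11-27 10:38:01 UTC ERROR:  canceling statement due to statement timeout
2019-11-27 10:38:05 UTC LOG:  checkpoint starting
2019-11-27 10:38:09 UTC ERROR:  canceling statement due to statement timeout
"""

timeouts = [line for line in sample_log.splitlines()
            if "canceling statement due to statement timeout" in line]
print(len(timeouts))  # 2
```

A rising count of these lines per minute is what drives the "increased db errors" panel linked below.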

Slow queries

https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s&fullscreen&panelId=1352

Screen_Shot_2019-11-27_at_4.53.25_PM

Increased db errors

https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&from=now-6h&to=now&refresh=30s&fullscreen&panelId=3

Screen_Shot_2019-11-27_at_4.56.32_PM

Queue backlog

https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s&fullscreen&panelId=5

Screen_Shot_2019-11-27_at_4.55.23_PM

Increased I/O wait on the API fleet

Screen_Shot_2019-11-27_at_4.52.16_PM
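For reference, I/O wait as charted above can be derived from the cumulative counters in Linux's /proc/stat. The sketch below computes the iowait share from a "cpu" line; the sample line is fabricated, and on a live host you would read /proc/stat directly.

```python
# Sketch: compute the iowait share of CPU time from a Linux /proc/stat
# "cpu" line (fields after "cpu": user nice system idle iowait irq softirq ...).
# The sample line is fabricated; on a live host, read open("/proc/stat").
cpu_line = "cpu 100 0 50 800 50 0 0 0 0 0"
fields = [int(v) for v in cpu_line.split()[1:]]
iowait_pct = 100 * fields[4] / sum(fields)  # iowait is the 5th field
print(f"{iowait_pct:.1f}% iowait")  # 5.0% iowait
```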

Increased writes on the shared NFS server:

Screen_Shot_2019-11-27_at_4.51.17_PM

More information will be added as we investigate the issue.

Timeline

All times UTC.

2019-11-27

  • 08:32 - Deployment started 12.6.201911262205-9e0448ab9ef.46555054fa5
  • 10:08 - Various API metrics fell out of line (latency and utilization)
  • 10:19 - Deployment finished
  • 10:30 - API latency SLO alert fired
  • 10:35 - Acknowledged by EOC
  • 10:38 - Postgres transaction timeout alert fired
  • 10:43 - Postgres transaction timeout alert auto-resolved
  • 11:07 - API latency SLO alert auto-resolved
  • 11:13 - API metrics returned to usual levels
  • 11:16 - API metrics fell out of line again
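The gaps between timeline entries can be computed directly from the timestamps above; this sketch does the arithmetic for two of the key intervals (event names are ad-hoc labels, not from any alerting system).

```python
from datetime import datetime

# Timestamps (UTC) taken from the timeline above; labels are illustrative.
events = {
    "metrics_out_of_line": "10:08",
    "latency_alert": "10:30",
    "eoc_ack": "10:35",
}

def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    fmt = "%H:%M"
    delta = datetime.strptime(events[end], fmt) - datetime.strptime(events[start], fmt)
    return int(delta.total_seconds() // 60)

print(minutes_between("metrics_out_of_line", "latency_alert"))  # 22
print(minutes_between("latency_alert", "eoc_ack"))              # 5
```

So roughly 22 minutes passed between the first metric deviation and the SLO alert firing, and 5 more until the EOC acknowledged it.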
Edited Nov 27, 2019 by John Jarvis