2019-10-15 High latency for ci-runners

Summary

Incident resolved.

Total Time degraded:

08:47 UTC to 18:20 UTC - 9h33m

CI latency is poor and a large backlog of jobs is building up. This appears to be caused by a quota change on GCP's Compute API that is adversely affecting GitLab. The drop in performance coincides with a separate quota increase request, which we theorize may be related to this drop. The quota was previously 12000 and was dropped to 2000. We are working with Google support to expedite the restoration of the previous quota value.

Screen_Shot_2019-10-15_at_3.13.20_PM

Note: the quota limit in this graph only reflects the current value, not the historical limit, which was previously set to 12000.
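
To put the quota figures in perspective, here is a minimal Python sketch of the arithmetic. It uses only numbers quoted in this report (previous quota 12000, reduced quota 2000, observed request rate of around 10k per 100s). It assumes the quota is counted over the same 100-second window as the request rate, which this report does not state explicitly. The function names are illustrative and not part of any GitLab or GCP tooling.

```python
# Figures quoted in this incident report.
# ASSUMPTION: the quota uses the same 100-second window as the request rate.
WINDOW_SECONDS = 100
PREVIOUS_QUOTA = 12_000   # requests per window before the change
REDUCED_QUOTA = 2_000     # requests per window during the incident
OBSERVED_DEMAND = 10_000  # approximate requests per window before the drop


def per_second(requests_per_window, window=WINDOW_SECONDS):
    """Convert a per-window request count into requests per second."""
    return requests_per_window / window


def throttled_fraction(demand, quota):
    """Share of demanded requests that exceed the quota (0.0 if none do)."""
    return max(0.0, (demand - quota) / demand)


if __name__ == "__main__":
    print("Reduced quota, req/s:", per_second(REDUCED_QUOTA))
    print("Demand, req/s:", per_second(OBSERVED_DEMAND))
    print("Throttled under reduced quota:",
          throttled_fraction(OBSERVED_DEMAND, REDUCED_QUOTA))
    print("Throttled under previous quota:",
          throttled_fraction(OBSERVED_DEMAND, PREVIOUS_QUOTA))
```

Under these assumptions, roughly four out of five requests at the earlier demand level would exceed the reduced quota, while the previous quota left headroom above demand.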

Timeline

All times UTC

2019-10-15

  • 08:30 - The GCE API error rate in our CI project sharply increases.
  • 09:30 - By this time, the backlog of pending jobs is larger than normal.
  • 10:53 - SRE on-call becomes aware of the issue.
  • 11:00 - We notice that our GCE API request rate suddenly plunged at around 09:30, from around 10k requests per 100s to 2k, the quota limit.
  • 11:10 - We contact Google support and ask them to raise the quota.
  • 14:00 - Latest reply from Google support: "We are now on the last stage of evaluating your quota increase request. Rest assured that we are working on it to be applied the soonest time possible."
  • 14:45 - Initial notification that API quotas have been raised; verifying.
  • 16:45 - Continuing to monitor recovery of pending jobs queue - 50%
  • 18:16 - Initial Recovery posted.
  • 18:20 - System appears recovered.
  • 18:48 - Issue/incident closed.

Post Incident Review started on https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8176

Edited Oct 15, 2019 by Dave Smith