2019-10-24: Elevated CI job queue durations
Summary
We see elevated CI job queue durations and users are reporting stuck jobs.
Timeline
All times UTC.
2019-10-24
- 10:00 - Job queue durations start rising, rate of runner machine creation drops, shared runner queue starts rising
- 10:27 - 90th percentile job queue durations reach 1h, 50th percentile 1m
- 10:42 - reports from customer support: https://gitlab.slack.com/archives/C101F3796/p1571913724286000
- 11:15 - incident opened
- 11:31 - status.io update
- 14:25 - we stopped some badly behaving pipelines, after which shared runners started picking up more jobs again
- 18:37 - we identified and canceled some more problematic pipelines https://gitlab.slack.com/archives/C8HG8D9MY/p1571942240221200?thread_ts=1571916475.179600&cid=C8HG8D9MY
- 18:51 - ci-runners SLO alert fires again https://gitlab.pagerduty.com/incidents/P8QVAZT?utm_source=slack&utm_campaign=channel
- 20:11 - alert cleared
Edited by Alejandro Rodríguez