FY20-Q4 OKR: Get CI queue times (p95) under 1 min
This issue tracks development work to optimize CI queue times until they are reliably under 1 min (p95) and infrastructure is comfortable alerting on breaches of that SLA:
- Based on what I’ve found, the initial concern here was related to large spikes in CI queue time every hour due to scheduled jobs all being enqueued at the same time: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7814
- Two changes are currently being tested to mitigate this problem: gitlab-com/gl-infra/production#1136 (closed)
- Test 1: Adjust warm pool for shared runner managers - https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1784
- Test 2: Adjust cron schedule for pipeline_schedule_worker from ‘19 * * * *’ - https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1806
The early results of these changes are promising, but the tests are still in progress and some additional evaluation remains to be done: gitlab-com/gl-infra/production#1136 (comment 218753472)
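To make the reasoning behind the two tests concrete, here is a toy model (not the actual runner or scheduler logic) of why a burst of simultaneously enqueued jobs inflates p95 queue time, and why either more ready capacity or a more spread-out enqueue pattern could bring it down. The pool sizes, job duration, and arrival patterns are hypothetical numbers chosen for illustration only; they are not measurements from GitLab.com.

```python
import heapq


def queue_waits(arrivals, runners, job_seconds):
    """Per-job queue wait (seconds) for FIFO jobs on a fixed pool of runners.

    Toy assumption: every job takes the same time and the pool size is fixed.
    """
    free_at = [0.0] * runners  # min-heap: time at which each runner is free
    heapq.heapify(free_at)
    waits = []
    for arrival in sorted(arrivals):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + job_seconds)
    return waits


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Hypothetical workloads: 100 scheduled jobs per hour.
burst = [0.0] * 100                      # all enqueued at the same instant
spread = [i * 36.0 for i in range(100)]  # same jobs spread evenly over the hour

SLA_SECONDS = 60

for label, arrivals, runners in [
    ("burst, 20 runners", burst, 20),
    ("burst, 100 runners", burst, 100),   # larger ready pool
    ("spread, 20 runners", spread, 20),   # smoother enqueue pattern
]:
    value = p95(queue_waits(arrivals, runners, job_seconds=120))
    status = "ok" if value < SLA_SECONDS else "BREACH"
    print(f"{label}: p95 wait = {value:.0f}s ({status})")
```

In this sketch only the first scenario breaches the 1 min target. The real system has autoscaling, variable job durations, and multiple runner managers, so the model only shows the direction of the effect, not its size.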
Updated 2019-09-30 to include a graph with the metric that will be used to measure this outcome.
CC @clefelhocz1 @ansdval @glopezfernandez
Additional context from product scaling agenda at https://docs.google.com/document/d/1nMJzrDfG7C14WP5v7P226oPFuXkwqIk7bdIT8ai0DNU/edit?ts=5d84fb07&skip_itp2_check=true&pli=1#bookmark=id.acbz08dge98p
Edited by Jason Yavorska