This issue is to track the optimization of CI queue times until they are reliably under 1 minute and infrastructure is comfortable alerting on a breach of that SLA.
To refine this a bit, it would be good to define the target percentile for the < 1 min queue time goal. Should this be max, p99, p95, etc.? It seems like a good starting point would be p95, and we could iterate from there.
@darbyfrey I agree with you on starting with a percentile. However, as far as actual expectations for runner times go, we may need to be even more specific and alert on timing based on the job being performed. I called out in the Google Doc that project exporter times run long. Using a percentile may spare us some noisy alerting, but 1 min is completely unreasonable for large projects.
@ansdval, could you link to the Google Doc you referenced here so we're all looking at the same thing?
Also, perhaps we need to clarify the specific queue time we are referring to. My understanding is that we're talking about the time a CI job sits in the queued state before it kicks off, so specifically the Ci::BuildPrepare worker, which wouldn't include project exporters, importers, etc. Does that sound right?
Here's a relevant snippet from the above issue's description:
https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1 shows the metrics in detail. It's been raised that certain (maybe even all?) background processing jobs should adhere to a 1-minute average latency, though at first glance this may not be possible for all jobs: project exports take substantially longer than others.
I can't comment on whether or not that sounds right, because I don't know enough about the workers. The fact that you can identify a worker by name indicates to me that you have a much better understanding of the architecture and workflow than I do.
The initial issue that was referenced was focused on the job_queue_duration_seconds_bucket metric in Prometheus, so I'd like to propose we focus on the p95 for that metric in this issue.
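For concreteness, the kind of query I have in mind would look roughly like the following. This is only a sketch: the 5-minute rate window and the label-free aggregation are assumptions, and the actual recording rules may slice by shard or environment.

```promql
# p95 CI job queue time, sketched from the job_queue_duration_seconds histogram.
# Aggregation labels and the rate window are assumptions, not the production rules.
histogram_quantile(
  0.95,
  sum by (le) (
    rate(job_queue_duration_seconds_bucket[5m])
  )
)
```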
Also, a change was deployed on 2019-09-30 to run the cron for scheduled jobs every 10 minutes instead of every 20 minutes. This is expected to reduce the spikes in queue times as well; we'll monitor the results.
@ansdval I have not been focused on CI job queue times, but it's important to understand that CI queue times (what this issue is about) and Sidekiq queue times (what I've been focused on) are not correlated. You certainly can't use Sidekiq queue latencies to measure CI job queue latencies.
FWIW (and not strictly related to this request), the p95 scheduling time for all CI Sidekiq jobs over the past 7 days is 120 milliseconds.
According to this, on weekdays about 95% of CI jobs start within 1 minute; on weekends it's only about 75%.
What's really strange about this graph is that it's the only graph I've ever seen at GitLab where performance during the week is better than performance on the weekend. For most of our other metrics, we see better performance on weekends on account of the relative overprovisioning at that time.
Obviously, with shared runners we have autoscaling in place, leading to fewer machines being available during the weekend and slower CI startup times, as machines need to be provisioned before jobs can run.
There's not much data yet, but here's how it looks:
The dotted red line shows the initial SLO, which I've set to 80% until we can improve the service.
This metric gives us a long-term SLO for the CI-runners service queuing time, which will be tracked in the SLO dashboard and will be included in the headline GitLab.com availability figure generated at https://dashboards.gitlab.net/d/general-slas/general-slas?orgId=1
FYI: @dawsmith, during the Infradev call, it was discussed that a good first step in reducing this latency would be to experiment with the autoscaling parameters that we use for GitLab CI.
If a user has between 0 and 4 jobs currently running on our shared CI runners (aka fair usage), the p95 for the CI queue time should be less than 1 minute.
This relies on built-in functionality in the CI scheduler, which progressively deprioritises jobs for projects that have more jobs currently executing.
The reason we chose 4 is that from an observability point of view, everything beyond 4 jobs is currently thrown into a single bucket. If we want to make this number bigger than 4, we need to make some (small) application changes.
This fair-usage strategy protects users running a reasonable number of jobs from users issuing 10000 simultaneous CI jobs, which would be very expensive to provision within a short period.
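To make the fair-usage scoping concrete, the SLI could be restricted to those jobs roughly as follows. This assumes the histogram carries a per-project concurrency label along the lines of jobs_running_for_project; the label name and its exact values are assumptions here.

```promql
# p95 queue time for "fair usage" jobs only: projects with at most 4 jobs
# already running on shared runners. Everything beyond 4 falls into a single
# bucket today, which is why the threshold is 4.
histogram_quantile(
  0.95,
  sum by (le) (
    rate(job_queue_duration_seconds_bucket{jobs_running_for_project=~"0|1|2|3|4"}[5m])
  )
)
```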
I've also opened https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8130 for the CI/CD infrastructure team to adjust the off-peak (weekend) base load of CI runners, which was set to 15 nodes several years ago and explains the poor SLI that we see over weekends.
@darbyfrey @erushton is there a list of backstage items that need to be scheduled to make this a reality? I'd be happy to schedule them (or possibly @DarrenEastman / @joshlambert, depending on whether the changes land in their area).
We had a call yesterday to discuss this and will be having a recurring call each week until this is resolved.
Outcomes of yesterday's call:
@andrewn is tuning the alerts so that they're not so noisy that they need muting, but that means they won't be quite as tight as we want them to be for the time being.
We're further investigating adjusting the autoscaling idle count that is in place on weekends, as it is out of date.
There are a lot of issues on the go in both the gitlab-com and gitlab-org groups; I'll be making an epic for each to help track them.
We also discussed that using an apdex measurement, as opposed to a raw queue time measurement from Prometheus, makes more sense.
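As a rough illustration of the apdex idea over the same histogram (the 60s satisfied / 300s tolerating thresholds and the 1-hour window below are placeholders rather than agreed values, and they assume those bucket boundaries exist):

```promql
# Apdex-style score: (satisfied + tolerating / 2) / total, using the cumulative
# histogram buckets. Jobs queued under 60s count as satisfied, under 300s as
# tolerating; both thresholds here are illustrative only.
(
    sum(rate(job_queue_duration_seconds_bucket{le="60"}[1h]))
  +
    sum(rate(job_queue_duration_seconds_bucket{le="300"}[1h]))
)
/ 2
/
sum(rate(job_queue_duration_seconds_count[1h]))
```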
@DarrenEastman as runner PM, I'm transferring this item to you. Please follow along and, as part of managing your product, ensure that shared runner fleet availability remains good and timely. Please keep me posted and let me know when this is closed out for good, and feel free to add it to our 1:1 agenda if you want to discuss further.
@dawsmith & @ansdval I've provisionally labelled this for the CI/CD & Enablement team, but @ansdval is directly mentioned in this issue, so feel free to reassign if it's incorrectly labelled.
After the work we've done over the last few months, we now have very stable and very predictable queue times for the CI jobs running on shared runners on GitLab.com. There is one predictable spike of about 4 minutes at 4am UTC every day; other than that, jobs spend at most a few seconds in the queue on average.
Deeper Examination
Let's start with a couple of quick graphs:
Here's what an average day looked like before we started this initiative. The graph spans a 24-hour window from 10:00 UTC to 09:59 UTC. There are spikes every hour or so, most of them peaking at 13-14 minutes.
Here's what an average day looks like now. The graph covers the same 24-hour time range as the one above. There is one spike at 4am UTC peaking at about 4 minutes; the other jittery little spikes are in the 2-3 second range.
The dashboard where this and other graphs are available is here
What we did to get here
Changed the GitLab.com scheduled job frequency from once an hour (at the 19th minute of each hour) to every 10 minutes, effectively dividing the scheduled job load by 6
Increased our “offPeak” idle count to match our regular idle count. The idle count is the number of “warm” VMs we keep on hand to assign CI jobs to. Due to the ~5-minute lag between making an API call to create a new VM and the time it’s ready, we must pre-provision them; the number of VMs we aim to have pre-provisioned and ready to go is the idle count. On weekends we switched to a lower off-peak number, which hadn’t been adjusted for growth in recent years; this value now matches the regular peak value (a rough config sketch follows this list)
Drastically increased the (regular) idle count. We now have many more idle machines ready to go at any given moment, which gives us an immediate buffer to absorb thundering herds or other large influxes of CI jobs
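For reference, a minimal sketch of the knobs involved, assuming the shared runners use the docker+machine executor's OffPeak settings in config.toml; every value below is illustrative rather than the production configuration:

```toml
# Fragment of a gitlab-runner config.toml (docker+machine executor).
# Values are made up for illustration, not the real GitLab.com settings.
[runners.machine]
  IdleCount = 100                           # "warm" VMs kept ready to take jobs
  IdleTime  = 1800                          # seconds an idle VM lives before removal
  OffPeakPeriods = ["* * * * * sat,sun *"]  # weekend window for the OffPeak values
  OffPeakIdleCount = 100                    # now matched to the regular IdleCount
  OffPeakIdleTime  = 1800
```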
The goals of this issue have been accomplished by the CI Queue Time Stabilization Working Group. That working group is in the process of closing down, so I will close this issue as well: !39195 (merged).