Change project_export background worker queue times
Project export dashboard: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?var-queue=project_export
The `project_export` worker has very poor queueing latency characteristics, as shown in this graph of the last 24 hours:
Currently it uses `urgency:default`, which gives it up to 1 minute to queue. However, jobs in this queue frequently wait more than 10 minutes.
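To make the gap concrete, here is a minimal sketch of how adherence to a queueing SLO could be computed from observed queue times. The 60-second threshold comes from the description above; the function name and the sample durations are hypothetical and are not taken from GitLab's monitoring code.

```python
from typing import Sequence

# From the issue text: urgency:default allows up to 1 minute of queueing.
DEFAULT_URGENCY_QUEUE_SLO_SECONDS = 60.0


def queue_slo_adherence(
    queue_durations_s: Sequence[float],
    threshold_s: float = DEFAULT_URGENCY_QUEUE_SLO_SECONDS,
) -> float:
    """Return the fraction of jobs that waited no longer than threshold_s."""
    if not queue_durations_s:
        # No jobs means nothing violated the SLO.
        return 1.0
    within = sum(1 for d in queue_durations_s if d <= threshold_s)
    return within / len(queue_durations_s)


# Hypothetical sample: two quick jobs and two that waited over 10 minutes.
sample = [5.0, 45.0, 610.0, 720.0]
print(f"adherence: {queue_slo_adherence(sample):.0%}")
```

With a sample like this, half of the jobs miss the 1-minute target, which is the kind of result the dashboard suggests for this queue.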
Something else to keep in mind: most of the project exports (particularly the queueing ones) are kicked off from scheduled jobs at predictable times.
We should decrease the urgency to `throttled` (aka `none`).
- Pro: SLO adherence
- Con: this queue could run away (e.g. waiting times of days) and we wouldn't get notified, since multi-day queue times are acceptable with `urgency:none`
In gitlab-com/runbooks!2044 (merged) we're adding alerting for the dequeueing of throttled jobs.
Original proposals:
1. Ramp up more workers, possibly in the new k8s deployment of `project_export`
   - Pro: SLO adherence (potentially)
   - Pro: lower waiting times for project export jobs
   - Con: more concurrent export nodes/pods, more expense and more pressure on backend services
2. Decrease the `urgency` to `throttled` (aka `none`).
   - Pro: SLO adherence
   - Con: this queue could run away (e.g. waiting times of days) and we wouldn't get notified, since multi-day queue times are acceptable with `urgency:none`
3. Introduce a third urgency of `very_low` with a queue SLO of 1 hour
   - Pro: SLO adherence
   - Pro: runaway queue times would still be alerted on
   - Con: small amount of development work
   - Con: would need more buckets in the Prometheus histogram, so extra cardinality
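The trade-off in the third proposal can be sketched roughly as follows. The 1-minute and 1-hour thresholds come from this issue; the bucket boundaries, the dictionary, and both helper functions are hypothetical illustrations, not the actual runbooks or application configuration.

```python
from bisect import bisect_left
from typing import Dict, List, Optional

# Queueing SLO per urgency, in seconds. None means no queueing SLO at all.
QUEUE_SLO_SECONDS: Dict[str, Optional[float]] = {
    "default": 60.0,     # from the issue: up to 1 minute to queue
    "very_low": 3600.0,  # proposed in the issue: 1 hour
    "none": None,        # aka throttled: multi-day waits are acceptable
}

# Hypothetical histogram upper bounds (the "le" label values), in seconds.
EXISTING_BUCKETS: List[float] = [1.0, 10.0, 60.0]
# Hypothetical extra bounds needed to observe a 1-hour SLO.
EXTRA_BUCKETS: List[float] = [300.0, 1800.0, 3600.0]


def breaches_slo(urgency: str, queue_s: float) -> bool:
    """True if a job that waited queue_s seconds violates its urgency's SLO."""
    slo = QUEUE_SLO_SECONDS[urgency]
    return slo is not None and queue_s > slo


def bucket_for(queue_s: float, bounds: List[float]) -> str:
    """Return the upper bound of the smallest bucket containing queue_s."""
    i = bisect_left(bounds, queue_s)  # bounds must be sorted ascending
    return str(bounds[i]) if i < len(bounds) else "+Inf"


wait = 720.0  # a 12-minute wait
print(breaches_slo("default", wait), breaches_slo("very_low", wait))
print(bucket_for(wait, EXISTING_BUCKETS))
print(bucket_for(wait, EXISTING_BUCKETS + EXTRA_BUCKETS))
```

Without the extra bounds, every wait longer than the largest existing bucket lands in the overflow bucket, so a 12-minute wait and a 2-day wait are indistinguishable. Each added bound costs one more time series per label combination, which is the cardinality cost noted above.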
All in all, I think option 3 would be the best. Are there other queues that would suit this long (but finite) queueing SLO?