Change project_export background worker queue times

Project export dashboard: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?var-queue=project_export

The project_export worker has very poor queueing latency characteristics, as shown in this graph of the last 24 hours:

[image: graph of project_export queueing latency over the last 24 hours]

Currently it uses urgency:default, which allows up to 1 minute of queueing time. However, jobs in this queue frequently wait more than 10 minutes.

Something else to keep in mind: most of the project exports (particularly the ones that end up queueing) are kicked off from scheduled jobs at predictable times.

We should decrease the urgency to throttled (aka none).

  1. Pro: SLO adherence
  2. Con: this queue could run away (e.g. waiting times of days) and we wouldn't get notified, since multi-day queue times are acceptable with urgency:none
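
To make the trade-off concrete, here is a minimal Python sketch of how a queueing SLO check behaves under each urgency. The 1 minute limit for default and the absence of a limit for throttled come from the description above; the dictionary, the function name, and the sample wait times are hypothetical and are not GitLab's actual implementation.

```python
from typing import Dict, Optional

# Queueing SLO in seconds per urgency. None means "no queueing SLO".
# Hypothetical table for illustration only, not GitLab's configuration.
QUEUE_SLO_SECONDS: Dict[str, Optional[float]] = {
    "default": 60.0,    # up to 1 minute to queue
    "throttled": None,  # aka none: any queue time is acceptable
}


def violates_queue_slo(urgency: str, queued_seconds: float) -> bool:
    """Return True if a job that waited queued_seconds breaks its urgency's SLO."""
    slo = QUEUE_SLO_SECONDS[urgency]
    if slo is None:
        # No SLO: even multi-day waits never count as a violation.
        return False
    return queued_seconds > slo


ten_minutes = 10 * 60
three_days = 3 * 24 * 60 * 60

default_violated = violates_queue_slo("default", ten_minutes)
throttled_violated = violates_queue_slo("throttled", ten_minutes)
runaway_noticed = violates_queue_slo("throttled", three_days)
```

With a 10 minute wait, the default urgency reports a violation while throttled does not, and a multi-day wait under throttled also goes unreported. That is the pro and the con listed above.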

In gitlab-com/runbooks!2044 (merged) we're adding alerting for the dequeueing of throttled jobs.


Original proposals:

These are the options I can think of to tackle this:
  1. Ramp up more workers, possibly in the new k8s deployment of project_export

    1. Pro: SLO adherence (potentially)
    2. Pro: lower waiting times for project export jobs
    3. Con: more concurrent export nodes/pods, more expense and more pressure on backend services
  2. Decrease the urgency to throttled (aka none).

    1. Pro: SLO adherence
    2. Con: this queue could run away (e.g. waiting times of days) and we wouldn't get notified, since multi-day queue times are acceptable with urgency:none
  3. Introduce a third urgency of very_low with a queue SLO of 1 hour

    1. Pro: SLO adherence
    2. Pro: run-away queue times would still be alerted on
    3. Con: small amount of development work
    4. Con: would need more buckets in the Prometheus histogram, so extra cardinality

All in all, I think option 3 would be the best choice. Are there other queues that would suit this long (but finite) queueing SLO?
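
As a rough illustration of the cardinality cost in option 3: a ratio such as "jobs dequeued within 1 hour" can only be read directly from a Prometheus histogram if one of its bucket upper bounds (`le`) equals the SLO threshold. The Python sketch below assumes a hypothetical set of existing bucket bounds (the real ones are not given here) and made-up counts, and shows the extra bucket a 1 hour SLO would need.

```python
from typing import Dict, List

# Hypothetical existing bucket upper bounds, in seconds (assumption, not the real config).
EXISTING_BUCKETS: List[float] = [1.0, 10.0, 60.0, 300.0]

# Queueing SLO proposed for the very_low urgency in option 3: 1 hour.
VERY_LOW_SLO_SECONDS = 3600.0


def buckets_for_slo(buckets: List[float], slo_seconds: float) -> List[float]:
    """Return sorted bucket bounds that include the SLO threshold as a boundary."""
    return sorted(set(buckets) | {slo_seconds})


def fraction_within_slo(
    cumulative_counts: Dict[float, int], total: int, slo_seconds: float
) -> float:
    """Share of jobs dequeued within the SLO, read from cumulative bucket counts."""
    if slo_seconds not in cumulative_counts:
        raise KeyError("no bucket boundary at the SLO threshold")
    return cumulative_counts[slo_seconds] / total


new_buckets = buckets_for_slo(EXISTING_BUCKETS, VERY_LOW_SLO_SECONDS)

# Made-up cumulative counts for 100 jobs, keyed by bucket upper bound.
sample_counts = {60.0: 20, 300.0: 50, 3600.0: 95}
within_slo = fraction_within_slo(sample_counts, 100, VERY_LOW_SLO_SECONDS)
```

Under these assumptions, a single extra bucket per labelled series is enough to measure the 1 hour SLO; how much cardinality that adds depends on how many label combinations the real histogram has.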

Edited by Bob Van Landuyt