Change project_export background worker queue times

The project_export worker has very poor queueing latency characteristics, as shown in this graph of the last 24 hours:

Currently it uses urgency:default, which gives it up to 1 minute to queue. However, jobs in this queue frequently wait over 10 minutes.

Something else to keep in mind: most of the project exports (particularly the queueing ones) are kicked off from scheduled jobs at predictable times.

We should decrease the urgency to throttled (aka none).

Pro: SLO adherence
Con: this queue could run away (e.g. waiting times of days) and we wouldn't get notified, since multi-day queue times are acceptable with urgency:none
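
To make the trade-off concrete, here is a minimal sketch in Python of how a queueing SLO check behaves under the two urgencies discussed here. The mapping and the function name `violates_queue_slo` are hypothetical, for illustration only, and are not taken from the GitLab codebase; the only values used are the ones stated in this issue (1 minute for default, no SLO for throttled).

```python
from typing import Dict, Optional

# Queueing SLO per urgency, in seconds. None means "no queueing SLO".
# Hypothetical mapping: only the two urgencies mentioned in this issue.
QUEUE_SLO_SECONDS: Dict[str, Optional[float]] = {
    "default": 60.0,    # up to 1 minute to queue
    "throttled": None,  # aka none: any queue time is acceptable
}


def violates_queue_slo(urgency: str, queued_seconds: float) -> bool:
    """Return True if a job waited longer than its urgency allows."""
    slo = QUEUE_SLO_SECONDS[urgency]
    return slo is not None and queued_seconds > slo


ten_minutes = 10 * 60
three_days = 3 * 24 * 60 * 60

print(violates_queue_slo("default", ten_minutes))    # True
print(violates_queue_slo("throttled", ten_minutes))  # False
print(violates_queue_slo("throttled", three_days))   # False
```

The last line is the con above: with no SLO, a multi-day wait never counts as a violation, so nothing would alert on it.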

In gitlab-com/runbooks!2044 (merged) we're adding alerting for the dequeueing of throttled jobs.

Original proposals:

These are the options I can think of to tackle this:

1. Ramp up more workers, possibly in the new k8s deployment of project_export
   - Pro: SLO adherence (potentially)
   - Pro: lower waiting times for project export jobs
   - Con: more concurrent export nodes/pods, more expense and more pressure on backend services
2. Decrease the urgency to throttled (aka none).
   - Pro: SLO adherence
   - Con: this queue could run away (e.g. waiting times of days) and we wouldn't get notified, since multi-day queue times are acceptable with urgency:none
3. Introduce a third urgency of very_low with a queue SLO of 1 hour
   - Pro: SLO adherence
   - Pro: runaway queue times would still be alerted on
   - Con: small amount of development work
   - Con: would need more buckets in the Prometheus histogram, so extra cardinality
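
The histogram cost of a very_low urgency can be sketched roughly as follows, in Python. The existing bucket bounds and the count of 50 label combinations are hypothetical, since the real histogram configuration is not shown in this issue. The sketch relies on one general property of Prometheus histograms: the share of observations under a threshold can only be read exactly when a bucket boundary sits at that threshold, and each extra bucket adds one time series per label combination.

```python
from typing import List

# Hypothetical bucket upper bounds in seconds; the real histogram may differ.
EXISTING_BUCKETS: List[float] = [1.0, 10.0, 60.0]
VERY_LOW_SLO_SECONDS = 60 * 60  # proposed 1 hour queueing SLO


def with_slo_bucket(buckets: List[float], slo_seconds: float) -> List[float]:
    """Return sorted bucket bounds that include slo_seconds as a boundary."""
    return sorted(set(buckets) | {float(slo_seconds)})


def fraction_within(bounds: List[float], cumulative_counts: List[int],
                    total: int, slo_seconds: float) -> float:
    """Share of jobs queued for at most slo_seconds.

    Needs a bucket exactly at slo_seconds; raises ValueError otherwise.
    """
    index = bounds.index(float(slo_seconds))
    return cumulative_counts[index] / total


def added_series(old_bounds: List[float], new_bounds: List[float],
                 label_combinations: int) -> int:
    """Extra time series: one per new bucket per label combination."""
    return (len(new_bounds) - len(old_bounds)) * label_combinations


new_buckets = with_slo_bucket(EXISTING_BUCKETS, VERY_LOW_SLO_SECONDS)
print(new_buckets)                                      # [1.0, 10.0, 60.0, 3600.0]
print(added_series(EXISTING_BUCKETS, new_buckets, 50))  # 50
```

In practice a few intermediate buckets between 1 minute and 1 hour might also be wanted, which would multiply the extra series accordingly.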

All-in-all, I think option 3 would be the best option. Are there other queues that would suit this long (but finite) queuing SLO?

Edited Mar 24, 2020 by Bob Van Landuyt