Move workers off of quarantine shard

From the comment: &508 (comment 643434617) by @msmiley


I think it's probably reasonable to get rid of the quarantine shard by moving its two job classes to the low-urgency-cpu-bound shard:

  • authorized_project_update:authorized_project_update_user_refresh_from_replica
  • authorized_project_update:authorized_project_update_user_refresh_with_low_urgency

Context:

Prior to creating the quarantine shard, those two job classes had been (incorrectly) assigned to the catchall shard, which has a job concurrency of 15. Reassigning them to the hastily created quarantine shard protected all other job classes from being starved by these potentially CPU-hungry jobs. Moving them into one of the "cpu-bound" shards would carry a bit more risk than leaving them in the quarantine shard, but I think we can iteratively reduce that risk with the work we are sketching in &539. Specifically, the cpu-bound shards are a safer home for these jobs than the catchall shard was:

  • Either cpu-bound shard has a per-pod job concurrency of 5 (lower than the catchall shard's 15, so fewer jobs compete for CPU on each pod).
  • Either cpu-bound shard has other jobs that are also expected to be CPU-bound and are therefore hopefully already aiming to minimize the duration of their db transactions. (In contrast, the jobs running in the catchall shard are less likely to have had that kind of optimization effort, and would therefore be more likely to contribute to db connection pool saturation when being starved for CPU time.)
  • I suspect (but have not verified) that we use faster CPUs for the k8s nodes running our cpu-bound pods than for the k8s nodes running catchall pods. (We definitely use different node pools, but I have not reviewed the specs for the VMs in those node pools.)
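
To make the concurrency argument concrete, here is a rough sketch of the worst-case CPU share per job on one pod. It assumes a pod's CPU is split evenly among its concurrently running jobs and that both shards get the same per-pod CPU allocation. Neither assumption has been verified, and `ASSUMED_POD_CPU_CORES` and `worst_case_cpu_share` are hypothetical names used only for illustration. Only the concurrency values 15 and 5 come from the text above.

```python
# Hypothetical illustration: worst-case CPU share per job on a single pod.
# The concurrency values (15 and 5) come from the description above; the
# per-pod CPU allocation is an assumed placeholder, not a measured figure.


def worst_case_cpu_share(pod_cpu_cores: float, job_concurrency: int) -> float:
    """Cores available to each job when every concurrency slot runs a CPU-bound job."""
    if job_concurrency <= 0:
        raise ValueError("job_concurrency must be positive")
    return pod_cpu_cores / job_concurrency


# Assumption: both shards give each pod the same CPU allocation.
ASSUMED_POD_CPU_CORES = 1.0

catchall_share = worst_case_cpu_share(ASSUMED_POD_CPU_CORES, 15)
cpu_bound_share = worst_case_cpu_share(ASSUMED_POD_CPU_CORES, 5)

print(f"catchall:  {catchall_share:.3f} cores per job")
print(f"cpu-bound: {cpu_bound_share:.3f} cores per job")
print(f"ratio:     {cpu_bound_share / catchall_share:.1f}x")
```

Under these assumptions, a job in a cpu-bound shard gets about three times the worst-case CPU share it would get in the catchall shard. Faster CPUs on the cpu-bound node pool, if confirmed, would widen that gap.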

Edited by Rachel Nienaber