Iteratively re-architect our queue implementation to create 10x headroom
@andrewn @stanhu @ayufan I gather our queue implementation in the application is at a breaking point and needs to be re-architected (the relational analogy Andrew and I talked about is using tables, rather than records in a table, to data-model job scheduling). Until that happens, we're vulnerable to downtime on GitLab.com (especially in job scheduling) due to increased usage, CI abuse, or CI misuse.
I'm not being too prescriptive because you're all senior technical leaders in Engineering and have a better sense of the solution than I do, as well as a sense of whether the goal itself should be re-framed from what I've written here.
This should be a drop-everything priority exercise. It's a perfect candidate for Q3 OKRs (though of course we don't want to wait until EOQ to see results). Please let us know how we need to mobilize either SREs from Infrastructure or engineers from Development (rapid action, availability & performance grooming, etc.).
This issue exists for the sake of our management board, but please link to other artifacts/issues as needed.
CC @clefelhocz1 @glopezfernandez @ansdval @marin @Finotto @dawsmith
Linked epic from Infrastructure that will have a big impact on these efforts: Background Processing Improvements - gitlab-com/gl-infra&96 (closed)