Investigate and Improve queue implementation (aka 10x Initiative) (#125) · Epics · GitLab.com

Investigate and Improve queue implementation (aka 10x Initiative)

The current architecture for our queueing implementation is a FIFO model, which can lead to issues related to the noisy neighbor problem. There are parallel efforts ongoing to fix the immediate issues; this epic is focused on long-term architectural solutions.

##### Areas of investigation:

* Observability - attaching metadata to jobs
* Fault isolation - one process or customer doesn't affect others
* Scalability
* Prioritization
* Job management - kill, pause, restart
* Throttling

@craig-gomes is the current DRI for this Epic.

##### Areas of Focus:

* Observability - a big key to determining the high water mark of our goals (2x? 10x?) is to add observability to our current processes, so we can determine where we are at peak and where we have room to improve. The issues and links below will help to inform direction.
  * Per Andrew's [note](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7246#note_201203928), Sidekiq is peaking at 600 jobs/sec. The estimated saturation point is 15,000 jobs/sec.
  * Introduce Sidekiq middleware to discover CPU-intensive jobs - https://gitlab.com/gitlab-org/gitlab-ce/issues/65390
* Reduction of unnecessary work
  * Deduplication of jobs has yielded good performance improvements. One area of investigation is CI Pipelines, as these are the most commonly duplicated jobs.
  * Use MergeRequest.by_commit_shas to find which MRs to update on push - https://gitlab.com/gitlab-org/gitlab-ce/issues/53213
* Sidekiq namespaces for isolation of noisy neighbors.
* Make jobs idempotent. An example is listed in https://gitlab.com/gitlab-org/gitlab-ce/issues/33774
* Identify where it makes sense to split off work into its own (micro)services.
  * GitLab CI/CD service daemon - https://gitlab.com/gitlab-org/gitlab-ce/issues/37695
  * Break UpdateMergeRequestsWorker/MergeRequests::RefreshService into separate workers and services - https://gitlab.com/gitlab-org/gitlab-ce/issues/53215
* Improve Sidekiq scheduling with more granular control within a queue
  * Modify the Sidekiq fetcher. It is currently just FIFO; consider using a different primitive such as sorted sets.
  * Investigate re-queueing - https://medium.com/@kenzan100/sidekiq-dynamic-re-queueing-72b77c3cba73
  * Rate limiting - if a user has generated many jobs, perhaps prevent the internal API from allowing the push for some time.

##### Related Infra issues:

* Proposal to simplify worker queues - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219
* Proposal to adopt RabbitMQ - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7246
  * Out of scope for consideration since we are not yet approaching job saturation. See @andrewn's notes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7246#note_201203928
* Review PGBouncer Pool and configuration - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7403
* Address `besteffort` slowdown - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7442
* Epic - Improve Redis cache scalability - https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/80

##### Envelope (Known Metrics)

* Sidekiq is peaking at 600 jobs/sec. The estimated saturation point is 15,000 jobs/sec.

##### Related Epics

* Rearchitect project import/export - https://gitlab.com/groups/gitlab-org/-/epics/1810
  * Goals:
    * > 50% memory reduction
    * > 2x speed improvement on import/export
* Improved usage of Sidekiq - https://gitlab.com/groups/gitlab-org/-/epics/1855
* Background Processing Improvements - https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/96
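
##### Sketch: sorted-set ordering instead of FIFO

The "Modify Sidekiq fetcher" item under Areas of Focus suggests replacing plain FIFO with a primitive such as sorted sets. The following is only a minimal sketch of that idea, not a proposal for the actual fetcher. It uses an in-memory stand-in for a Redis sorted set (the real commands would be `ZADD` and `ZPOPMIN`), and every name in it (`SortedSetQueue`, `FairEnqueuer`, the customer labels) is hypothetical. The scoring rule, where each job is scored by how many jobs its customer has already pushed, is an assumption chosen to illustrate the noisy neighbor case.

```ruby
# In-memory stand-in for a Redis sorted set: the entry with the lowest
# score is popped first. The insertion sequence is used as a tiebreaker
# here; that is an assumption of this sketch, not Redis behaviour.
class SortedSetQueue
  def initialize
    @entries = []
    @sequence = 0
  end

  def add(job, score)
    @sequence += 1
    @entries << [score, @sequence, job]
  end

  # Returns the lowest-scored job, or nil when the queue is empty.
  def pop_min
    entry = @entries.min_by { |score, sequence, _job| [score, sequence] }
    return nil if entry.nil?

    @entries.delete(entry)
    entry.last
  end
end

# Hypothetical enqueuer: a job's score is the number of jobs its customer
# has pushed so far, so a customer's Nth job waits behind everyone else's
# first N-1 jobs. A real version would also decrement or decay the counter
# as jobs complete.
class FairEnqueuer
  def initialize(queue)
    @queue = queue
    @pushed = Hash.new(0)
  end

  def push(customer, name)
    @pushed[customer] += 1
    @queue.add({ customer: customer, name: name }, @pushed[customer])
  end
end

queue = SortedSetQueue.new
enqueuer = FairEnqueuer.new(queue)
3.times { |i| enqueuer.push("noisy", "pipeline-#{i}") }
enqueuer.push("quiet", "refresh")

while (job = queue.pop_min)
  puts "#{job[:customer]}:#{job[:name]}"
end
# prints: noisy:pipeline-0, quiet:refresh, noisy:pipeline-1, noisy:pipeline-2
```

Under plain FIFO the `quiet` job would run last, behind all three `noisy` jobs. With the score-based ordering it runs second.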
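
##### Sketch: middleware that flags CPU-intensive jobs

The Observability item under Areas of Focus mentions introducing Sidekiq middleware to discover CPU-intensive jobs. As a rough illustration only, the class below follows the shape of a Sidekiq server middleware (`call(worker, job, queue)` wrapping a `yield`) and records CPU time per job. It is invoked by hand here so that the example runs without the Sidekiq gem. The class name, the `sink` argument, the threshold, and the `burn_cpu` helper are all assumptions made for this sketch; the linked issue may take a different approach.

```ruby
# Hypothetical middleware: measures CPU time consumed while a job runs and
# appends a sample to `sink` (any object responding to <<). A production
# version would more likely emit a log line or a metric.
class CpuTimeMiddleware
  # Prefer the per-thread CPU clock where the platform provides it, since
  # Sidekiq runs several jobs per process; otherwise fall back to the
  # process-wide CPU clock.
  CLOCK = if Process.const_defined?(:CLOCK_THREAD_CPUTIME_ID)
            Process::CLOCK_THREAD_CPUTIME_ID
          else
            Process::CLOCK_PROCESS_CPUTIME_ID
          end

  def initialize(sink, threshold_seconds: 0.5)
    @sink = sink
    @threshold_seconds = threshold_seconds
  end

  def call(_worker, job, queue)
    started = Process.clock_gettime(CLOCK)
    yield
  ensure
    # Record a sample even if the job raised.
    cpu_seconds = Process.clock_gettime(CLOCK) - started
    @sink << {
      class: job["class"],
      queue: queue,
      cpu_seconds: cpu_seconds,
      cpu_intensive: cpu_seconds >= @threshold_seconds
    }
  end
end

# Helper for the demo: spin until the given amount of CPU time is used.
def burn_cpu(seconds)
  started = Process.clock_gettime(CpuTimeMiddleware::CLOCK)
  nil while Process.clock_gettime(CpuTimeMiddleware::CLOCK) - started < seconds
end

samples = []
middleware = CpuTimeMiddleware.new(samples, threshold_seconds: 0.1)
middleware.call(nil, { "class" => "BusyWorker" }, "default") { burn_cpu(0.15) }
middleware.call(nil, { "class" => "IdleWorker" }, "default") { sleep 0.2 }

samples.each do |sample|
  puts format("%-10s cpu=%.3fs intensive=%s",
              sample[:class], sample[:cpu_seconds], sample[:cpu_intensive])
end
```

Measuring CPU time rather than wall-clock time is what separates the two demo jobs: the sleeping job takes longer overall but uses almost no CPU, so only the busy job is flagged.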
