Introduce additional database table to improve builds queuing mechanisms
## Description This epic describes an idea of introducing an additional database table, as an intermediate solution, that will accelerate builds queuing mechanisms. ## Progress Current state: We enabled new mechanisms using pending builds table on production and [saw really good results](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908). The epic is now resolved by we continue working shipping the new mechanisms on-premises. The major risk for gitlab.com related to the builds queuing query has been mitigated. **We achieved the main goal of not using `ci_builds` table to retrieve a queue of builds for GitLab Runners**. Performance degradations related to subtransactions locking have been resolved. | Iteration | Issue / Merge request | Status / ETA | DRI | | ------ | ------ | ------ | ------ | | :white_check_mark: Introduce accelerated table | https://gitlab.com/gitlab-org/gitlab/-/issues/329764, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/61581 | In production | @grzesiek | | :white_check_mark: Rollout of the accelerated table to production | https://gitlab.com/gitlab-org/gitlab/-/issues/331496 | In production | @grzesiek | | :white_check_mark: Enable accelerated table | https://gitlab.com/gitlab-org/gitlab/-/issues/331417, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/62195 | In production | @grzesiek | | :white_check_mark: Rollout of accelerated table join with builds | https://gitlab.com/gitlab-org/gitlab/-/issues/332950 | [Deferred](https://gitlab.com/gitlab-org/gitlab/-/issues/332950#note_596907610) | @grzesiek | | :white_check_mark: Introduce shared runner builds table | https://gitlab.com/gitlab-org/gitlab/-/issues/329765, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/62912 | In production | @grzesiek | | :white_check_mark: Add new migration helpers for data migration in batches | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/63519 | In production | @grzesiek | | :white_check_mark: Add protected pending build column and migrate data | https://gitlab.com/gitlab-org/gitlab/-/issues/333111, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/63759 | In production | @grzesiek | | :white_check_mark: Fix the name of the protected pending builds partial index | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/63880 | In production | @grzesiek | | :white_check_mark: Remove `ci_register_job_service_one_by_one` feature flag | https://gitlab.com/gitlab-org/gitlab/-/issues/333673, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64121 | In production | @morefice | | :white_check_mark: Rollout of running builds table | https://gitlab.com/gitlab-org/gitlab/-/issues/333538 | In production | @grzesiek | | :white_check_mark: Add factories for pending / running build models | https://gitlab.com/gitlab-org/gitlab/-/issues/333537, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64223 | In production | @morefice | | :white_check_mark: Remove `ci_builds` from the builds queuing query | https://gitlab.com/gitlab-org/gitlab/-/issues/333114, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64093 | In production | @grzesiek | | :white_check_mark: Use materialized CTE to improve builds fair scheduling performance | https://gitlab.com/gitlab-org/gitlab/-/issues/334248, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64605 | In production | @grzesiek | | :white_check_mark: Drop index_ci_builds_on_protected | https://gitlab.com/gitlab-org/gitlab/-/issues/333814, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64229 | In production | @morefice | | :white_check_mark: Rollout of pending builds table source | https://gitlab.com/gitlab-org/gitlab/-/issues/334255 | In production | @grzesiek | | :white_check_mark: Add documentation for new migration helper | https://gitlab.com/gitlab-org/gitlab/-/issues/333258, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64248 | In Production | @morefice | | :white_check_mark: Split queuing disaster recovery feature flag into two | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/65237 |In production | @grzesiek | | :white_check_mark: Instrument `OverwriteProjectService` | https://gitlab.com/gitlab-org/gitlab/-/issues/334638, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/65237 | In production | @grzesiek | | :white_check_mark: Instrument `ApplicationRecord` subtransactions | https://gitlab.com/gitlab-org/gitlab/-/issues/334638, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66477 | In production | @grzesiek | | :white_check_mark: Add long running transactions sampler to PostgreSQL exporter | https://gitlab.com/gitlab-cookbooks/gitlab-exporters/-/merge_requests/233 | In production | @grzesiek | | :white_check_mark: Production Change Request to sample `pg_stat_activity` on the primary database |https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5200 | Done | @cmcfarland, @grzesiek | | :white_check_mark: Denormalize CI minutes checking | https://gitlab.com/gitlab-org/gitlab/-/issues/329763, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64443 | In production | @morefice | | :white_check_mark: Enable CI minutes denormalization | https://gitlab.com/gitlab-org/gitlab/-/issues/338290 | [Done](https://gitlab.com/gitlab-org/gitlab/-/issues/338290#note_667164777) | @morefice | | :white_check_mark: Denormalize shared runners matching | https://gitlab.com/gitlab-org/gitlab/-/issues/329762, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64644 | In production | @morefice | | :white_check_mark: Enable shared runners denormalization | https://gitlab.com/gitlab-org/gitlab/-/issues/338289 | Done | @morefice | | :white_check_mark: Prevent new index creation for `ci_builds` | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/65582 | In production | @morefice | | :white_check_mark: Denormalize tags matching | https://gitlab.com/gitlab-org/gitlab/-/issues/330734, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/65648 | In production | @morefice | | :white_check_mark: `StuckCiJobsWorker` does not drop all stuck pending builds | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68880 | In production | @mbobin | | :white_check_mark: Enable tags denormalization | https://gitlab.com/gitlab-org/gitlab/-/issues/338366 | [Done](https://gitlab.com/gitlab-org/gitlab/-/issues/338363#note_666027343) | @morefice | | :white_check_mark: Prevent pipeline creation for builds with +50 tags | https://gitlab.com/gitlab-org/gitlab/-/issues/338479, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68380 | In production | @morefice | | :white_check_mark: Refresh CI minutes cache after updating limits | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/69443 | In production | @fabiopitino | | :white_check_mark: Update Ci::PendingBuild when shared runners is toggled | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/69119 | In production | @morefice | | :white_check_mark: Investigate why there are failed builds in the pending builds table | https://gitlab.com/gitlab-org/gitlab/-/issues/339745 | Done | @mbobin | | :white_check_mark: Queuing queries status check and review | https://gitlab.com/gitlab-org/gitlab/-/issues/340499 | Done | @grzesiek | | :white_check_mark: Drop ci_pending_build when calling `Ci::Build#doom!` | https://gitlab.com/gitlab-org/gitlab/-/issues/340305, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/69796 | In production | @morefice | | :white_check_mark: Enable build tags limit by default | https://gitlab.com/gitlab-org/gitlab/-/issues/339855, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/69506 | In production | @morefice | | :white_check_mark: Decouple group runners queuing query from namespaces traversal | https://gitlab.com/gitlab-org/gitlab/-/issues/334531, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70162 | In production | @grzesiek | | :white_check_mark: Add a scope for pending builds tags | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70214 | In production | @morefice | | :white_check_mark: Decouple project runners queuing query from projects table | https://gitlab.com/gitlab-org/gitlab/-/issues/340583, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70415 | In production | @mbobin | | :white_check_mark: Check project settings when assigning a build to a runner | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70581 | In production | @mbobin | | :white_check_mark: Update `namespace_id` when project gets transferred | https://gitlab.com/gitlab-org/gitlab/-/issues/340597, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70296 | In production | @morefice | | :white_check_mark: Update pending builds data when group runners settings are toggled | https://gitlab.com/gitlab-org/gitlab/-/issues/340925, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70691 | In production | @grzesiek | | :white_check_mark: Remove builds from the queue when builds can not be processed | https://gitlab.com/gitlab-org/gitlab/-/issues/335866, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70581 | In production | @mbobin | | :white_check_mark: Convert project_id column to bigint to increase planning time performance | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70966 | In production | @mbobin | | :white_check_mark: Group runner denormalized queuing query rollout | https://gitlab.com/gitlab-org/gitlab/-/issues/340930 | [Done](https://gitlab.com/gitlab-org/gitlab/-/issues/340930#note_685438576) | @grzesiek | | :white_check_mark: Disallow database cross-joins in RegisterJobService tests | https://gitlab.com/gitlab-org/gitlab/-/issues/339429, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/71545 | In production | @grzesiek | | :white_check_mark: Migrate pending builds on-permises | https://gitlab.com/gitlab-org/gitlab/-/issues/340585 | [Done](https://gitlab.com/groups/gitlab-org/-/epics/6954) | @mbobin | | :white_check_mark: Improve indexes in `ci_builds` table | https://gitlab.com/gitlab-org/gitlab/-/issues/333811, https://gitlab.com/gitlab-org/gitlab/-/issues/332953 | https://gitlab.com/groups/gitlab-org/-/epics/6954 | | | :white_check_mark: On-premises rollout of the new queuing mechanism | https://gitlab.com/gitlab-org/gitlab/-/issues/332952 | [Done](https://gitlab.com/groups/gitlab-org/-/epics/6954) | @mbobin | **The table above evolves as we iterate on it, because every iteration delivers feedback we use to adjust our strategy and trajectory.** ## Problem <details> Currently we have very expensive query run on top of `ci_builds` where we - look for matching projects - look for pending builds - match tags - match other filters - look at quota This is a problem: - `ci_builds` is expensive to access: this is very wide table that often times out - `ci_builds` cannot be partitioned as otherwise we would not be able to fetch all jobs - for accessing `tags` we cross-join another table `taggings` - for accessing quota we cross-join `project/namespace` - we check access level based on `project/namespace` ## Proposal As a way to accelerate filtering: 1. Introduce `ci_pending_builds` table 1. Denormalize data that we get executing the big queuing query and insert it to the new table before placing a build in the queue 1. `JOIN` with the new table and incrementally reduce the complexity of the big query executed against `ci_builds` table 1. Apply the [Strangler Pattern](https://martinfowler.com/bliki/StranglerFigApplication.html) and move forward with small changes in sequence, evaluating their impact on the system health and performance. 1. When nothing is left is the query using the `ci_builds` table, move entirely to queuing using `ci_pending_builds` table. This would allow us to: - Still load data using `ci_builds` for the purpose of accepting build, but the filtering should be significantly faster and provide more capacity - This would allow us to make `ci_builds` partitioned without breaking queueing - Table would consist as much data as possible to perform build matching: at least `tags`, `protected`, `project_id`, and whatever else is needed - Insert build to table on status transition to `pending` as part of state machine - Delete item from table on status transition from `pending` as part of state machine - Change `RegisterJobService` to filter using `ci_pending_builds` instead of `ci_builds` - We assume that queries would have a significantly lower cost, as we would have much easier and cheaper to access data, and be able to hold this pending queue in memory of postgres for quick filtering </details> **Progress: 100%**
epic