- `ci_builds` is expensive to access: it is a very wide table, and queries against it often time out
- `ci_builds` cannot currently be partitioned, as otherwise we would not be able to fetch all jobs
- to access tags we cross-join another table, `taggings`
- to access quota we cross-join project/namespace
- we check the access level based on project/namespace
As a way to accelerate filtering:

- Introduce a `ci_pending_builds` table
- Design the table so we would not have to load `ci_builds` (a very wide table) as part of the `RegisterJobService` filtering query (see the sketches after this list)
- We would still load `ci_builds` for the purpose of accepting a build, but the filtering should be significantly faster and provide more capacity
- This would allow us to partition `ci_builds` without breaking queueing
- The table would contain as much data as needed to perform build matching: at least tags, protected, `project_id`, and whatever else is needed
- Insert a build into the table on the status transition to pending, as part of the state machine
- Delete the item from the table on the status transition from pending, as part of the state machine
- Change `RegisterJobService` to filter using `ci_pending_builds` instead of `ci_builds`
- We assume that queries would have a significantly lower cost, as the data would be much easier and cheaper to access, and Postgres would be able to hold this pending queue in memory for quick filtering
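
A minimal sketch of what this could look like, assuming a Rails migration and the state machine hooks GitLab already uses; the `Ci::PendingBuild` model name, the exact columns, and the callback shape are illustrative assumptions, not a final design:

```ruby
# Narrow table holding only the data needed for matching; columns beyond
# build_id/project_id/protected/tags are assumptions about "whatever else
# is needed".
class CreateCiPendingBuilds < ActiveRecord::Migration[6.0]
  def change
    create_table :ci_pending_builds do |t|
      t.references :build, null: false, index: { unique: true }
      t.references :project, null: false
      t.boolean :protected, null: false, default: false
      t.string :tags, array: true, null: false, default: []
    end
  end
end

# Keeping the table in sync from the Ci::Build state machine (hypothetical
# hooks; the real ones would live next to the existing transitions):
state_machine :status do
  after_transition to: :pending do |build|
    Ci::PendingBuild.create!(
      build_id: build.id,
      project_id: build.project_id,
      protected: build.protected?,
      tags: build.tag_list.to_a
    )
  end

  after_transition from: :pending do |build|
    Ci::PendingBuild.where(build_id: build.id).delete_all
  end
end
```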
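And a sketch of the changed `RegisterJobService` filtering step: match on the narrow table first, and only then touch the wide `ci_builds` rows for the resulting candidates. The method name and `runner.ref_protected?` helper are assumptions here; tag matching is sketched separately below:

```ruby
# Hypothetical filtering step: the narrow table answers "which builds
# could this runner take?", the wide table is loaded only for those
# candidates.
def pending_builds_for(runner)
  candidates = Ci::PendingBuild.all
  # Builds that require a protected runner stay invisible to other runners.
  candidates = candidates.where(protected: false) unless runner.ref_protected?

  # Only now touch the wide ci_builds table, restricted to the candidates.
  Ci::Build.where(id: candidates.select(:build_id)).order(:id)
end
```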
This acceleration structure is proposed as a follow-up to gitlab-com/gl-infra/production#3712 (closed). If designed properly, this could be used for all future work on queueing as well.
This can be an easy way to improve performance today without spending a lot of effort on it: a potentially throw-away solution with little impact on the codebase (hopefully).
Kamil Trzciński changed title from "Optimise job picking" to "Introduce additional DB table (acceleration structure) to optimise job queueing"
Kamil Trzciński changed the description
Kamil Trzciński changed title from "Introduce additional DB table (acceleration structure) to optimise job queueing" to "Introduce additional DB table (acceleration structure) to optimise job queueing (as an intermediate solution for better queueing)"
The query cost was reduced from 51823820 to 6453633, roughly an 8x improvement!
The query is faster by around 100ms, from 469ms to 364ms. However, this was measured with no load, so the values will be significantly different on a live system.
More fields (these should be pretty constant):

- `tags` (unsure in what form; is there some clever trick to do this matching? see the sketch after this list)
- `protected`: a flag to indicate whether the build needs to use a protected runner
- `runs_on_shared` (or `is_shared`), or anything else that allows us to indicate that
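
One possible trick for the tag matching: with `tags` stored as a Postgres array, the containment operator `<@` expresses "every tag the build requires is offered by the runner" in a single indexable condition. A sketch, assuming the array column from the migration above:

```ruby
class Ci::PendingBuild < ApplicationRecord
  # A build matches when its tag array is fully contained in the
  # runner's tag list.
  scope :matching_tags, ->(runner_tags) do
    # Note: an untagged runner would pass an empty array; that edge
    # case needs explicit handling.
    where('ci_pending_builds.tags <@ ARRAY[?]::varchar[]', runner_tags)
  end
end

# Usage sketch:
# Ci::PendingBuild.matching_tags(runner.tag_list.to_a)
```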
All of that can be even more optimised by (iteratively):

- The new table allows us to remove more parts of the big query, step by step, and precalculate them as part of `ci_pending_builds`
- We could precalculate the namespace quota: either remove entries from `ci_pending_builds`, or have an `is_shared` flag (set when a build needs to be executed by shared runners)
- We could also fold the project visibility and its eligibility for being picked by shared runners into `is_shared`
- If a project runs out of quota or changes visibility, we could update all of its builds to have `is_shared` set to false, and reset it if the project still has quota/visibility (see the sketch after this list)
- We could store tags as an array, to be able to quickly bitmask them against our list of tags
- The `ci_builds` table could then be partitioned, as we would only depend on `ci_pending/running_builds` being in memory for quick build matching
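
A sketch of the precalculated `is_shared` flag being flipped in bulk when a project runs out of (or regains) quota or changes visibility; the trigger point and method name are assumptions:

```ruby
# Flip shared-runner eligibility for all pending builds of a project in
# one statement, instead of recomputing quota inside the queueing query.
def update_shared_runner_eligibility!(project, eligible:)
  Ci::PendingBuild
    .where(project_id: project.id)
    .update_all(is_shared: eligible)
end

# Shared runners then filter on a single indexed boolean:
# Ci::PendingBuild.where(is_shared: true)
```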
Especially if we could somehow do it as a drop-in replacement for the current system, and just evaluate its impact with minimal effort. I would expect this to make a massive difference for GitLab, just because of the much more compact data structure.
@ayufan the proposed solution sounds like a nice approach to me, as long as we don't have any information in `ci_builds_pending` that may change (which would force us to maintain the consistency of the materialized table) and we only have to update it in the cases you mention:

- Insert build to table on status transition to pending as part of state machine
- Delete item from table on status transition from pending as part of state machine
The table and indexes would get pretty bloated, but the table would stay comparatively small, and I assume that autovacuum and our re-indexing processes should be able to handle that bloat (others can correct me on this statement).
Other than that, we would have to add a couple of additional indexes, but that's a minor implementation discussion (see the sketch below).
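
For illustration only, the kind of indexes that discussion might land on (the real set depends on the final query shape): a GIN index to support the `<@` tag containment, plus a btree index for the boolean filters:

```ruby
class AddIndexesToCiPendingBuilds < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def change
    # GIN supports the array containment operator (<@) used for tags.
    add_index :ci_pending_builds, :tags, using: :gin, algorithm: :concurrently
    # Cheap filter for protected builds, scoped by project.
    add_index :ci_pending_builds, [:protected, :project_id],
              algorithm: :concurrently
  end
end
```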
Kamil Trzciński changed title from "Introduce additional DB table (acceleration structure) to optimise job queueing (as an intermediate solution for better queueing)" to "Idea to consider: Introduce additional DB table (acceleration structure) to optimise job queueing (as an intermediate solution for better queueing)"
Kamil Trzciński changed the description