Introduce an additional DB table (acceleration structure) to optimise job queueing (as an intermediate solution for better queueing)

Currently we run a very expensive query on top of ci_builds (see gitlab-com/gl-infra/production#3712 (closed)), in which we:

  • look for matching projects
  • look for pending builds
  • match tags
  • match other filters
  • look at quota
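
To make the filtering steps above concrete, here is a minimal in-memory sketch in Python. It is not the actual query: the `Build` and `Runner` classes, their fields, and the `minutes_left` mapping are hypothetical names chosen for illustration, and the tag rule (every tag on the build must be present on the runner) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Build:
    # Hypothetical, narrow view of a ci_builds row.
    id: int
    project_id: int
    status: str
    tags: frozenset = frozenset()
    protected: bool = False


@dataclass
class Runner:
    # Hypothetical description of what a runner is allowed to pick up.
    project_ids: frozenset
    tags: frozenset = frozenset()
    run_untagged: bool = True
    protected_only: bool = False


def matches(build, runner, minutes_left):
    """Return True if the runner may pick up the build.

    minutes_left maps project_id -> remaining quota (assumed shape).
    """
    if build.project_id not in runner.project_ids:   # matching projects
        return False
    if build.status != "pending":                    # pending builds only
        return False
    if build.tags:
        if not build.tags <= runner.tags:            # match tags
            return False
    elif not runner.run_untagged:
        return False
    if runner.protected_only and not build.protected:  # other filters
        return False
    if minutes_left.get(build.project_id, 0) <= 0:   # quota
        return False
    return True
```

Each check corresponds to one bullet above; in the real query each of them reads from ci_builds or from a table joined to it.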

This is a problem:

  • ci_builds is expensive to access: it is a very wide table, and queries against it often time out
  • ci_builds cannot be partitioned, as otherwise we would not be able to fetch all jobs
  • to access tags we cross-join another table, taggings
  • to access quota we cross-join project/namespace
  • we check the access level based on project/namespace

As a way to accelerate filtering:

  • Introduce ci_pending_builds table
  • Design the table so that the filtering query in RegisterJobService does not have to load ci_builds (a very wide table)
  • We would still load ci_builds for the purpose of accepting a build, but the filtering should be significantly faster and provide more capacity
  • This would allow us to partition ci_builds without breaking queueing
  • The table would contain as much data as possible to perform build matching: at least tags, protected, project_id, and whatever else is needed
  • Insert the build into the table on the status transition to pending, as part of the state machine
  • Delete the item from the table on the status transition from pending, as part of the state machine
  • Change RegisterJobService to filter using ci_pending_builds instead of ci_builds
  • We assume that queries would have a significantly lower cost, as the data would be much easier and cheaper to access, and Postgres would be able to hold this pending queue in memory for quick filtering
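
The proposal above can be sketched end to end with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions, not the proposed implementation: the column set of ci_pending_builds, the comma-separated `tags` encoding, and the helper names `connect`, `transition` and `pending_build_ids` are all hypothetical, and tag matching is done in Python here for brevity.

```python
import sqlite3

# Hypothetical schema: a narrow ci_builds stand-in plus the proposed table.
SCHEMA = """
CREATE TABLE ci_builds (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    status TEXT NOT NULL,
    protected INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE ci_pending_builds (
    build_id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    protected INTEGER NOT NULL DEFAULT 0,
    tags TEXT NOT NULL DEFAULT ''  -- comma-separated, assumed encoding
);
"""


def connect():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def transition(conn, build_id, new_status, tags=()):
    """Change a build's status and keep ci_pending_builds in sync."""
    with conn:  # one transaction: status change and queue update together
        project_id, protected, old_status = conn.execute(
            "SELECT project_id, protected, status FROM ci_builds WHERE id = ?",
            (build_id,),
        ).fetchone()
        conn.execute(
            "UPDATE ci_builds SET status = ? WHERE id = ?", (new_status, build_id)
        )
        if new_status == "pending":
            conn.execute(
                "INSERT OR REPLACE INTO ci_pending_builds "
                "(build_id, project_id, protected, tags) VALUES (?, ?, ?, ?)",
                (build_id, project_id, protected, ",".join(sorted(tags))),
            )
        elif old_status == "pending":
            conn.execute(
                "DELETE FROM ci_pending_builds WHERE build_id = ?", (build_id,)
            )


def pending_build_ids(conn, project_ids, runner_tags):
    """Filter using only ci_pending_builds; ci_builds is never read here."""
    if not project_ids:
        return []
    placeholders = ",".join("?" * len(project_ids))
    rows = conn.execute(
        "SELECT build_id, tags FROM ci_pending_builds "
        f"WHERE project_id IN ({placeholders}) ORDER BY build_id",
        tuple(project_ids),
    ).fetchall()
    runner_tags = set(runner_tags)
    # A build matches when every one of its tags is present on the runner.
    return [
        build_id
        for build_id, tags in rows
        if set(filter(None, tags.split(","))) <= runner_tags
    ]


conn = connect()
conn.execute("INSERT INTO ci_builds (id, project_id, status) VALUES (1, 10, 'created')")
transition(conn, 1, "pending", tags=["docker"])
print(pending_build_ids(conn, [10], ["docker", "linux"]))  # [1]
transition(conn, 1, "running")
print(pending_build_ids(conn, [10], ["docker", "linux"]))  # []
```

Because the tags are copied into the row at enqueue time, the filter needs no join against taggings, and the wide ci_builds table is read only once a candidate build has been chosen.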

This acceleration structure is proposed as a follow-up to gitlab-com/gl-infra/production#3712 (closed). If designed properly, it could also be used for all future work on queueing.

It can be a way to improve performance today without spending a lot of effort: potentially a throw-away solution that (hopefully) has little impact on the codebase.

Edited by Grzegorz Bizon