Skip to content

Fetch next project for the container registry detail worker randomly

Adie (she/her) requested to merge 390842-add-sampling-for-next-project into master

What does this MR do and why?

In Restore missing container repositories under ex... (#390842), we introduced a worker that checks for missing container repositories with tags between Rails and the Container Registry. The worker tries to obtain a lease on the project it will work on and we fetch the project today with:

def next_project
  Project.pending_data_repair_analysis.first
end

And .pending_data_repair_analysis is defined in:

scope :pending_data_repair_analysis, -> do
  left_outer_joins(:container_registry_data_repair_detail)
  .where(container_registry_data_repair_details: { project_id: nil })
end

There can be multiple instances of this worker (max concurrency: 4) that are running and it can happen that the same project is picked up by multiple workers and they fail to obtain the lease on the project. We previously had the idea of adding randomization to decrease the chances of collision but we had it before in the database query which led to database optimization issues.

Now, we tried to increase the worker's concurrency (from 4 to 8) but we did not see more requests being made (link here) so a hunch is that multiple workers fetch the same next_project thus some of them fail to obtain the lease and just end.

So in this MR, we would like to add a bit of randomization when getting the next project for analysis. Instead of adding a randomization in the database query (which is a million rows), we will do it on a smaller sample with:

`Project.pending_data_repair_analysis.limit(max_running_jobs * 2).sample`

We chose to get a sample twice the size of the max number of jobs to have a big enough sample size that reduces the risk of collision and at the same not too big that slows down the query unnecessarily.

More information can be found in: #390842 (comment 1456833860)+

Database Analysis

Raw SQL
SELECT
    "projects".*
FROM
    "projects"
    LEFT OUTER JOIN "container_registry_data_repair_details" ON "container_registry_data_repair_details"."project_id" = "projects"."id"
WHERE
    "container_registry_data_repair_details"."project_id" IS NULL
LIMIT 10
Query Plan from Postgres.ai
 Limit  (cost=2.25..3.31 rows=10 width=795) (actual time=3546.150..3546.199 rows=10 loops=1)
   Buffers: shared hit=1080746
   I/O Timings: read=0.000 write=0.000
   ->  Merge Anti Join  (cost=2.25..3192906.35 rows=30084519 width=795) (actual time=3546.148..3546.195 rows=10 loops=1)
         Merge Cond: (projects.id = container_registry_data_repair_details.project_id)
         Buffers: shared hit=1080746
         I/O Timings: read=0.000 write=0.000
         ->  Index Scan using idx_projects_on_repository_storage_last_repository_updated_at on public.projects  (cost=0.56..3078579.99 rows=31188928 width=795) (actual time=0.031..3146.984 rows=1106848 loops=1)
               Buffers: shared hit=1059059
               I/O Timings: read=0.000 write=0.000
         ->  Index Only Scan using container_registry_data_repair_details_pkey on public.container_registry_data_repair_details  (cost=0.43..22548.93 rows=1104409 width=8) (actual time=0.033..160.618 rows=1106838 loops=1)
               Heap Fetches: 23472
               Buffers: shared hit=21687
               I/O Timings: read=0.000 write=0.000
Query Plan Summary
Summary:
  
Time: 3.547 s  
  - planning: 0.852 ms  
  - execution: 3.546 s  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 1080746 (~8.20 GiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0 

Details and visualization: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/20063/commands/65447.

Old query plan (without a limit and using .first): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/20077/commands/65486

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #390842

Edited by Adie (she/her)

Merge request reports