
Add worker to collect data on missing repositories

Adie (she/her) requested to merge 390842-add-registry-data-repair-worker into master

What does this MR do and why?

Rollout issue: #397505

Part of: Restore missing container repositories under ex... (#390842)

Implementation is discussed in #390842 (comment 1305310762), #390842 (comment 1333690944), and #390842 (comment 1335035423).

In this MR, we:

  1. Add a limited capacity worker: ContainerRegistry::RecordDataRepairDetailWorker

This limited capacity worker picks up a project that has not yet been analyzed and tries to obtain an exclusive lease on it. A project counts as not yet analyzed when it has no row in the new table (i.e. analysis has not been started, completed, or failed).

Once an exclusive lease is obtained, we insert an ongoing row into the new table so other parallel workers no longer pick the project up. We then query the sub-repositories endpoint page by page and calculate how many repositories are missing from the database, by comparing the number of fetched repositories with the number of the project's container repositories found in the database.

After we have gone through all the pages of the endpoint, we update the ongoing row to completed and record how many repositories were missing.

  2. Use the existing ContainerRegistry::CleanupWorker to call the limited capacity worker and to clean up stale ongoing rows in the new table.
  3. Add a feature flag to control this feature.
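The per-project flow above can be sketched as plain Ruby. This is a simplified, self-contained illustration, not the actual worker code: `repair_details` stands in for the new table, `pages` for the paginated sub-repositories endpoint, and `known_paths` for the project's container repositories already in the database.

```ruby
require 'set'

# Simulates one analysis run: mark the project ongoing, page through the
# (stubbed) sub-repositories endpoint counting paths missing from the
# database, then mark the row completed with the missing count.
def analyze_project(project_id, pages, known_paths, repair_details)
  # Insert an ongoing row first so parallel workers skip this project.
  repair_details[project_id] = { status: :ongoing, missing_count: 0 }

  missing = 0
  pages.each do |page|
    page.each { |path| missing += 1 unless known_paths.include?(path) }
  end

  # Record the result and mark the row completed.
  repair_details[project_id] = { status: :completed, missing_count: missing }
end
```

For example, with two pages returning three repository paths of which only one is known locally, the row ends up completed with a missing count of 2.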

Migration Results

main: == 20230404160131 AddStatusToDataRepairDetails: migrating =====================
main: -- add_column(:container_registry_data_repair_details, :status, :integer, {:default=>0, :null=>false, :limit=>2})
main:    -> 0.0022s
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0853s
main: -- index_exists?(:container_registry_data_repair_details, :status, {:name=>"index_container_registry_data_repair_details_on_status", :algorithm=>:concurrently})
main:    -> 0.0013s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0004s
main: -- add_index(:container_registry_data_repair_details, :status, {:name=>"index_container_registry_data_repair_details_on_status", :algorithm=>:concurrently})
main:    -> 0.0016s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20230404160131 AddStatusToDataRepairDetails: migrated (0.1046s) ============

Database Analysis

  1. New scope: Project.pending_data_repair_analysis (not evaluated on the database by itself; it is only evaluated via items 2 and 3 below)
Raw SQL
SELECT
    "projects".*
FROM
    "projects"
    LEFT OUTER JOIN "container_registry_data_repair_details" ON "container_registry_data_repair_details"."project_id" = "projects"."id"
WHERE
    "container_registry_data_repair_details"."project_id" IS NULL
  2. Project.pending_data_repair_analysis.exists? (one evaluation of the scope in 1.)
Raw SQL
SELECT
    1 AS one
FROM
    "projects"
    LEFT OUTER JOIN "container_registry_data_repair_details" ON "container_registry_data_repair_details"."project_id" = "projects"."id"
WHERE
    "container_registry_data_repair_details"."project_id" IS NULL
LIMIT 1

Query Plan
 Limit  (cost=0.15..0.50 rows=1 width=4) (actual time=0.046..0.047 rows=1 loops=1)
   Buffers: shared hit=6
   I/O Timings: read=0.000 write=0.000
   ->  Nested Loop Anti Join  (cost=0.15..10292748.20 rows=29594690 width=4) (actual time=0.045..0.045 rows=1 loops=1)
         Buffers: shared hit=6
         I/O Timings: read=0.000 write=0.000
         ->  Seq Scan on public.projects  (cost=0.00..5261373.40 rows=29596140 width=4) (actual time=0.011..0.011 rows=1 loops=1)
               Buffers: shared hit=1
               I/O Timings: read=0.000 write=0.000
         ->  Index Only Scan using container_registry_data_repair_details_pkey on public.container_registry_data_repair_details  (cost=0.15..0.17 rows=1 width=8) (actual time=0.031..0.031 rows=0 loops=1)
               Index Cond: (container_registry_data_repair_details.project_id = projects.id)
               Heap Fetches: 0
               Buffers: shared hit=5
               I/O Timings: read=0.000 write=0.000

Recommendations:
✅ Looks good
Summary:
  
Time: 0.812 ms  
  - planning: 0.730 ms  
  - execution: 0.082 ms  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 6 (~48.00 KiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
  
Details and visualization: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/17343/commands/58173.
  3. Project.pending_data_repair_analysis.limit(max_running_jobs + 1).count where max_running_jobs = 2.
Raw SQL:
SELECT
    COUNT(*)
FROM (
    SELECT
        1 AS one
    FROM
        "projects"
    LEFT OUTER JOIN "container_registry_data_repair_details" ON "container_registry_data_repair_details"."project_id" = "projects"."id"
WHERE
    "container_registry_data_repair_details"."project_id" IS NULL
LIMIT 3) subquery_for_count
Query Plan
 Aggregate  (cost=0.71..0.72 rows=1 width=8) (actual time=7.101..7.105 rows=1 loops=1)
   Buffers: shared hit=1 read=6
   I/O Timings: read=6.792 write=0.000
   ->  Limit  (cost=0.59..0.67 rows=3 width=4) (actual time=4.924..7.088 rows=3 loops=1)
         Buffers: shared hit=1 read=6
         I/O Timings: read=6.792 write=0.000
         ->  Merge Anti Join  (cost=0.59..772596.20 rows=29594690 width=4) (actual time=4.922..7.070 rows=3 loops=1)
               Merge Cond: (projects.id = container_registry_data_repair_details.project_id)
               Buffers: shared hit=1 read=6
               I/O Timings: read=6.792 write=0.000
               ->  Index Only Scan using projects_pkey on public.projects  (cost=0.44..698549.32 rows=29596140 width=4) (actual time=4.883..7.023 rows=3 loops=1)
                     Heap Fetches: 0
                     Buffers: shared read=6
                     I/O Timings: read=6.792 write=0.000
               ->  Index Only Scan using container_registry_data_repair_details_pkey on public.container_registry_data_repair_details  (cost=0.15..38.40 rows=1450 width=8) (actual time=0.033..0.033 rows=0 loops=1)
                     Heap Fetches: 0
                     Buffers: shared hit=1
                     I/O Timings: read=0.000 write=0.000



Other artifacts are provided below
Recommendations:
✅ Looks good
Summary:
  
Time: 8.177 ms  
  - planning: 0.848 ms  
  - execution: 7.329 ms  
    - I/O read: 6.792 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 1 (~8.00 KiB) from the buffer pool  
  - reads: 6 (~48.00 KiB) from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
  
Details and visualization: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/17343/commands/58176.
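For reference, a scope along these lines would generate the LEFT OUTER JOIN / IS NULL anti-join analyzed above. This is a sketch only; it assumes a has_one association named :container_registry_data_repair_detail on Project, which may not match the actual association name in the MR.

```ruby
# Sketch of the Project scope behind the queries above (assumed names).
scope :pending_data_repair_analysis, -> do
  left_outer_joins(:container_registry_data_repair_detail)
    .where(container_registry_data_repair_details: { project_id: nil })
end
```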

How to set up and validate locally

  1. Make sure your local container registry supports the GitLab API. The following should return true:
ContainerRegistry::GitlabApiClient.supports_gitlab_api? #=> true
  2. Enable the feature flag:
Feature.enable(:registry_data_repair_worker)
  3. To run the limited capacity worker, run the following in the console:
ContainerRegistry::RecordDataRepairDetailWorker.perform_with_capacity
  4. The cron worker also runs every 5 minutes (:00, :05, :10, :15, :20), so even without manually running the limited capacity worker, it will be executed. You can follow the logs via gdk tail rails-background-jobs.
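The steps above can be run as a single Rails console session. Note the final inspection line is an assumption: it presumes the new table is backed by a model named ContainerRegistry::DataRepairDetail, which may differ in the actual MR.

```ruby
# In a GDK Rails console:
ContainerRegistry::GitlabApiClient.supports_gitlab_api? # should be true
Feature.enable(:registry_data_repair_worker)
ContainerRegistry::RecordDataRepairDetailWorker.perform_with_capacity

# Assumed model name: inspect the rows recorded by the worker.
ContainerRegistry::DataRepairDetail.pluck(:project_id, :status)
```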

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #390842

Edited by Adie (she/her)
