
Add worker to collect data on missing repositories

Adie (she/her) requested to merge 390842-add-registry-data-repair-worker into master

What does this MR do and why?

Rollout issue: #397505

Part of: Restore missing container repositories under ex... (#390842)

Implementation is discussed in #390842 (comment 1305310762), #390842 (comment 1333690944), and #390842 (comment 1335035423).

In this MR, we:

  1. Add a limited capacity worker: ContainerRegistry::RecordDataRepairDetailWorker

This limited capacity worker picks up a project that has not yet been analyzed and tries to obtain an exclusive lease on it. A project counts as not yet analyzed when it has no row in the new table (i.e. analysis has not been started, completed, or failed).

Once an exclusive lease is obtained, we insert an ongoing row into the new table so other parallel workers no longer pick the project up. We then query the sub-repositories endpoint page by page and calculate how many repositories are missing from the database, by comparing the number of fetched repositories with the number of the project's container repositories found in the database.

After we have gone through all the pages of the endpoint, we update the ongoing row to completed and record how many repositories were missing.

  2. Use the existing ContainerRegistry::CleanupWorker to call the limited capacity worker and to clean up stale ongoing rows in the new table.
  3. Add a feature flag to control this feature.
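The per-project flow above can be sketched as plain Ruby. This is a simplified, self-contained illustration, not the actual worker code: `repair_details` stands in for the new table, `pages` for the paginated sub-repositories endpoint, and `known_paths` for the project's container repositories already in the database.

```ruby
require 'set'

# Simulates one analysis run: mark the project ongoing, page through the
# (stubbed) sub-repositories endpoint counting paths missing from the
# database, then mark the row completed with the missing count.
def analyze_project(project_id, pages, known_paths, repair_details)
  # Insert an ongoing row first so parallel workers skip this project.
  repair_details[project_id] = { status: :ongoing, missing_count: 0 }

  missing = 0
  pages.each do |page|
    page.each { |path| missing += 1 unless known_paths.include?(path) }
  end

  # Record the result and mark the row completed.
  repair_details[project_id] = { status: :completed, missing_count: missing }
end
```

For example, with two pages returning three repository paths of which only one is known locally, the row ends up completed with a missing count of 2.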

Migration Results

main: == 20230404160131 AddStatusToDataRepairDetails: migrating =====================
main: -- add_column(:container_registry_data_repair_details, :status, :integer, {:default=>0, :null=>false, :limit=>2})
main:    -> 0.0022s
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0853s
main: -- index_exists?(:container_registry_data_repair_details, :status, {:name=>"index_container_registry_data_repair_details_on_status", :algorithm=>:concurrently})
main:    -> 0.0013s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0004s
main: -- add_index(:container_registry_data_repair_details, :status, {:name=>"index_container_registry_data_repair_details_on_status", :algorithm=>:concurrently})
main:    -> 0.0016s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20230404160131 AddStatusToDataRepairDetails: migrated (0.1046s) ============

Database Analysis

  1. New scope: Project.pending_data_repair_analysis (not evaluated on the database by itself; it is only evaluated via items 2 and 3 below)
Raw SQL
SELECT
    "projects".*
FROM
    "projects"
    LEFT OUTER JOIN "container_registry_data_repair_details" ON "container_registry_data_repair_details"."project_id" = "projects"."id"
WHERE
    "container_registry_data_repair_details"."project_id" IS NULL
  2. Project.pending_data_repair_analysis.exists? (one evaluation of the scope in 1.)
Raw SQL
SELECT
    1 AS one
FROM
    "projects"
    LEFT OUTER JOIN "container_registry_data_repair_details" ON "container_registry_data_repair_details"."project_id" = "projects"."id"
WHERE
    "container_registry_data_repair_details"."project_id" IS NULL
LIMIT 1

Query Plan
 Limit  (cost=0.15..0.50 rows=1 width=4) (actual time=0.046..0.047 rows=1 loops=1)
   Buffers: shared hit=6
   I/O Timings: read=0.000 write=0.000
   ->  Nested Loop Anti Join  (cost=0.15..10292748.20 rows=29594690 width=4) (actual time=0.045..0.045 rows=1 loops=1)
         Buffers: shared hit=6
         I/O Timings: read=0.000 write=0.000
         ->  Seq Scan on public.projects  (cost=0.00..5261373.40 rows=29596140 width=4) (actual time=0.011..0.011 rows=1 loops=1)
               Buffers: shared hit=1
               I/O Timings: read=0.000 write=0.000
         ->  Index Only Scan using container_registry_data_repair_details_pkey on public.container_registry_data_repair_details  (cost=0.15..0.17 rows=1 width=8) (actual time=0.031..0.031 rows=0 loops=1)
               Index Cond: (container_registry_data_repair_details.project_id = projects.id)
               Heap Fetches: 0
               Buffers: shared hit=5
               I/O Timings: read=0.000 write=0.000

Recommendations:
✅ Looks good
Summary:
  
Time: 0.812 ms  
  - planning: 0.730 ms  
  - execution: 0.082 ms  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 6 (~48.00 KiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
  
Details and visualization: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/17343/commands/58173.
  3. Project.pending_data_repair_analysis.limit(max_running_jobs + 1).count where max_running_jobs = 2.
Raw SQL:
SELECT
    COUNT(*)
FROM (
    SELECT
        1 AS one
    FROM
        "projects"
    LEFT OUTER JOIN "container_registry_data_repair_details" ON "container_registry_data_repair_details"."project_id" = "projects"."id"
WHERE
    "container_registry_data_repair_details"."project_id" IS NULL
LIMIT 3) subquery_for_count
Query Plan
 Aggregate  (cost=0.71..0.72 rows=1 width=8) (actual time=7.101..7.105 rows=1 loops=1)
   Buffers: shared hit=1 read=6
   I/O Timings: read=6.792 write=0.000
   ->  Limit  (cost=0.59..0.67 rows=3 width=4) (actual time=4.924..7.088 rows=3 loops=1)
         Buffers: shared hit=1 read=6
         I/O Timings: read=6.792 write=0.000
         ->  Merge Anti Join  (cost=0.59..772596.20 rows=29594690 width=4) (actual time=4.922..7.070 rows=3 loops=1)
               Merge Cond: (projects.id = container_registry_data_repair_details.project_id)
               Buffers: shared hit=1 read=6
               I/O Timings: read=6.792 write=0.000
               ->  Index Only Scan using projects_pkey on public.projects  (cost=0.44..698549.32 rows=29596140 width=4) (actual time=4.883..7.023 rows=3 loops=1)
                     Heap Fetches: 0
                     Buffers: shared read=6
                     I/O Timings: read=6.792 write=0.000
               ->  Index Only Scan using container_registry_data_repair_details_pkey on public.container_registry_data_repair_details  (cost=0.15..38.40 rows=1450 width=8) (actual time=0.033..0.033 rows=0 loops=1)
                     Heap Fetches: 0
                     Buffers: shared hit=1
                     I/O Timings: read=0.000 write=0.000



Other artifacts are provided below
Recommendations:
✅ Looks good
Summary:
  
Time: 8.177 ms  
  - planning: 0.848 ms  
  - execution: 7.329 ms  
    - I/O read: 6.792 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 1 (~8.00 KiB) from the buffer pool  
  - reads: 6 (~48.00 KiB) from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
  
Details and visualization: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/17343/commands/58176.
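For reference, a scope along these lines would generate the LEFT OUTER JOIN / IS NULL anti-join analyzed above. This is a sketch only; it assumes a has_one association named :container_registry_data_repair_detail on Project, which may not match the actual association name in the MR.

```ruby
# Sketch of the Project scope behind the queries above (assumed names).
scope :pending_data_repair_analysis, -> do
  left_outer_joins(:container_registry_data_repair_detail)
    .where(container_registry_data_repair_details: { project_id: nil })
end
```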

How to set up and validate locally

  1. Make sure your local container registry supports the GitLab API. The following should return true:
ContainerRegistry::GitlabApiClient.supports_gitlab_api? #=> true
  2. Enable the feature flag:
Feature.enable(:registry_data_repair_worker)
  3. To run the limited capacity worker, run the following in the console:
ContainerRegistry::RecordDataRepairDetailWorker.perform_with_capacity
  4. The cron worker also runs every 5 minutes (:00, :05, :10, :15, :20), so even without manually running the limited capacity worker, it will be executed. You can follow the logs via gdk tail rails-background-jobs.
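The steps above can be run as a single Rails console session. Note the final inspection line is an assumption: it presumes the new table is backed by a model named ContainerRegistry::DataRepairDetail, which may differ in the actual MR.

```ruby
# In a GDK Rails console:
ContainerRegistry::GitlabApiClient.supports_gitlab_api? # should be true
Feature.enable(:registry_data_repair_worker)
ContainerRegistry::RecordDataRepairDetailWorker.perform_with_capacity

# Assumed model name: inspect the rows recorded by the worker.
ContainerRegistry::DataRepairDetail.pluck(:project_id, :status)
```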

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #390842

Edited by Adie (she/her)
