
Retry failed or stuck batched migration jobs

What does this MR do?

Retries failed or stuck jobs after all batched migration jobs have been run.

A job can become stuck if, for example, Sidekiq is killed before it can update the job's status to success or failure, or if the query that updates the status times out.
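
To make the two retry conditions concrete, here is a minimal plain-Ruby sketch of the check, not the code from this MR. The `Job` struct, the helper names, and the one-hour `STUCK_AFTER` threshold are assumptions for illustration; the status integers and the `attempts < 3` limit mirror the queries in this description, and the names given to statuses 0 and 1 are inferred from the "Active job existence" query.

```ruby
# Hypothetical stand-in for a row of batched_background_migration_jobs.
Job = Struct.new(:id, :status, :attempts, :updated_at, keyword_init: true)

# Status integers as used in the queries of this MR; the names are assumed.
PENDING = 0
RUNNING = 1
FAILED  = 2

MAX_ATTEMPTS = 3        # from the `attempts < 3` condition in the query
STUCK_AFTER  = 60 * 60  # seconds; assumed value, not stated in this MR

# A job is treated as stuck when it is still pending or running but its
# row has not been updated since the cutoff.
def stuck?(job, now)
  [PENDING, RUNNING].include?(job.status) && job.updated_at <= now - STUCK_AFTER
end

# A failed job may be retried only while it has attempts left.
def retriable_failure?(job)
  job.status == FAILED && job.attempts < MAX_ATTEMPTS
end

def retriable?(job, now: Time.now)
  retriable_failure?(job) || stuck?(job, now)
end
```

Under these assumptions, a failed job with three attempts is no longer picked up, while a running job whose `updated_at` is older than the cutoff is.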

Related to #327405 (closed)

Queries

Retriable jobs
SELECT "batched_background_migration_jobs".* FROM (
  (SELECT "batched_background_migration_jobs".*
   FROM "batched_background_migration_jobs"
   WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
     AND "batched_background_migration_jobs"."status" = 2 AND (attempts < 3)
  )
  UNION
  (SELECT "batched_background_migration_jobs".*
   FROM "batched_background_migration_jobs"
   WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
   AND "batched_background_migration_jobs"."status" IN (0, 1) AND (updated_at <= '2021-04-27 05:47:18.957092')
  )
) batched_background_migration_jobs
WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
ORDER BY "batched_background_migration_jobs"."id" ASC
LIMIT 1
Time: 2.002 ms  
  - planning: 1.214 ms  
  - execution: 0.788 ms  
    - I/O read: 0.060 ms  
    - I/O write: N/A  
  
Shared buffers:  
  - hits: 28 (~224.00 KiB) from the buffer pool  
  - reads: 2 (~16.00 KiB) from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
 Limit  (cost=15.34..15.34 rows=1 width=112) (actual time=0.305..0.306 rows=1 loops=1)
   Buffers: shared hit=28 read=2
   I/O Timings: read=0.060
   ->  Sort  (cost=15.34..15.37 rows=11 width=112) (actual time=0.303..0.304 rows=1 loops=1)
         Sort Key: batched_background_migration_jobs.id
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=28 read=2
         I/O Timings: read=0.060
         ->  HashAggregate  (cost=15.06..15.17 rows=11 width=112) (actual time=0.263..0.270 rows=14 loops=1)
               Group Key: batched_background_migration_jobs.id, batched_background_migration_jobs.created_at, batched_background_migration_jobs.updated_at, batched_background_migration_jobs.started_at, batched_background_migration_jobs.finished_at, batched_background_migration_jobs.batched_background_migration_id, batched_background_migration_jobs.min_value, batched_background_migration_jobs.max_value, batched_background_migration_jobs.batch_size, batched_background_migration_jobs.sub_batch_size, batched_background_migration_jobs.status, batched_background_migration_jobs.attempts, batched_background_migration_jobs.metrics, batched_background_migration_jobs.pause_ms
               Buffers: shared hit=25 read=2
               I/O Timings: read=0.060
               ->  Append  (cost=0.29..14.68 rows=11 width=112) (actual time=0.110..0.237 rows=14 loops=1)
                     Buffers: shared hit=25 read=2
                     I/O Timings: read=0.060
                     ->  Index Scan using index_batched_jobs_on_batched_migration_id_and_status on public.batched_background_migration_jobs  (cost=0.29..3.31 rows=1 width=143) (actual time=0.109..0.110 rows=1 loops=1)
                           Index Cond: ((batched_background_migration_jobs.batched_background_migration_id = 1) AND (batched_background_migration_jobs.status = 2))
                           Filter: (batched_background_migration_jobs.attempts < 3)
                           Rows Removed by Filter: 0
                           Buffers: shared hit=7 read=2
                           I/O Timings: read=0.060
                     ->  Index Scan using index_batched_jobs_on_batched_migration_id_and_status on public.batched_background_migration_jobs batched_background_migration_jobs_1  (cost=0.29..11.21 rows=10 width=143) (actual time=0.035..0.123 rows=13 loops=1)
                           Index Cond: ((batched_background_migration_jobs_1.batched_background_migration_id = 1) AND (batched_background_migration_jobs_1.status = ANY ('{0,1}'::integer[])))
                           Filter: (batched_background_migration_jobs_1.updated_at <= '2021-04-27 05:47:18.957092+00'::timestamp with time zone)
                           Rows Removed by Filter: 0
                           Buffers: shared hit=18
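
For readers who prefer code to query plans, the selection performed by the query above can be approximated in memory as follows. This is a sketch under stated assumptions, not the model code from this MR: `Job` and `next_retriable_job` are hypothetical names, and `stuck_before` stands in for the `updated_at` cutoff that the query receives as a literal timestamp.

```ruby
# Hypothetical stand-in for a row of batched_background_migration_jobs.
Job = Struct.new(:id, :migration_id, :status, :attempts, :updated_at, keyword_init: true)

# Mirrors the query: UNION of (failed jobs with attempts left) and
# (pending/running jobs not updated since the cutoff), ordered by id, LIMIT 1.
def next_retriable_job(jobs, migration_id:, stuck_before:, max_attempts: 3)
  scoped = jobs.select { |job| job.migration_id == migration_id }

  failed = scoped.select { |job| job.status == 2 && job.attempts < max_attempts }
  stuck  = scoped.select { |job| [0, 1].include?(job.status) && job.updated_at <= stuck_before }

  # Array#| removes duplicates, like UNION; min_by(&:id) plays the role of
  # ORDER BY id ASC LIMIT 1 and returns nil when nothing qualifies.
  (failed | stuck).min_by(&:id)
end
```

The two branches cannot overlap because they filter on different status values, so the deduplication is only there to keep the sketch faithful to `UNION`.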
Active job existence
SELECT 1 AS one
FROM "batched_background_migration_jobs"
WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
  AND "batched_background_migration_jobs"."status" IN (0, 1)
LIMIT 1
Time: 0.256 ms  
  - planning: 0.123 ms  
  - execution: 0.133 ms  
    - I/O read: N/A  
    - I/O write: N/A  
  
Shared buffers:  
  - hits: 14 (~112.00 KiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
 Limit  (cost=0.29..0.78 rows=1 width=4) (actual time=0.112..0.113 rows=1 loops=1)
   Buffers: shared hit=14
   ->  Index Only Scan using index_batched_jobs_on_batched_migration_id_and_status on public.batched_background_migration_jobs  (cost=0.29..5.27 rows=10 width=4) (actual time=0.111..0.111 rows=1 loops=1)
         Index Cond: ((batched_background_migration_jobs.batched_background_migration_id = 1) AND (batched_background_migration_jobs.status = ANY ('{0,1}'::integer[])))
         Heap Fetches: 0
         Buffers: shared hit=14
Failed job existence
SELECT 1 AS one FROM "batched_background_migration_jobs"
WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
  AND "batched_background_migration_jobs"."status" = 2
LIMIT 1
Time: 0.248 ms  
  - planning: 0.133 ms  
  - execution: 0.115 ms  
    - I/O read: N/A  
    - I/O write: N/A  
  
Shared buffers:  
  - hits: 9 (~72.00 KiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0 
 Limit  (cost=0.29..3.30 rows=1 width=4) (actual time=0.096..0.097 rows=1 loops=1)
   Buffers: shared hit=9
   ->  Index Only Scan using index_batched_jobs_on_batched_migration_id_and_status on public.batched_background_migration_jobs  (cost=0.29..3.30 rows=1 width=4) (actual time=0.094..0.095 rows=1 loops=1)
         Index Cond: ((batched_background_migration_jobs.batched_background_migration_id = 1) AND (batched_background_migration_jobs.status = 2))
         Heap Fetches: 0
         Buffers: shared hit=9

Migration output

== 20210427062807 AddIndexToBatchedMigrationJobsStatus: migrating =============
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:batched_background_migration_jobs, [:batched_background_migration_id, :status], {:name=>"index_batched_jobs_on_batched_migration_id_and_status", :algorithm=>:concurrently})
   -> 0.0030s
-- add_index(:batched_background_migration_jobs, [:batched_background_migration_id, :status], {:name=>"index_batched_jobs_on_batched_migration_id_and_status", :algorithm=>:concurrently})
   -> 0.0060s
== 20210427062807 AddIndexToBatchedMigrationJobsStatus: migrated (0.0102s) ====
== 20210427062807 AddIndexToBatchedMigrationJobsStatus: reverting =============
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:batched_background_migration_jobs, [:batched_background_migration_id, :status], {:name=>"index_batched_jobs_on_batched_migration_id_and_status", :algorithm=>:concurrently})
   -> 0.0033s
-- remove_index(:batched_background_migration_jobs, {:name=>"index_batched_jobs_on_batched_migration_id_and_status", :algorithm=>:concurrently, :column=>[:batched_background_migration_id, :status]})
   -> 0.0053s
== 20210427062807 AddIndexToBatchedMigrationJobsStatus: reverted (0.0098s) ====


Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

Edited by Heinrich Lee Yu
