Retry failed or stuck batched migration jobs
What does this MR do?
After all batched migration jobs have been run, retries the jobs that failed or got stuck.
A job can get stuck if, for example, Sidekiq is killed before it can update the job's status to succeeded or failed, or if the query that updates the status times out.
Related to #327405 (closed)
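The selection rule behind the "Retriable jobs" query can be sketched in plain Ruby. This is only an illustration, not the code in this MR: the `Job` struct, the constant names, and the `retriable?` / `next_retriable_job` helpers are hypothetical. Only the integer status values, the `attempts < 3` limit, and the `updated_at` cutoff come from the SQL.

```ruby
# Hypothetical in-memory stand-in for a row of batched_background_migration_jobs.
Job = Struct.new(:id, :status, :attempts, :updated_at, keyword_init: true)

# Integer values are taken from the query; the names are assumptions.
PENDING = 0
RUNNING = 1
FAILED  = 2

MAX_ATTEMPTS = 3 # mirrors `attempts < 3` in the query

# A job is retriable if it failed with attempts left, or if it is still
# pending/running but has not been updated since the cutoff (stuck).
def retriable?(job, stuck_cutoff)
  failed_with_attempts_left = job.status == FAILED && job.attempts < MAX_ATTEMPTS
  stuck = [PENDING, RUNNING].include?(job.status) && job.updated_at <= stuck_cutoff
  failed_with_attempts_left || stuck
end

# Mirrors `ORDER BY id ASC LIMIT 1`: the retriable job with the lowest id,
# or nil when there is nothing to retry.
def next_retriable_job(jobs, stuck_cutoff)
  jobs.select { |job| retriable?(job, stuck_cutoff) }.min_by(&:id)
end
```

The SQL expresses the same condition as a `UNION` of two index scans rather than a single `OR`, which is what the plan under "Retriable jobs" shows.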
Queries
Retriable jobs
SELECT "batched_background_migration_jobs".* FROM (
  (SELECT "batched_background_migration_jobs".*
   FROM "batched_background_migration_jobs"
   WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
     AND "batched_background_migration_jobs"."status" = 2 AND (attempts < 3)
  )
  UNION
  (SELECT "batched_background_migration_jobs".*
   FROM "batched_background_migration_jobs"
   WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
     AND "batched_background_migration_jobs"."status" IN (0, 1) AND (updated_at <= '2021-04-27 05:47:18.957092')
  )
) batched_background_migration_jobs
WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
ORDER BY "batched_background_migration_jobs"."id" ASC
LIMIT 1
Time: 2.002 ms
- planning: 1.214 ms
- execution: 0.788 ms
- I/O read: 0.060 ms
- I/O write: N/A
Shared buffers:
- hits: 28 (~224.00 KiB) from the buffer pool
- reads: 2 (~16.00 KiB) from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
Limit (cost=15.34..15.34 rows=1 width=112) (actual time=0.305..0.306 rows=1 loops=1)
Buffers: shared hit=28 read=2
I/O Timings: read=0.060
-> Sort (cost=15.34..15.37 rows=11 width=112) (actual time=0.303..0.304 rows=1 loops=1)
Sort Key: batched_background_migration_jobs.id
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=28 read=2
I/O Timings: read=0.060
-> HashAggregate (cost=15.06..15.17 rows=11 width=112) (actual time=0.263..0.270 rows=14 loops=1)
Group Key: batched_background_migration_jobs.id, batched_background_migration_jobs.created_at, batched_background_migration_jobs.updated_at, batched_background_migration_jobs.started_at, batched_background_migration_jobs.finished_at, batched_background_migration_jobs.batched_background_migration_id, batched_background_migration_jobs.min_value, batched_background_migration_jobs.max_value, batched_background_migration_jobs.batch_size, batched_background_migration_jobs.sub_batch_size, batched_background_migration_jobs.status, batched_background_migration_jobs.attempts, batched_background_migration_jobs.metrics, batched_background_migration_jobs.pause_ms
Buffers: shared hit=25 read=2
I/O Timings: read=0.060
-> Append (cost=0.29..14.68 rows=11 width=112) (actual time=0.110..0.237 rows=14 loops=1)
Buffers: shared hit=25 read=2
I/O Timings: read=0.060
-> Index Scan using index_batched_jobs_on_batched_migration_id_and_status on public.batched_background_migration_jobs (cost=0.29..3.31 rows=1 width=143) (actual time=0.109..0.110 rows=1 loops=1)
Index Cond: ((batched_background_migration_jobs.batched_background_migration_id = 1) AND (batched_background_migration_jobs.status = 2))
Filter: (batched_background_migration_jobs.attempts < 3)
Rows Removed by Filter: 0
Buffers: shared hit=7 read=2
I/O Timings: read=0.060
-> Index Scan using index_batched_jobs_on_batched_migration_id_and_status on public.batched_background_migration_jobs batched_background_migration_jobs_1 (cost=0.29..11.21 rows=10 width=143) (actual time=0.035..0.123 rows=13 loops=1)
Index Cond: ((batched_background_migration_jobs_1.batched_background_migration_id = 1) AND (batched_background_migration_jobs_1.status = ANY ('{0,1}'::integer[])))
Filter: (batched_background_migration_jobs_1.updated_at <= '2021-04-27 05:47:18.957092+00'::timestamp with time zone)
Rows Removed by Filter: 0
Buffers: shared hit=18
Active job existence
SELECT 1 AS one
FROM "batched_background_migration_jobs"
WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
AND "batched_background_migration_jobs"."status" IN (0, 1)
LIMIT 1
Time: 0.256 ms
- planning: 0.123 ms
- execution: 0.133 ms
- I/O read: N/A
- I/O write: N/A
Shared buffers:
- hits: 14 (~112.00 KiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
Limit (cost=0.29..0.78 rows=1 width=4) (actual time=0.112..0.113 rows=1 loops=1)
Buffers: shared hit=14
-> Index Only Scan using index_batched_jobs_on_batched_migration_id_and_status on public.batched_background_migration_jobs (cost=0.29..5.27 rows=10 width=4) (actual time=0.111..0.111 rows=1 loops=1)
Index Cond: ((batched_background_migration_jobs.batched_background_migration_id = 1) AND (batched_background_migration_jobs.status = ANY ('{0,1}'::integer[])))
Heap Fetches: 0
Buffers: shared hit=14
Failed job existence
SELECT 1 AS one FROM "batched_background_migration_jobs"
WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 1
AND "batched_background_migration_jobs"."status" = 2
LIMIT 1
Time: 0.248 ms
- planning: 0.133 ms
- execution: 0.115 ms
- I/O read: N/A
- I/O write: N/A
Shared buffers:
- hits: 9 (~72.00 KiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
Limit (cost=0.29..3.30 rows=1 width=4) (actual time=0.096..0.097 rows=1 loops=1)
Buffers: shared hit=9
-> Index Only Scan using index_batched_jobs_on_batched_migration_id_and_status on public.batched_background_migration_jobs (cost=0.29..3.30 rows=1 width=4) (actual time=0.094..0.095 rows=1 loops=1)
Index Cond: ((batched_background_migration_jobs.batched_background_migration_id = 1) AND (batched_background_migration_jobs.status = 2))
Heap Fetches: 0
Buffers: shared hit=9
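The description does not say how the two existence checks are combined. One plausible reading, sketched below as an assumption, is that they decide what happens to a migration once no further job can be picked up: nothing while jobs are still active, failed if any job ended as failed, finished otherwise. The `migration_outcome` helper, the constant names, and the returned symbols are hypothetical; only the status integers come from the queries.

```ruby
# Integer values come from the queries above; the names are assumptions.
ACTIVE_STATUSES = [0, 1].freeze # `status IN (0, 1)` in "Active job existence"
FAILED_STATUS   = 2             # `status = 2` in "Failed job existence"

# Hypothetical helper: takes the statuses of all jobs of one migration and
# returns what should happen to the migration once no job is left to run.
def migration_outcome(job_statuses)
  # Active job existence: something is still pending or running, so wait.
  return :in_progress if job_statuses.any? { |status| ACTIVE_STATUSES.include?(status) }

  # Failed job existence: at least one job failed and was not retried successfully.
  return :failed if job_statuses.include?(FAILED_STATUS)

  :finished
end
```

Both checks only need to know whether a matching row exists, which is why each query selects a constant with `LIMIT 1` and can be answered by an index-only scan on `(batched_background_migration_id, status)`.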
Migration output
== 20210427062807 AddIndexToBatchedMigrationJobsStatus: migrating =============
-- transaction_open?()
-> 0.0000s
-- index_exists?(:batched_background_migration_jobs, [:batched_background_migration_id, :status], {:name=>"index_batched_jobs_on_batched_migration_id_and_status", :algorithm=>:concurrently})
-> 0.0030s
-- add_index(:batched_background_migration_jobs, [:batched_background_migration_id, :status], {:name=>"index_batched_jobs_on_batched_migration_id_and_status", :algorithm=>:concurrently})
-> 0.0060s
== 20210427062807 AddIndexToBatchedMigrationJobsStatus: migrated (0.0102s) ====
== 20210427062807 AddIndexToBatchedMigrationJobsStatus: reverting =============
-- transaction_open?()
-> 0.0000s
-- index_exists?(:batched_background_migration_jobs, [:batched_background_migration_id, :status], {:name=>"index_batched_jobs_on_batched_migration_id_and_status", :algorithm=>:concurrently})
-> 0.0033s
-- remove_index(:batched_background_migration_jobs, {:name=>"index_batched_jobs_on_batched_migration_id_and_status", :algorithm=>:concurrently, :column=>[:batched_background_migration_id, :status]})
-> 0.0053s
== 20210427062807 AddIndexToBatchedMigrationJobsStatus: reverted (0.0098s) ====
Does this MR meet the acceptance criteria?
Conformity
- 📋 Does this MR need a changelog?
  - I have included a changelog entry.
  - I have not included a changelog entry because _____.
- Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- Tested in all supported browsers
- Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- Label as security and @ mention @gitlab-com/gl-security/appsec
- The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- Security reports checked/validated by a reviewer from the AppSec team
Edited by Heinrich Lee Yu