Batched Background Migration Job records not being updated
We had 5 failed runs for the BatchedBackgroundMigrationWorker
(out of 1080).
All of the failed runs were caused by statement timeouts on the select query that finds the upper bound (stop) of the batch in each_batch
.
Those errors are very sparse and not much of a concern as they happen while the BatchedMigrationRunner
tries to find the next batch range and before it creates a BatchedJob
, so the next execution is not affected.
Issue
But while I was checking this, I run a query over the batched_background_migration_jobs
:
gitlabhq_production=> SELECT status, COUNT(*) FROM batched_background_migration_jobs GROUP BY status;
status | count
--------+-------
1 | 11
3 | 4275
gitlabhq_production=> SELECT * FROM batched_background_migration_jobs WHERE status = 1;
id | created_at | updated_at | started_at | finished_at | batched_background_migration_id | min_value | max_value | batch_size | sub_batch_size | status | attempts | metrics
------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+---------------------------------+-----------+-----------+------------+----------------+--------+----------+---------
-- 10 NOT updated jobs - stuck at running
685 | 2021-04-05 09:50:02.00755+00 | 2021-04-05 09:50:35.329187+00 | 2021-04-05 09:50:02.014176+00 | 2021-04-05 09:50:35.328527+00 | 1 | 116151373 | 116208906 | 50000 | 200 | 1 | 1 | {}
2223 | 2021-04-07 13:37:04.390141+00 | 2021-04-07 13:37:26.958784+00 | 2021-04-07 13:37:04.398889+00 | 2021-04-07 13:37:26.958107+00 | 1 | 220408701 | 220630870 | 200000 | 500 | 1 | 1 | {}
2300 | 2021-04-07 16:57:01.971842+00 | 2021-04-07 16:57:26.540344+00 | 2021-04-07 16:57:01.981515+00 | 2021-04-07 16:57:26.539665+00 | 1 | 237600007 | 237824454 | 200000 | 500 | 1 | 1 | {}
2394 | 2021-04-07 20:14:02.044816+00 | 2021-04-07 20:14:34.320387+00 | 2021-04-07 20:14:02.055119+00 | 2021-04-07 20:14:34.319405+00 | 1 | 258675277 | 258899689 | 200000 | 500 | 1 | 1 | {}
2561 | 2021-04-08 01:52:02.332268+00 | 2021-04-08 01:53:09.779131+00 | 2021-04-08 01:52:02.337932+00 | 2021-04-08 01:53:09.778284+00 | 1 | 295972238 | 296189933 | 200000 | 500 | 1 | 1 | {}
2772 | 2021-04-08 09:00:01.608604+00 | 2021-04-08 09:00:01.620031+00 | 2021-04-08 09:00:01.619302+00 | | 1 | 342298978 | 342524610 | 200000 | 500 | 1 | 1 | {}
2819 | 2021-04-08 10:45:16.084115+00 | 2021-04-08 10:46:48.956071+00 | 2021-04-08 10:45:16.090629+00 | 2021-04-08 10:46:48.955465+00 | 1 | 352591053 | 352812908 | 200000 | 500 | 1 | 1 | {}
2823 | 2021-04-08 10:54:01.841729+00 | 2021-04-08 10:55:22.627538+00 | 2021-04-08 10:54:01.854955+00 | 2021-04-08 10:55:22.626391+00 | 1 | 353464105 | 353689234 | 200000 | 500 | 1 | 1 | {}
2877 | 2021-04-08 13:03:01.854084+00 | 2021-04-08 13:04:22.636241+00 | 2021-04-08 13:03:01.863688+00 | 2021-04-08 13:04:22.635051+00 | 1 | 365333710 | 365565363 | 200000 | 500 | 1 | 1 | {}
3598 | 2021-04-09 15:11:02.064962+00 | 2021-04-09 15:11:02.308102+00 | 2021-04-09 15:11:02.307566+00 | | 1 | 521395995 | 521609060 | 200000 | 500 | 1 | 1 | {}
-- Current Active job
4286 | 2021-04-10 14:25:02.058946+00 | 2021-04-10 14:25:02.082803+00 | 2021-04-10 14:25:02.082081+00 | | 1 | 669388657 | 669705038 | 300000 | 500 | 1 | 1 | {}
We can see two problems above:
- 10 jobs were never updated - they all have still a running status (
status == 1
) - 2 out of those 10 jobs do not have a
finished_at
set, which is explicitly set in the ensure clause ofBatchedMigrationWrapper
- All of them have no metrics set, which means that they should have failed while the job was performing the batch.
One plausible explanation I can think of, is that finish_tracking_execution failed to save the record.
But that (or any other similar error) should have resulted in an error recorded in the slow logs. Shouldn't those errors go all the way up to the cronjob?
Adding screenshot for reference after 7 days pass and old logs are not available to us (Failed jobs):