Batched Background Migration Job records not being updated

We had 5 failed runs for the BatchedBackgroundMigrationWorker (out of 1080).

All of the failed runs were caused by statement timeouts on the select query that finds the upper bound (stop) of the batch in each_batch.

Those errors are very sparse and not much of a concern as they happen while the BatchedMigrationRunner tries to find the next batch range and before it creates a BatchedJob, so the next execution is not affected.

Issue

But while I was checking this, I run a query over the batched_background_migration_jobs:

gitlabhq_production=> SELECT status, COUNT(*) FROM batched_background_migration_jobs GROUP BY status;
 status | count
--------+-------
      1 |    11
      3 |  4275

gitlabhq_production=> SELECT * FROM batched_background_migration_jobs WHERE status = 1;
  id  |          created_at           |          updated_at           |          started_at           |          finished_at          | batched_background_migration_id | min_value | max_value | batch_size | sub_batch_size | status | attempts | metrics
------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+---------------------------------+-----------+-----------+------------+----------------+--------+----------+---------

-- 10 NOT updated jobs - stuck at running

  685 | 2021-04-05 09:50:02.00755+00  | 2021-04-05 09:50:35.329187+00 | 2021-04-05 09:50:02.014176+00 | 2021-04-05 09:50:35.328527+00 |                               1 | 116151373 | 116208906 |      50000 |            200 |      1 |        1 | {}
 2223 | 2021-04-07 13:37:04.390141+00 | 2021-04-07 13:37:26.958784+00 | 2021-04-07 13:37:04.398889+00 | 2021-04-07 13:37:26.958107+00 |                               1 | 220408701 | 220630870 |     200000 |            500 |      1 |        1 | {}
 2300 | 2021-04-07 16:57:01.971842+00 | 2021-04-07 16:57:26.540344+00 | 2021-04-07 16:57:01.981515+00 | 2021-04-07 16:57:26.539665+00 |                               1 | 237600007 | 237824454 |     200000 |            500 |      1 |        1 | {}
 2394 | 2021-04-07 20:14:02.044816+00 | 2021-04-07 20:14:34.320387+00 | 2021-04-07 20:14:02.055119+00 | 2021-04-07 20:14:34.319405+00 |                               1 | 258675277 | 258899689 |     200000 |            500 |      1 |        1 | {}
 2561 | 2021-04-08 01:52:02.332268+00 | 2021-04-08 01:53:09.779131+00 | 2021-04-08 01:52:02.337932+00 | 2021-04-08 01:53:09.778284+00 |                               1 | 295972238 | 296189933 |     200000 |            500 |      1 |        1 | {}
 2772 | 2021-04-08 09:00:01.608604+00 | 2021-04-08 09:00:01.620031+00 | 2021-04-08 09:00:01.619302+00 |                               |                               1 | 342298978 | 342524610 |     200000 |            500 |      1 |        1 | {}
 2819 | 2021-04-08 10:45:16.084115+00 | 2021-04-08 10:46:48.956071+00 | 2021-04-08 10:45:16.090629+00 | 2021-04-08 10:46:48.955465+00 |                               1 | 352591053 | 352812908 |     200000 |            500 |      1 |        1 | {}
 2823 | 2021-04-08 10:54:01.841729+00 | 2021-04-08 10:55:22.627538+00 | 2021-04-08 10:54:01.854955+00 | 2021-04-08 10:55:22.626391+00 |                               1 | 353464105 | 353689234 |     200000 |            500 |      1 |        1 | {}
 2877 | 2021-04-08 13:03:01.854084+00 | 2021-04-08 13:04:22.636241+00 | 2021-04-08 13:03:01.863688+00 | 2021-04-08 13:04:22.635051+00 |                               1 | 365333710 | 365565363 |     200000 |            500 |      1 |        1 | {}
 3598 | 2021-04-09 15:11:02.064962+00 | 2021-04-09 15:11:02.308102+00 | 2021-04-09 15:11:02.307566+00 |                               |                               1 | 521395995 | 521609060 |     200000 |            500 |      1 |        1 | {}

-- Current Active job

 4286 | 2021-04-10 14:25:02.058946+00 | 2021-04-10 14:25:02.082803+00 | 2021-04-10 14:25:02.082081+00 |                               |                               1 | 669388657 | 669705038 |     300000 |            500 |      1 |        1 | {}

We can see two problems above:

10 jobs were never updated - they all have still a running status (status == 1)
2 out of those 10 jobs do not have a finished_at set, which is explicitly set in the ensure clause of BatchedMigrationWrapper
All of them have no metrics set, which means that they should have failed while the job was performing the batch.

One plausible explanation I can think of, is that finish_tracking_execution failed to save the record.

But that (or any other similar error) should have resulted in an error recorded in the slow logs. Shouldn't those errors go all the way up to the cronjob?

Adding screenshot for reference after 7 days pass and old logs are not available to us (Failed jobs):

Edited Apr 10, 2021 by Yannis Roussos