Skip to content

Fix BG migration causes many unprocessed jobs due to the retry strategy with concurrency limit

What does this MR do?

It seems the BackgroundMigrationWorker has a bug that its retry mechanism is not working as expected. The retried jobs are competing against the scheduled jobs and keep losing in the exclusive locking war. As the consequence, the job will be given up, regardless of the DB health.

To illustrate this bug:

If everything goes normal:

00:00 - Job-1 is executed
00:02 - Job-2 is executed
00:04 - Job-3 is executed
00:06 - Job-4 is executed
00:08 - Job-5 is executed

If DB was not healthy one time:

00:00 - Job-1 is rescheduled to run after 2 min because DB was unhealthy momentarily 
00:02 - Job-2 is executed and Job-1 is retried. Job-2 won in the exclusive locking and Job-1 is rescheduled to run after 2 min.
00:04 - Job-3 is executed and Job-1 is retried. Job-3 won in the exclusive locking and Job-1 is rescheduled to run after 2 min.
00:06 - Job-4 is executed and Job-1 is retried. Job-4 won in the exclusive locking and Job-1 is rescheduled to run after 2 min.
00:08 - Job-5 is executed and Job-1 is retried. Job-5 won in the exclusive locking and Job-1 is given up.

Proposal

In order to rescue the temporary retries, we have to allow concurrent executions up to 2 jobs. We switch to ApplicationRateLimiter for this purpose.

Background

I noticed this potential bug while checking the pending BG migration job in a cleanup MR. The BG migration for DeleteOrphanedDeployments has already finished yesterday:

2021-08-23_19-50

https://log.gprd.gitlab.net/goto/c909cb473cf3b1ed9dde40e14fb20e0d

But checking the BG migration job records, there are 1122 pending jobs. These batches were not processed properly by some reason.

[ gprd ] production> Gitlab::Database::BackgroundMigrationJob.where(class_name: 'DeleteOrphanedDeployments').pending.count
=> 1122

At first, I thought that it's hitting an exception or database statement timeout. But it seems these jobs have been executed without errors.

By looking at the lease_attempts argument for the BackgroundMigrationWorker#perform, we can see that some jobs are retried many times and given up after 4 retry attempts.

2021-08-23_19-57

https://log.gprd.gitlab.net/goto/f453ff8eebff054f081a4b8363a4ece1

(FYI, this problem actually appears on the other BG migrations as well. See https://log.gprd.gitlab.net/goto/1da3b68f04b78548d09f74749d803eb6)

This is weird, because these migration jobs do not take more than 15 seconds to complete the whole process. If it takes more than 2 minutes, there is a possibility to get the queues snowballed, but this can't be the case.

2021-08-23_20-01

https://log.gprd.gitlab.net/goto/9ca4742c796428843901021dcc17c521

Screenshots or Screencasts (strongly suggested)

How to setup and validate locally (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Shinya Maeda

Merge request reports