Fix BG migration causes many unprocessed jobs due to the retry strategy with concurrency limit (!68763) · Merge requests · GitLab.org / GitLab

Shinya Maeda requested to merge fix-bg-migration-job-retaining-leason-too-much into master Aug 23, 2021

What does this MR do?

It seems the BackgroundMigrationWorker has a bug that its retry mechanism is not working as expected. The retried jobs are competing against the scheduled jobs and keep losing in the exclusive locking war. As the consequence, the job will be given up, regardless of the DB health.

To illustrate this bug:

If everything goes normal:

00:00 - Job-1 is executed
00:02 - Job-2 is executed
00:04 - Job-3 is executed
00:06 - Job-4 is executed
00:08 - Job-5 is executed

If DB was not healthy one time:

00:00 - Job-1 is rescheduled to run after 2 min because DB was unhealthy momentarily 
00:02 - Job-2 is executed and Job-1 is retried. Job-2 won in the exclusive locking and Job-1 is rescheduled to run after 2 min.
00:04 - Job-3 is executed and Job-1 is retried. Job-3 won in the exclusive locking and Job-1 is rescheduled to run after 2 min.
00:06 - Job-4 is executed and Job-1 is retried. Job-4 won in the exclusive locking and Job-1 is rescheduled to run after 2 min.
00:08 - Job-5 is executed and Job-1 is retried. Job-5 won in the exclusive locking and Job-1 is given up.

Proposal

In order to rescue the temporary retries, we have to allow concurrent executions up to 2 jobs. We switch to ApplicationRateLimiter for this purpose.

Background

I noticed this potential bug while checking the pending BG migration job in a cleanup MR. The BG migration for DeleteOrphanedDeployments has already finished yesterday:

https://log.gprd.gitlab.net/goto/c909cb473cf3b1ed9dde40e14fb20e0d

But checking the BG migration job records, there are 1122 pending jobs. These batches were not processed properly by some reason.

[ gprd ] production> Gitlab::Database::BackgroundMigrationJob.where(class_name: 'DeleteOrphanedDeployments').pending.count
=> 1122

At first, I thought that it's hitting an exception or database statement timeout. But it seems these jobs have been executed without errors.

By looking at the lease_attempts argument for the BackgroundMigrationWorker#perform, we can see that some jobs are retried many times and given up after 4 retry attempts.

https://log.gprd.gitlab.net/goto/f453ff8eebff054f081a4b8363a4ece1

(FYI, this problem actually appears on the other BG migrations as well. See https://log.gprd.gitlab.net/goto/1da3b68f04b78548d09f74749d803eb6)

This is weird, because these migration jobs do not take more than 15 seconds to complete the whole process. If it takes more than 2 minutes, there is a possibility to get the queues snowballed, but this can't be the case.

https://log.gprd.gitlab.net/goto/9ca4742c796428843901021dcc17c521

Screenshots or Screencasts (strongly suggested)

How to setup and validate locally (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
I have added/updated documentation, or it's not needed. (Is documentation required?)
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
I have self-reviewed this MR per code review guidelines.
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have followed the style guides.
This change is backwards compatible across updates, or this does not apply.

Availability and Testing

I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
I have tested this MR in all supported browsers, or it's not needed.
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

Label as security and @ mention @gitlab-com/gl-security/appsec
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
Security reports checked/validated by a reviewer from the AppSec team

Edited Sep 07, 2021 by Shinya Maeda

Fix BG migration causes many unprocessed jobs due to the retry strategy with concurrency limit