Enqueue NewMergeRequestWorker for broken MRs after Redis Sidekiq outage

Summary

Merge requests (MRs) created while Redis for Sidekiq is down remain broken indefinitely. MRs in this state have the following properties:

  • The MR page displays a warning flash message.
  • The diff is not viewable.
  • The pipeline status stays pending.
  • The MR cannot be merged.

This is because NewMergeRequestWorker was never enqueued for these MRs, resulting in (at least) the associated merge_request_diff not being created.

As a workaround, an affected MR can be closed, a new commit pushed to the branch, and a new MR opened, but this is not obvious to the user.

The user impact is not easily observed in error charts or error budgets because it produces 404 rather than 5xx status codes.

Impact

  • The issue affects all MRs created during a Sidekiq outage. Because there is no automatic recovery, these MRs remain broken until someone intervenes.
  • Diffs, pipeline state and ability to merge can all be affected.
  • A production incident in which Redis was unavailable for 30 minutes took several hours to fully recover from.

The issue manifests as 404s on the merge request diff_metadata and diffs_batch endpoints. The incidence declined very slowly over time, most likely because users applied the workaround rather than because the system recovered. It is visible in this screenshot:

[Screenshot: 404s on the diff_metadata and diffs_batch endpoints over time; image and source link not preserved]

Recommendation

During the related incident, the affected merge requests were identified by looking for those without merge_request_diffs, then the NewMergeRequestWorker job was enqueued for each. This resolved the issue.
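The manual recovery described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the script used during the incident: MergeRequestStub, its fields, and the enqueue callable are hypothetical stand-ins for the real MergeRequest model and for NewMergeRequestWorker, whose exact arguments are assumed here to be an MR ID and a user ID.

```ruby
# Hypothetical stand-in for a merge request record. In the real application
# this would be a database-backed model; the field names are assumptions.
MergeRequestStub = Struct.new(:id, :author_id, :diff_ids, keyword_init: true)

# Select the MRs that have no associated merge_request_diffs.
def merge_requests_without_diffs(merge_requests)
  merge_requests.select { |mr| mr.diff_ids.empty? }
end

# Enqueue the recovery job for every MR without a diff and return the IDs
# that were enqueued. `enqueue` is any callable taking (mr_id, user_id); it
# stands in for enqueuing NewMergeRequestWorker (argument list assumed).
def reenqueue_missing_diffs(merge_requests, enqueue)
  merge_requests_without_diffs(merge_requests).map do |mr|
    enqueue.call(mr.id, mr.author_id)
    mr.id
  end
end
```

Passing the enqueue step in as a callable keeps the sketch runnable outside the application and makes it easy to dry-run the selection before any job is enqueued.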

Automatically re-enqueuing this job would allow the system to recover from a Sidekiq outage without manual intervention. This could be done as follows:

  1. Record the job ID on the MR.
  2. Upon 404 on the diff_metadata or diffs_batch endpoints, check that the job exists and has not completed.
  3. Enqueue the job again if required.
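The steps above can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: JobRegistry is a hypothetical in-memory stand-in for whatever mechanism would track job status, MergeRequestRecord and handle_diff_not_found are invented names, and "if required" is interpreted here as "the recorded job is missing or is no longer pending".

```ruby
# Hypothetical in-memory job tracker. A real system would need some durable
# way to look up a job's status by ID; this class only simulates that.
class JobRegistry
  def initialize
    @statuses = {}
    @next_id = 0
  end

  # Simulate enqueuing the worker for an MR; returns the new job ID.
  def enqueue(_merge_request_id)
    @next_id += 1
    jid = "jid-#{@next_id}"
    @statuses[jid] = :queued
    jid
  end

  # Returns :queued, :running, :completed, or nil when the job is unknown.
  def status(jid)
    @statuses[jid]
  end

  # Test helpers: mark a job completed, or simulate a job lost in an outage.
  def complete(jid)
    @statuses[jid] = :completed
  end

  def forget(jid)
    @statuses.delete(jid)
  end
end

# Step 1: the MR records the ID of its job (nil if enqueuing failed).
MergeRequestRecord = Struct.new(:id, :job_id, keyword_init: true)

# Steps 2 and 3: called when a diff endpoint would return 404.
# If the recorded job is still pending, do nothing; otherwise enqueue again
# and record the new job ID on the MR.
def handle_diff_not_found(merge_request, registry)
  status = merge_request.job_id && registry.status(merge_request.job_id)
  return :pending if %i[queued running].include?(status)

  merge_request.job_id = registry.enqueue(merge_request.id)
  :reenqueued
end
```

Treating a completed job that still yields a 404 as a reason to enqueue again is a design choice in this sketch; a real implementation would probably want a retry limit so that a permanently failing MR does not enqueue a job on every request.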

Implementing a generic mechanism similar to Sidekiq Pro's Pro Reliability Client would also help to solve this for other use cases.

Verification

Successful implementation would be observable on this chart during Sidekiq downtime. A separate index could be used for a non-production environment, where downtime could be simulated.

Alternatively, MRs that are stuck in this state are easily identifiable by the error messages, the lack of pipeline status, and the inability to view code or merge. To reproduce the state, delete an MR's associated diff; with the fix in place, loading the MR page should then cause the MR to recover.