Merge trains can get stuck on unexpected errors
Summary
When unexpected errors occur in MergeTrains::RefreshMergeRequestService, the merge train stalls in a Sidekiq job with no feedback for users. See our Sentry (internal link) for examples of errors that are happening there.
See also this Zendesk ticket (internal link).
Possible fixes
- Handle the known exceptions from our Sentry (internal link) in merge trains. These are all examples of specific code that needs to be more robust.
- Make merge trains more robust to unexpected errors: rescue `StandardError` in `MergeTrains::RefreshMergeRequestService`. If the parent refresh worker is on its last retry, capture and log the exception (internally); otherwise, re-raise the error (so we get retries). Show a system note with a generic "Internal error" message along with the correlation ID, e.g. "Merge request removed from the train due to an internal error (correlation ID ABCD)".
- This needs to be done carefully. For example, what should happen if there is a statement timeout when marking a train car as merged?
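The rescue-and-retry behavior described above could be sketched roughly as follows. This is not GitLab's actual code: the `last_retry:` flag, `refresh!` step, and `notes` collection are simplified stand-ins for the worker's retry bookkeeping and the real system-note machinery.

```ruby
# Minimal sketch (assumptions noted above): rescue StandardError,
# re-raise while retries remain, and only log / leave a system note
# on the final retry.
module MergeTrains
  class RefreshMergeRequestService
    attr_reader :notes

    def initialize(last_retry:, correlation_id:)
      @last_retry = last_retry         # would come from Sidekiq retry count
      @correlation_id = correlation_id
      @notes = []                      # stand-in for real system notes
    end

    def execute
      refresh! # hypothetical refresh step that may raise unexpectedly
    rescue StandardError
      raise unless @last_retry # re-raise so Sidekiq retries the job

      # Final retry: give the user feedback instead of stalling silently.
      @notes << "Merge request removed from the train due to an " \
                "internal error (correlation ID #{@correlation_id})"
    end

    private

    def refresh!
      raise "boom" # simulate an unexpected failure
    end
  end
end
```

On earlier attempts the error propagates, so Sidekiq's retry/backoff still applies; only the exhausted job swallows the exception and posts the note.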