DataIntegrityFailure error messages are obfuscating real exceptions
The meta-problem
The rescue-drop-re-raise structure below is a problem when an error happens during execution of the service: the drop! call can itself raise an error about an invalid status transition, which prevents the service from ever re-raising the initial error into the logs.
class CreateDownstreamPipelineService < ::BaseService
  def execute(bridge)
    ...
  rescue StandardError => e
    # drop! can raise an invalid-transition error here, in which case `raise e` is never reached
    @bridge.reset.drop!(:data_integrity_failure)
    raise e
  end
end
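The masking effect can be reproduced outside of GitLab with a small, self-contained sketch. Everything here is a hypothetical stand-in rather than the real code: FakeBridge, InvalidTransition, and the top-level execute method only model the assumption that a job which has already failed cannot be dropped a second time.

```ruby
# Hypothetical stand-in for the state machine's invalid-transition error.
class InvalidTransition < StandardError; end

# Hypothetical stand-in for a bridge job; it only models a status and drop!.
class FakeBridge
  attr_reader :status

  def initialize(status)
    @status = status
  end

  # Assumption: a job that has already failed cannot be dropped again.
  def drop!(reason)
    raise InvalidTransition, "cannot drop a failed job (reason: #{reason})" if @status == :failed

    @status = :failed
  end
end

# Same rescue-drop-re-raise shape as the service, reduced to the essentials.
def execute(bridge)
  raise ArgumentError, "original failure"
rescue StandardError => e
  bridge.drop!(:data_integrity_failure) # may raise, which skips the next line
  raise e
end

# Returns whichever error escapes execute, so the two cases can be compared.
def surfaced_error(bridge)
  execute(bridge)
rescue StandardError => e
  e
end

first_attempt = surfaced_error(FakeBridge.new(:running))
retry_attempt = surfaced_error(FakeBridge.new(:failed))

puts first_attempt.class       # ArgumentError
puts retry_attempt.class       # InvalidTransition
puts retry_attempt.cause.class # ArgumentError
```

Ruby still records the original exception as the cause of the new one, but only the invalid transition is raised to the caller, so that is the error that shows up when only the top-level exception is reported.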
During incident gitlab-com/gl-infra/production#17885 (closed), 99.9% of the errors being raised were invalid status transitions. That is unhelpful for investigating the true cause of the failure: the job is left in a state that cannot transition via drop! anyway, and the original problem is never reported.
Which is really caused by:
Implementation Guide
See this comment.
Concerns
1. Will this leave jobs in running states sometimes? No. If the job is running, the job will fully retry as it does today. We'll only stop the retry cycle if we can confirm that the job is in a failed state.
2. Will this prevent errors from surfacing? No. This change only affects cases where an exception has already been raised and has caused Sidekiq to retry the job. By exiting gracefully when the job can no longer be dropped, we allow the first error to make it into the logs but stop before we start raising redundant, unhelpful ones.
3. Will this actually improve the error rate? Also no. But it will be easier for devopsverify to figure out what's going on when downstream pipeline creation fails.
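The behaviour described in the three answers above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the linked comment: FakeBridge, InvalidTransition, guarded_execute, and simulate_retries are hypothetical names, the failed? check stands in for however the real service confirms the job state, and the loop only imitates Sidekiq re-running the job.

```ruby
# Hypothetical stand-in for the state machine's invalid-transition error.
class InvalidTransition < StandardError; end

# Hypothetical stand-in for a bridge job.
class FakeBridge
  attr_reader :status

  def initialize(status)
    @status = status
  end

  def failed?
    @status == :failed
  end

  # Assumption: a job that has already failed cannot be dropped again.
  def drop!(reason)
    raise InvalidTransition, "cannot drop a failed job (reason: #{reason})" if failed?

    @status = :failed
  end
end

# Guarded variant of rescue-drop-re-raise.
def guarded_execute(bridge)
  yield
rescue StandardError => e
  # The job is confirmed failed: nothing is left to drop, so exit quietly
  # and let the retry cycle end instead of raising a redundant error.
  return :already_failed if bridge.failed?

  # Otherwise behave as today: drop the job and surface the original error.
  bridge.drop!(:data_integrity_failure)
  raise e
end

# Imitates a worker being retried; collects the error class raised per attempt.
def simulate_retries(attempts)
  bridge = FakeBridge.new(:running)
  errors = []
  attempts.times do
    begin
      guarded_execute(bridge) { raise ArgumentError, "original failure" }
    rescue StandardError => e
      errors << e.class
    end
  end
  errors
end

p simulate_retries(3) # [ArgumentError]
```

In this model the first attempt drops the job and raises the original error, and the later attempts return without raising, so the original failure is the only error recorded.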
Examples
- CI pipeline with parent-child setup throws data... (#424158 - closed)
- Add more if you find them!

