MergeRequestMergeabilityCheckWorker production readiness improvements

Recently on GitLab.com, several production alerts have fired for Gitaly nodes, for the UserMergeToRef GRPC call violating its SLO.

Examples:

These apdex violations appear to be related to long-running MergeRequestMergeabilityCheckWorker calls, but this can be difficult to diagnose for several reasons:

✅ The MergeRequestMergeabilityCheckWorker will succeed (and log the error) even when the MergeabilityCheckService service check fails.
1. Is there a reason this sidekiq job doesn't fail instead. Setting retries to 0 and failing would make this easier to understand in production.
2. Dedicated issue: #337449 (closed)
✅ The MergeRequestMergeabilityCheckWorker uses unstructured logging which is almost impossible to investigate at the volumes we see on GitLab.com.
1. Switching to structured logging, or passing the error back via Sidekiq would be preferable.
2. Dedicated issue: #337351 (closed)
The MergeRequestMergeabilityCheckWorker does not appear to cache consistent failures from the UserMergeToRef call.
1. These calls can be very expensive, running for 30 seconds or more before failing.
2. Each time a user visits the merge request page, rails will issue another MergeRequestMergeabilityCheckWorker job with the same outcome.
3. Would it be possible to cache the fact that merge check is not succeeding and skip it if it is executed for the source and target SHAs?
4. Dedicated issue: #337450 (closed)
Would it be possible to improve the performance of the backend GRPC call to UserMergeToRef for huge merge requests?
1. These can take up to a minute to execute. I imagine that the Mergeability Check is a boolean response: would it be possible to shortcut the full merge and just respond with the result?
2. Investigation issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13855

Edited Aug 05, 2021 by Matt Nohr