MergeRequestMergeabilityCheckWorker production readiness improvements
Recently on GitLab.com, several production alerts have fired for Gitaly nodes, for the UserMergeToRef
GRPC call violating its SLO.
Examples:
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5228#note_636178252
- gitlab-com/gl-infra/production#5225 (closed)
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5193
- gitlab-com/gl-infra/production#5225 (closed)
- gitlab-com/gl-infra/production#5079 (closed)
These apdex violations appear to be related to long-running MergeRequestMergeabilityCheckWorker
calls, but this can be difficult to diagnose for several reasons:
-
✅ TheMergeRequestMergeabilityCheckWorker
will succeed (and log the error) even when theMergeabilityCheckService
service check fails.- Is there a reason this sidekiq job doesn't fail instead. Setting retries to 0 and failing would make this easier to understand in production.
- Dedicated issue: #337449 (closed)
-
✅ TheMergeRequestMergeabilityCheckWorker
uses unstructured logging which is almost impossible to investigate at the volumes we see on GitLab.com.- Switching to structured logging, or passing the error back via Sidekiq would be preferable.
- Dedicated issue: #337351 (closed)
- The
MergeRequestMergeabilityCheckWorker
does not appear to cache consistent failures from theUserMergeToRef
call.- These calls can be very expensive, running for 30 seconds or more before failing.
- Each time a user visits the merge request page, rails will issue another
MergeRequestMergeabilityCheckWorker
job with the same outcome. - Would it be possible to cache the fact that merge check is not succeeding and skip it if it is executed for the source and target SHAs?
- Dedicated issue: #337450 (closed)
- Would it be possible to improve the performance of the backend GRPC call to
UserMergeToRef
for huge merge requests?- These can take up to a minute to execute. I imagine that the Mergeability Check is a boolean response: would it be possible to shortcut the full merge and just respond with the result?
- Investigation issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13855
Edited by Matt Nohr