As it is now, we are mostly blind to the state of project mirroring. One consequence is that we are not alerting on project mirroring delay.
One way to improve this is to have the Sidekiq workers report the relevant metrics, since they have all the information needed to calculate them. This would also keep the instrumentation next to the implementation.
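To make that concrete, here is a rough sketch of what the worker-side instrumentation could look like. The metric name and the accessor for `project_mirror_data.next_execution_timestamp` are placeholders, and it assumes the `Gitlab::Metrics.histogram` helper; the real `RepositoryUpdateMirrorWorker` would only need the `observe` call added to its existing `perform`.

```ruby
# Illustrative sketch only: metric name and model accessor are assumptions,
# not the actual implementation.
class RepositoryUpdateMirrorWorker
  include ApplicationWorker

  def self.start_delay_histogram
    @start_delay_histogram ||= ::Gitlab::Metrics.histogram(
      :repository_mirror_start_delay_seconds, # hypothetical metric name
      'Delay between the scheduled mirror update and the worker starting'
    )
  end

  def perform(project_id)
    project = Project.find(project_id)
    # Accessor name assumed; this is the project_mirror_data record.
    scheduled_at = project.import_state&.next_execution_timestamp

    if scheduled_at
      delay = [Time.current - scheduled_at, 0].max
      self.class.start_delay_histogram.observe({}, delay.to_f)
    end

    # ... existing mirror update logic ...
  end
end
```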
@tlinz I would say backend-weight2. I believe we should already have the metric instrumentation in place. We need to identify the necessary metrics and report them.
@vyaklushin since you explicitly wrote "backend-weight", I wonder whether other teams also have work related to this issue that we need to consider in planning?
IMO I would estimate this at a base weight of 2-3. I'm not sure we need input from other teams, since the Sidekiq workers have all the information necessary. What are your thoughts, @vyaklushin?
@vyaklushin I was thinking that we should define SLIs comparing the time a mirror is expected to run (`project_mirror_data.next_execution_timestamp`) with when it actually starts in `RepositoryUpdateMirrorWorker`. Then we could define what counts as an acceptable delay (1 minute?).
If we pour that into an application SLI that we add to the Sidekiq service, we'll have actual alerts for when mirrors start too slowly or stop executing entirely, meaning we'd know before we get reports from users. It would also count towards the error budget for ~"group::source code".
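Roughly what I have in mind, assuming the `Gitlab::Metrics::Sli::Apdex` interface from the application SLI docs; the SLI name, labels, and the 60-second threshold below are illustrative, not a final proposal:

```ruby
# Illustrative sketch, assuming the Gitlab::Metrics::Sli::Apdex interface;
# SLI name, labels, and threshold are placeholders.

# Registered once during metrics initialization:
Gitlab::Metrics::Sli::Apdex.initialize_sli(
  :project_mirror_start,
  [{ feature_category: :source_code_management }]
)

# In RepositoryUpdateMirrorWorker, once we know when the mirror actually started:
delay = Time.current - mirror_data.next_execution_timestamp # accessor name assumed

Gitlab::Metrics::Sli::Apdex[:project_mirror_start].increment(
  labels: { feature_category: :source_code_management },
  success: delay <= 60 # the "acceptable delay" (1 minute?) discussed above
)
```

If I understand the SLI framework correctly, once the SLI is wired into the Sidekiq service's SLO configuration, the alerting and error-budget attribution should mostly come for free.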
While we're on that, we could potentially update the dashboard linked in the description. As far as I can tell it needs:

- Updated queries in gitlab-exporter, based on the outcome of #216783 (closed)
- Fixed queries for job operation rates (perhaps these can be replaced with the new SLI we're adding)
I've discovered that we are already collecting metrics that measure the difference between:
- the scheduled and started statuses (waiting delay): `gitlab_repository_mirror_waiting_duration_seconds_bucket`
- the started and finished statuses (update duration): `gitlab_repository_mirror_update_duration_seconds_bucket`
This data should allow us to build dashboards and alert when the waiting delay or update duration goes over the limit. The current bucket values are suboptimal, so I've opened a merge request to adjust them: !93018 (merged)
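For anyone curious what "adjusting them" means in practice: it's a matter of picking boundaries that bracket the delays we actually care about, so quantile/threshold queries over the `_bucket` series don't collapse into one or two wide buckets. The values below are purely illustrative (not the ones in !93018), and assume the histograms are registered via `Gitlab::Metrics.histogram` (the actual definition may differ):

```ruby
# Purely illustrative bucket values, not the ones from !93018. The point is
# that boundaries need to sit around the delays we want to alert on (seconds
# to tens of minutes); otherwise most observations land in a single bucket
# and quantile estimates become meaningless.
WAITING_DURATION_BUCKETS = [1, 5, 10, 30, 60, 120, 300, 600, 1800, 3600].freeze

Gitlab::Metrics.histogram(
  :gitlab_repository_mirror_waiting_duration_seconds,
  'Time a mirror update spends between the scheduled and started states',
  {},
  WAITING_DURATION_BUCKETS
)
```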