SLIs and/or alerting in Sharding Slack channel for loose foreign key processing
The following discussion from gitlab-com/runbooks!4542 (merged) should be addressed:
- @reprazent started a discussion: (+2 comments)
Not for now, but it would be cool if we expressed these as a separate SLI per database shard:
- An error SLI using a counter for the total number of objects to be processed, a counter for the number of errors.
- An Apdex SLI using a success counter and a total counter.
This way, we'd get alerts and graphs on the relevant service dashboards for the database shard (in the first case, patroni-ci).
We could use an Application SLI to define the metrics. WDYT?
But I saw an operation rate and Sidekiq jobs, so I thought that this kind of functionality is probably important enough to express as SLIs:
- How fast are we deleting or updating cascading rows (apdex)
- How many failures do we have while updating/deleting rows (errors)
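To make the two proposed SLIs concrete, here is a minimal sketch of how an error ratio and an apdex score could be derived from the four counters mentioned above. It is an illustration only, not the Application SLI framework itself: the function names, the shard labels (`ci`, `main`) and the counter values are all hypothetical.

```python
from typing import Dict, Optional


def counter_increase(before: int, after: int) -> int:
    """Increase of a monotonic counter between two scrapes (never negative)."""
    return max(after - before, 0)


def error_ratio(errors: int, total: int) -> Optional[float]:
    """Error SLI: failed objects divided by all objects processed."""
    return errors / total if total else None


def apdex(success: int, total: int) -> Optional[float]:
    """Apdex SLI: operations within the threshold divided by all measured operations."""
    return success / total if total else None


# Hypothetical counter increases over one evaluation window, per database shard.
window: Dict[str, Dict[str, int]] = {
    "ci": {"total": 1000, "errors": 5, "apdex_total": 1000, "apdex_success": 990},
    "main": {"total": 0, "errors": 0, "apdex_total": 0, "apdex_success": 0},
}

for shard, c in window.items():
    print(
        shard,
        "error_ratio=", error_ratio(c["errors"], c["total"]),
        "apdex=", apdex(c["apdex_success"], c["apdex_total"]),
    )
```

Returning `None` when nothing was processed keeps an idle shard from being reported as either perfectly healthy or fully failing; how an empty window should be treated for alerting is a separate decision.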
If errors in the current architecture are already tracked by normal Sidekiq error tracking and we don't catch any errors in the worker, then we may be fine with what we have already.
The disadvantage here is that this worker runs on catchall, so it is mainly covered by the shard_catchall SLI. This is our biggest Sidekiq shard, which at peak often processes more than 1k jobs per second. If this job runs once per minute, it's not likely to make a dent in the SLI itself.
We do have per-worker monitoring, and the option of directing alerts to a group, for example generating alerts like this. The problem with these alerts is that even though the failures per worker can be frequent, they often recover with Sidekiq retries, and SREs don't have time to look into every single one.
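As a rough back-of-the-envelope check of why this worker barely registers in the shard_catchall SLI, the sketch below uses only the two figures quoted above (about 1k jobs per second at peak, one run of this job per minute). The function name is hypothetical.

```python
def worker_share(runs_per_minute: float, shard_jobs_per_second: float) -> float:
    """Fraction of a shard's jobs contributed by one worker."""
    shard_jobs_per_minute = shard_jobs_per_second * 60
    return runs_per_minute / shard_jobs_per_minute


# Figures from the discussion: ~1k jobs/s on catchall at peak, worker runs once per minute.
share = worker_share(runs_per_minute=1, shard_jobs_per_second=1_000)

# Even if every run of the worker failed, this is the most it could add
# to the shard-wide error ratio at peak load.
print(f"share of catchall jobs: {share:.6%}")
```

At peak that is one job in 60,000, so a shard-level SLI would not move even if the worker failed on every run.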
Perhaps a first step could be creating an alerts channel for Sharding and routing the alerts for these workers to that channel: https://gitlab.com/gitlab-com/runbooks/blob/5280eff7b915745cee9e4dd2847a31141876f406/docs/uncategorized/alert-routing.md#L5
The change to enable this would be roughly like this: