Add covered experience SLIs for some user operations that can be used for alerting
This comes from the incident in gitlab-com/gl-infra/production#17158, where Sidekiq jobs were not completing within their target duration. However, Sidekiq is not included in our Service Level Availability, because it also runs a lot of jobs that aren't directly user-facing.
During the incident, users could notice the following symptoms:
- Pushes of updated code take a long time to be reflected in the UI (merge request or otherwise)
- Pipelines and jobs take a long time to start after a push
I think we can identify more processes like these to capture in SLIs, which could then be included in Error Budgets for Stage Groups and potentially in Service Level Availability.
If we add SLIs for these processes, we could use them to build meaningful alerts about things users care about. They are also easily attributable to feature categories that have an owner, so we know who to ask for help during incidents related to these SLIs.
We could potentially add these SLIs to our service availability calculation without having to add the entirety of Sidekiq. The SLIs could be attributed to multiple services (GitLab-shell + Git + Sidekiq).
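
To make the idea more concrete, here is a minimal Python sketch of how such an SLI could be evaluated. It is only an illustration under assumptions: the `Operation` record, the SLI name `push_to_pipeline_created`, the feature category, the 10-second target, and the 99% objective are all hypothetical, and this is not how our existing SLI tooling is implemented. It measures the end-to-end duration of a user operation, regardless of which services were involved, and computes the fraction that succeeded within the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One completed user-facing operation (hypothetical record shape)."""
    name: str              # hypothetical SLI name, e.g. "push_to_pipeline_created"
    feature_category: str  # owning feature category, used for attribution
    duration_s: float      # end-to-end duration in seconds, across all services
    failed: bool = False   # True if the operation errored out


def apdex(operations, target_s):
    """Return the fraction of operations that succeeded within target_s seconds.

    Returns None when there are no operations, so that an absence of traffic
    is not reported as a perfect score.
    """
    ops = list(operations)
    if not ops:
        return None
    satisfied = sum(1 for op in ops if not op.failed and op.duration_s <= target_s)
    return satisfied / len(ops)


def should_alert(ratio, slo):
    """Alert only when there is data and the ratio is below the objective."""
    return ratio is not None and ratio < slo


# Made-up sample data: two fast operations, one slow, one failed.
samples = [
    Operation("push_to_pipeline_created", "continuous_integration", 2.0),
    Operation("push_to_pipeline_created", "continuous_integration", 4.5),
    Operation("push_to_pipeline_created", "continuous_integration", 42.0),
    Operation("push_to_pipeline_created", "continuous_integration", 1.0, failed=True),
]

ratio = apdex(samples, target_s=10.0)
print(ratio, should_alert(ratio, slo=0.99))
```

Because each record carries a feature category rather than a service name, the same ratio could feed a stage group's error budget and an alert routed to the owning team, without pulling all of Sidekiq into the calculation.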
In gitlab-com/gl-infra/production-engineering#25401 (comment 1942771709) we've started defining SLIs with existing metrics for the same purpose. So far, we haven't added any that span multiple services, and we're adding them all ourselves. In this issue, we should discuss how we can shift this process further left, so stage groups can build SLIs like this themselves.