Adapt SLA dashboard to include services with user-perceived impact
Current state
Our SLA dashboard currently includes the following services:
- git
- web
- registry
- api
- sidekiq
- ci-runners
These services affect our weighted availability score, which is a single number describing the average SLO attainment across all of these services.
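As a rough illustration of how a weighted score of this kind can be computed, the sketch below takes a weighted mean of per-service availability. The function name, weights, and availability figures are hypothetical and are not taken from the actual dashboard configuration.

```python
def weighted_availability(services):
    """Return the weighted mean of per-service availability.

    `services` maps a service name to an (availability, weight) pair,
    where availability is a fraction between 0 and 1.
    """
    total_weight = sum(weight for _, weight in services.values())
    if total_weight == 0:
        raise ValueError("at least one service must have a non-zero weight")
    weighted_sum = sum(avail * weight for avail, weight in services.values())
    return weighted_sum / total_weight


# Hypothetical availability figures and weights, for illustration only.
example_services = {
    "git": (0.9995, 5),
    "web": (0.9990, 5),
    "registry": (0.9992, 1),
    "api": (0.9993, 5),
    "sidekiq": (0.9800, 1),
    "ci-runners": (0.9900, 1),
}

score = weighted_availability(example_services)
print(f"weighted availability: {score:.4%}")
```

With a scheme like this, a dip in any included service lowers the single score in proportion to its weight, regardless of whether users noticed the dip.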
Problem description
We want to ensure that the dashboard shows a realistic view of platform availability. Some services have a direct impact on how available the platform appears to users.
For example, if the web service is experiencing issues, users notice it through slower page load times or failed requests. The same is true for git, registry and api.
This is not the case for the sidekiq and ci-runners services. These services handle async workloads, so problems there do affect users, but they do not reflect the system's availability.
For example, if a single sidekiq queue (out of the tens of queues we currently operate) is delayed by just a couple of minutes against a one-minute SLO for that queue, the SLA for the sidekiq service is affected, but users do not observe any availability issue. This does not hold for every sidekiq queue, but the vast majority of them have no impact on the underlying availability of GitLab.com.
This skews the perception of how available GitLab.com really is. We also have other tools at our disposal to ensure that we are alerted in time about the queues that are important.
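To make the sidekiq example above concrete, here is a minimal sketch of how one delayed queue can pull down a service-level figure. It assumes, purely for illustration, that the service figure is the fraction of latency samples across all queues that met their queue's target. The queue names, targets, and sample values are hypothetical, and the real dashboard may aggregate differently.

```python
def service_slo_attainment(queue_latencies, queue_targets):
    """Return the fraction of latency samples that met their queue's target.

    `queue_latencies` maps a queue name to a list of latency samples (seconds).
    `queue_targets` maps a queue name to its latency target (seconds).
    """
    met = 0
    total = 0
    for queue, samples in queue_latencies.items():
        target = queue_targets[queue]
        for latency in samples:
            total += 1
            if latency <= target:
                met += 1
    if total == 0:
        raise ValueError("no latency samples provided")
    return met / total


# Ten hypothetical queues, each with a 60-second target and 100 samples.
targets = {f"queue_{i}": 60.0 for i in range(10)}
latencies = {name: [5.0] * 100 for name in targets}

# One queue is delayed by about two minutes for 20 of its samples.
latencies["queue_9"] = [5.0] * 80 + [120.0] * 20

attainment = service_slo_attainment(latencies, targets)
print(f"sidekiq SLO attainment: {attainment:.2%}")
```

Under these assumptions the service-level number drops even though nine of the ten queues were healthy throughout and no user-facing request failed.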
Proposal
Remove the following services from the SLA dashboard:
- sidekiq
- ci-runners
Consider looking into other services that do have a direct impact on users, such as gitlab-pages. If pages are unavailable, users will immediately perceive it as an availability issue.
/cc @brentnewton @andrewn