Report availability per service and overall GitLab availability.
Problem statement
Today, we calculate availability for five key services and report that externally. This proposal is not suggesting changing that in any way.
What I am suggesting is that we calculate an availability score for every service that we have, and include an overall GitLab.com availability score on the internal SLA page. This would allow us to determine which services have the highest and lowest availability, and to report internally on overall availability. Reporting these numbers would let us track and work to improve them, which would reduce the number of customer problems and ease the lives of those on call.
Today's external availability
We presently calculate an SLO observation status for every service that we have designated as contractually obligated. This is a boolean metric that says whether or not we are meeting the SLO for that service at that moment in time.
This is created here -- https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules-jsonnet/service-slo-observance.jsonnet?ref_type=heads#L63-76
And makes a per service file like this: https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules/gitlab-gprd/api/autogenerated-gitlab-gprd-api-service-slo-observance.yml?ref_type=heads
We use those metrics to calculate an overall availability score in https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules/gitlab-gprd/autogenerated-gitlab-gprd-sla-rules.yml?ref_type=heads (jsonnet is in https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules-jsonnet/sla-rules.jsonnet?ref_type=heads).
We use that metric to calculate the availability over a given time period.
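To make that last step concrete, here is a minimal sketch of turning a boolean SLO observation status into an availability figure. It assumes the status is sampled at a fixed interval and that availability is simply the fraction of samples in the window where the SLO was met; the function name and the sample data are hypothetical, and the real calculation lives in the rules linked above.

```python
from typing import Sequence


def availability(samples: Sequence[int]) -> float:
    """Return the fraction of samples in which the SLO was met.

    Each sample is 1 (SLO met) or 0 (SLO violated). Assumes samples are
    evenly spaced, so every sample covers the same amount of time.
    """
    if not samples:
        raise ValueError("no samples in the window")
    met = sum(1 for s in samples if s)
    return met / len(samples)


# Hypothetical window: ten one-minute samples, two of them violating the SLO.
window = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(f"{availability(window):.1%}")  # 80.0%
```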
Proposal
Let's start calculating availability per service and reporting it on the service overview dashboard.
This would involve modifying https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules-jsonnet/service-slo-observance.jsonnet?ref_type=heads#L63-76 to run for all services. The level of effort for that is minimal enough that I opened gitlab-com/runbooks!8485 (closed) to do so.
We would also need to determine some basic contractual thresholds for everything, but that could be set as a default to start with based on what we have today.
Once we have that calculated, we would use a method similar to https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules-jsonnet/sla-rules.jsonnet?ref_type=heads to calculate availability per service, and then report that on the overview page (update this section -- https://gitlab.com/gitlab-com/runbooks/-/blob/master/libsonnet/gitlab-dashboards/service_dashboard.libsonnet?ref_type=heads#L66).
The new panels at the top would look something like this:
We would also need to create a page that calculates GitLab overall availability, which would be a fork of https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules-jsonnet/sla-rules.jsonnet?ref_type=heads.
At that point, we would be able to see availability per service and overall availability in addition to externally reported availability.
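As a rough illustration of how an overall figure could be rolled up from the per-service numbers, the sketch below takes a weighted average across services. The weighting scheme, the function name, and the example values are all assumptions made for illustration; how services should actually be weighted is something we would need to decide.

```python
from typing import Dict, Optional


def overall_availability(
    per_service: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of per-service availability (each value in 0.0..1.0).

    Services missing from `weights` default to a weight of 1.0, so calling
    this without weights gives a plain average. The weights are hypothetical.
    """
    if not per_service:
        raise ValueError("no services given")
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in per_service)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(
        value * weights.get(name, 1.0) for name, value in per_service.items()
    )
    return weighted_sum / total_weight


# Hypothetical per-service availability over the same window.
services = {"web": 0.999, "api": 0.998, "patroni-ci": 0.990}
print(f"unweighted: {overall_availability(services):.4f}")
print(f"weighted:   {overall_availability(services, {'web': 3.0}):.4f}")
```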
Options from there
This could be included on error budget pages, be a way to prioritize what systems we want to help stabilize, and even more importantly, be an indicator to help troubleshoot.
The latter is useful enough that it's worth explaining more.
Let's say the EOC gets a page that the apdex for web has dropped. There are a multitude of services that could cause that to happen (web bugs, database issues, sidekiq issues, others). Having full-page visibility into current availability could point the EOC to a specific cause (maybe patroni-ci's availability has dropped?) with less troubleshooting.
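The kind of view I have in mind could be sketched as a simple ranking: list every service whose current availability is below a threshold, worst first. The threshold, the function name, and the numbers here are hypothetical and only meant to show the idea.

```python
from typing import Dict, List, Tuple


def services_below_threshold(
    current: Dict[str, float],
    threshold: float,
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Return up to `limit` services below `threshold`, lowest availability first."""
    below = [(name, value) for name, value in current.items() if value < threshold]
    return sorted(below, key=lambda item: item[1])[:limit]


# Hypothetical snapshot taken when the web apdex page fires.
snapshot = {"web": 0.97, "api": 0.9995, "sidekiq": 0.999, "patroni-ci": 0.91}
for name, value in services_below_threshold(snapshot, threshold=0.995):
    print(f"{name}: {value:.2%}")
```

In this made-up snapshot, patroni-ci would be listed ahead of web, which is the hint the EOC would want to see first.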
Level of effort
For someone familiar with this codebase, I suspect this is less than a week's worth of work.