SLA for GitLab KAS service
Problem
We should resolve this before the full launch, even if we only define the bare minimum at first. Our on-call engineers need to know what counts as normal behavior and what counts as degraded behavior. Without SLAs, this service could remain down until a customer notifies us.
Proposal
Iteration 1
Error rates on the GetConfiguration() gRPC method: this is the main gRPC call initiated by the agent to KAS, and its error rate should be under 1%. Note: this metric is already available, so we just need to set up alerts for GitLab.com.
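As a rough illustration of the 1% threshold, the check behind such an alert reduces to a ratio of failed calls to total calls over a time window. The Python sketch below uses a hypothetical function name and made-up inputs; it is not the actual alert rule, which would be written against the existing metric.

```python
ERROR_RATIO_THRESHOLD = 0.01  # 1%, taken from the proposal above


def error_ratio_breached(failed_calls: int, total_calls: int,
                         threshold: float = ERROR_RATIO_THRESHOLD) -> bool:
    """Return True when the share of failed calls reaches the threshold.

    With no traffic in the window there is no ratio to evaluate, so this
    sketch reports no breach. That is an assumption: a real alert could
    also treat missing traffic as a problem of its own.
    """
    if total_calls == 0:
        return False
    return failed_calls / total_calls >= threshold
```

For example, 5 failures out of 1000 calls (0.5%) would not breach the threshold, while 20 out of 1000 (2%) would.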
Iteration 2
Implement an SLA for KAS defined as the latency between "time of commit push" and "time when KAS sent info about this commit to agentk", with the (intolerable) threshold set at 1 hour. Note: we need to implement this metric in KAS.
Start the latency measurement when the commit is pushed to GitLab (alternatively, when the commit is detected by KAS). Ideally, we should not make another gRPC call just to get this data.
Finish the latency measurement when the commit content is sent to the agent.
** We may not get an early warning if the latency for this is infinite.
** If we miss a notification, we probably cannot publish any latency for it.
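The measurement described above can be sketched as follows. This is only an illustration in Python with hypothetical names (`CommitTiming`, `delivery_latency`, `is_intolerable`), not the KAS implementation. It also shows the caveat from the notes: a commit that is never sent to the agent produces no latency sample at all.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

INTOLERABLE_LATENCY = timedelta(hours=1)  # threshold from the proposal


@dataclass
class CommitTiming:
    pushed_at: datetime                 # start: commit pushed (or detected by KAS)
    sent_at: Optional[datetime] = None  # finish: commit content sent to the agent


def delivery_latency(timing: CommitTiming) -> Optional[timedelta]:
    """Return the push-to-send latency, or None if nothing was sent yet."""
    if timing.sent_at is None:
        return None  # no finish time, so there is no sample to publish
    return timing.sent_at - timing.pushed_at


def is_intolerable(latency: timedelta) -> bool:
    """Return True when the latency exceeds the 1 hour threshold."""
    return latency > INTOLERABLE_LATENCY
```

Because `delivery_latency` returns `None` for an undelivered commit, a metric built only from completed deliveries stays silent in exactly the case where the latency is effectively infinite.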
Update:
Due to technical challenges, we decided to measure this SLA in another fashion:
We'll measure GitOps Gitaly poll intervals, i.e. the time that elapses between consecutive KAS calls to Gitaly to check for new commits.
This poll interval is configured on KAS startup. At the time of this write-up, it's 20 seconds, so under regular conditions we expect the poll interval to stay close to 20 seconds. If there's a capacity problem, the interval might grow, and we would track these longer intervals.
We'll track poll intervals in Prometheus Histogram buckets of [20, 40, 60, 80, 100] seconds. These buckets will allow us to set up an alert based on an approximate Apdex score.
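To make the bucket layout concrete, here is a small Python sketch of how observed poll intervals would fall into cumulative `le` buckets, which is how Prometheus histograms count observations. The sample intervals and the function name are made up for illustration.

```python
from bisect import bisect_left

BUCKETS = [20, 40, 60, 80, 100]  # upper bounds in seconds, from the text above


def cumulative_bucket_counts(intervals, buckets=BUCKETS):
    """Count observations per cumulative bucket (le = "less than or equal").

    Every observation is counted in each bucket whose upper bound is at
    least the observed value, and always in the implicit "+Inf" bucket.
    """
    counts = {str(b): 0 for b in buckets}
    counts["+Inf"] = 0
    for value in intervals:
        first = bisect_left(buckets, value)  # first bucket with bound >= value
        for bound in buckets[first:]:
            counts[str(bound)] += 1
        counts["+Inf"] += 1
    return counts


# Hypothetical poll intervals in seconds
print(cumulative_bucket_counts([19.8, 20.0, 21.5, 45.0, 130.0]))
```

In this example only two of the five samples land in the `le="20"` bucket: an interval even slightly above 20 seconds (21.5 here) is first counted in the `le="40"` bucket.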
We should create a separate issue to track the work to set up this alert:
(
sum(rate(gitops_poll_interval_bucket{le="20"}[5m])) by (job)
+
sum(rate(gitops_poll_interval_bucket{le="80"}[5m])) by (job)
) / 2 / sum(rate(gitops_poll_interval_count[5m])) by (job)
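The arithmetic behind the expression can be checked with a short Python sketch. It assumes, as the query does, that intervals of up to 20 seconds count as satisfied and intervals of up to 80 seconds as tolerated. The function name and the bucket counts are invented for the example.

```python
def approx_apdex(bucket_counts, satisfied_le="20", tolerated_le="80"):
    """Approximate Apdex from cumulative histogram bucket counts.

    Because buckets are cumulative, the tolerated bucket already contains
    the satisfied observations, so (satisfied + tolerated) / 2 equals
    satisfied + (tolerating-only) / 2, the usual Apdex numerator.
    """
    total = bucket_counts["+Inf"]  # the +Inf bucket holds every observation
    if total == 0:
        raise ValueError("no observations in the window")
    satisfied = bucket_counts[satisfied_le]
    tolerated = bucket_counts[tolerated_le]
    return (satisfied + tolerated) / 2 / total


# Hypothetical counts: 90 polls within 20 s, 98 within 80 s, 100 in total
counts = {"20": 90, "40": 95, "60": 97, "80": 98, "100": 99, "+Inf": 100}
print(approx_apdex(counts))  # (90 + 98) / 2 / 100
```

With these counts the score is 0.94. The total is taken from the `+Inf` bucket (equivalently, the histogram's observation count) rather than from a sum over all buckets, since summing cumulative buckets would count each observation several times.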