Improve Grafana dashboard for kas
https://dashboards.gitlab.net/d/kas-main/kas-overview needs to be improved to show:
Communication metrics
Each group should probably be a panel, showing what RPCs are being called and what status codes are being returned/received.
- kas->Gitaly
- kas->kas
- agentk->kas
- GitLab->kas
- kas->GitLab
- kas->redis
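A minimal sketch of how one query per panel could be generated for the gRPC-based groups. It assumes the gRPC metrics follow the go-grpc-prometheus naming (`grpc_server_handled_total` / `grpc_client_handled_total` with `grpc_service`, `grpc_method` and `grpc_code` labels). Whether kas exports exactly these series, and which label selectors separate one group from another, needs to be checked against the real metrics. The helper name `rpc_rate_query` is hypothetical.

```python
def rpc_rate_query(metric: str, selectors: dict) -> str:
    """Build a PromQL expression for RPC rate grouped by service, method and status code.

    `metric` and `selectors` are assumptions supplied by the caller;
    nothing here is taken from the actual kas dashboards.
    """
    # Sort the selectors so the generated query is stable across runs.
    labels = ",".join(f'{name}="{value}"' for name, value in sorted(selectors.items()))
    return (
        "sum by (grpc_service, grpc_method, grpc_code) "
        f"(rate({metric}{{{labels}}}[$__rate_interval]))"
    )


if __name__ == "__main__":
    # Example: server-side view of calls handled by kas (selector values are illustrative).
    print(rpc_rate_query("grpc_server_handled_total", {"app": "kas", "env": "gprd"}))
```

The resulting string can be pasted into a panel query or embedded in the jsonnet dashboard definition. The non-gRPC groups (for example kas->redis) would need their own metric names.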
CI tunnel
- Kubernetes API proxying req/sec grouped by response code (separate panel for non-200 responses). https://dashboards.gitlab.net/d/kas-ci-tunnel/kas-ci-tunnel?orgId=1
- k8s_api_proxy_routing_duration_seconds
  - heatmap for kas->kas routing duration histogram (sum(rate(k8s_api_proxy_routing_duration_seconds_bucket{app="kas",env="gprd"}[$__rate_interval])) by (le))
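A note on why the heatmap query groups by `le`: Prometheus `_bucket` series are cumulative, so each bucket also counts every observation from the smaller buckets. A heatmap needs the value that falls into each individual bucket, which is the difference between neighbouring cumulative values. As far as I understand, Grafana does this conversion itself when the query format is set to "Heatmap". The sketch below only illustrates the arithmetic; the function name and sample numbers are made up.

```python
def per_bucket_values(cumulative: dict) -> dict:
    """Convert cumulative `le` bucket values into per-bucket values.

    `cumulative` maps a bucket upper bound (float, with float("inf") for +Inf)
    to the cumulative rate or count reported for that bucket.
    """
    result = {}
    previous = 0.0
    # Walk the buckets from the smallest upper bound to +Inf.
    for upper_bound in sorted(cumulative):
        result[upper_bound] = cumulative[upper_bound] - previous
        previous = cumulative[upper_bound]
    return result


if __name__ == "__main__":
    # Illustrative numbers, not real kas data.
    sample = {0.1: 5.0, 0.5: 8.0, 1.0: 9.0, float("inf"): 9.0}
    print(per_bucket_values(sample))
```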
SLA/SLI for GitOps polling
See #90 (closed). We need to draw the threshold line and the graph/heatmap for this SLI.
Rate limiting
We need to draw the threshold line and the graph for what is being limited:
- kas->Gitaly
- agentk->kas
- kas->GitLab
Various
- number of kas pods and their versions - https://dashboards.gitlab.net/d/kas-pod/kas-pod-info?orgId=1
- goroutines
- GC metrics?
Useful links
- https://gitlab.com/gitlab-com/runbooks/-/blob/master/dashboards/kas/main.dashboard.jsonnet
- https://github.com/grafana/grafonnet-lib/
- https://grafana.github.io/grafonnet-lib/api-docs/
- https://grafana.github.io/grafonnet-lib/getting-started/
- https://grafana.com/docs/grafana/latest/visualizations/heatmap/
- https://grafana.com/docs/grafana/latest/basics/intro-histograms/
- https://towardsdatascience.com/prometheus-histograms-with-grafana-heatmaps-d556c28612c7
Edited by Mikhail Mazurskiy