Grafana-dashboards-as-code
This change allows us to keep some of our dashboards in the runbooks project.
Example of a dashboard generated with this change: https://dashboards.gitlab.net/d/XIXT9Tqik/test-general-service-metrics?orgId=1
On master builds, the dashboards will be uploaded to https://dashboards.gitlab.com. Any local changes to these dashboards on the Grafana instance will be overwritten.
The dashboards are kept in grafonnet format, which is based on the jsonnet template language.
Local Development
- Install jsonnet, jq and curl
  - On a Mac, jsonnet can be installed with brew install jsonnet
  - On Linux, you'll need to build the binary yourself, or use the docker image: docker run --rm registry.gitlab.com/gitlab-com/runbooks/jsonnet:latest
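Before editing dashboards, it can help to confirm that the three tools are on the PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical and not part of the runbooks project.

```shell
# Print the name of every tool passed as an argument that is not on the PATH.
# missing_tools is a hypothetical helper written for this example.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Check the tools named in the install step above.
missing=$(missing_tools jsonnet jq curl)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names are joined onto one line.
  echo "Missing tools:" $missing
fi
```

If nothing is printed, all three tools were found.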
Editing Files
- Dashboards should be kept in files with the following name: /dashboards/[grafana_folder_name]/[name].dashboard.jsonnet
  - grafana_folder_name refers to the Grafana folder that the files will be uploaded to. Note that the folder must already exist.
- Obtain an API key for the Grafana instance and export it in GRAFANA_API_TOKEN: export GRAFANA_API_TOKEN=123
- To upload the files, run ./dashboards/upload.sh
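The naming convention and upload step above can be sketched roughly as follows. This is not the contents of ./dashboards/upload.sh, which are not shown here. The helper names, the GRAFANA_URL variable and the numeric folder_id argument are assumptions made for illustration. The sketch posts to Grafana's dashboard HTTP API (POST /api/dashboards/db) using the token from GRAFANA_API_TOKEN, and it assumes jsonnet can resolve the grafonnet imports (for example through a -J library path).

```shell
# Grafana folder name: the directory that contains the dashboard file.
dashboard_folder() {
  basename "$(dirname "$1")"
}

# Dashboard name: the file name without the .dashboard.jsonnet suffix.
dashboard_name() {
  basename "$1" .dashboard.jsonnet
}

# Compile one dashboard to JSON and upload it, overwriting any existing copy.
# Assumptions: GRAFANA_URL is the base URL of the Grafana instance and
# folder_id is the numeric id of a folder that already exists.
upload_dashboard() {
  file=$1
  folder_id=$2
  jsonnet "$file" \
    | jq --argjson folderId "$folder_id" \
        '{dashboard: ., folderId: $folderId, overwrite: true}' \
    | curl --fail --silent --show-error \
        -H "Authorization: Bearer ${GRAFANA_API_TOKEN}" \
        -H 'Content-Type: application/json' \
        --data @- \
        "${GRAFANA_URL}/api/dashboards/db"
}
```

Setting overwrite to true matches the behaviour described earlier: local changes made on the Grafana instance are replaced by the version kept in the repository.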
The jsonnet docker image
- Google does not maintain official docker images for jsonnet.
- For this reason, we have a manual build step to build the registry.gitlab.com/gitlab-com/runbooks/jsonnet:latest image.
- To update the image, run this job in the CI build manually
Features
As a first dashboard-as-code, I've rewritten the general service metric dashboards. I've also added some new features to them, documented below.
Triage Dashboard
The triage page shows:
Services with current alerts, colour-coded and ordered by severity, with click-through links to service dashboards
Indicators for each metric to indicate whether the metric is in violation of its SLO at that point in the chart
Service Dashboard
Current alerts for the service, colour-coded and ordered by severity, with click-through links to alert manager
Apdex scores for the service, with current values, value from same-time-last-week, expected-normal boundaries and a latency SLO indicator for the service
- White dotted line is same-time-last-week value
- Red dotted line is the SLO threshold indicator value
Error ratios (% requests in error) with same-time-last-week value, expected-normal boundaries and an error-rate SLO indicator for the service
- White dotted line is same-time-last-week value
- Red dotted line is the SLO threshold indicator value
Service Availability
- Shows the percentage of requests that are responding as healthy
- White dotted line is same-time-last-week
QPS Operation Rates for the service, with predicted normal boundaries
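As a rough illustration of what the error-ratio panel and its SLO indicator express, the arithmetic can be sketched as below. The function names, the request counts and the threshold are hypothetical, and the sketch assumes that an error ratio is in violation when it rises above the SLO threshold. The dashboards themselves compute these values from monitoring queries, not from shell.

```shell
# error_ratio ERRORS TOTAL
# Prints the percentage of requests in error, to two decimal places.
error_ratio() {
  awk -v e="$1" -v t="$2" \
    'BEGIN { if (t == 0) print "0.00"; else printf "%.2f\n", 100 * e / t }'
}

# violates_error_slo RATIO THRESHOLD
# Succeeds (exit status 0) when the ratio is above the threshold.
# Assumption: a higher error ratio is worse, so "above" means "in violation".
violates_error_slo() {
  awk -v r="$1" -v s="$2" 'BEGIN { exit !(r > s) }'
}

# Hypothetical numbers: 5 failed requests out of 1000, SLO threshold of 0.1%.
ratio=$(error_ratio 5 1000)
if violates_error_slo "$ratio" 0.1; then
  echo "error ratio ${ratio}% is above the SLO threshold"
fi
```

The apdex indicator would work the other way round under the same assumption: a score below its threshold counts as a violation.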
Try it yourself
Edited by Andrew Newdigate