
Grafana-dashboards-as-code

Andrew Newdigate requested to merge dashboard-experiment into master

This change allows us to keep some of our dashboards in the runbooks project.

Example of a dashboard generated with this change: https://dashboards.gitlab.net/d/XIXT9Tqik/test-general-service-metrics?orgId=1

On master builds, the dashboards will be uploaded to https://dashboards.gitlab.com. Any local changes to these dashboards on the Grafana instance will be overwritten.

The dashboards are kept in grafonnet format, which is based on the jsonnet template language.

Local Development

  • Install jsonnet, jq and curl
    • On a Mac, jsonnet can be installed with brew install jsonnet
    • On Linux, you'll need to build the binary yourself, or use the Docker image: docker run --rm registry.gitlab.com/gitlab-com/runbooks/jsonnet:latest
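
A quick way to check the prerequisites above before going any further. This is only a sketch: the missing_tools helper is a name made up for illustration and is not part of the runbooks project.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the runbooks repo): print the names of
# any required tools that are not on PATH, separated by spaces.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing} ${tool}"
    fi
  done
  # Strip the leading space left by the concatenation above.
  printf '%s\n' "${missing# }"
}

# Usage:
#   missing="$(missing_tools jsonnet jq curl)"
#   [ -z "$missing" ] || echo "Please install: $missing" >&2
```

If jsonnet is the only tool reported as missing on Linux, the Docker image mentioned above is the alternative to building the binary.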

Editing Files

  • Dashboards should be kept in files with the following name: /dashboards/[grafana_folder_name]/[name].dashboard.jsonnet
    • grafana_folder_name is the Grafana folder that the files will be uploaded to. Note that the folder must already exist.
  • Obtain an API key for the Grafana instance and export it as GRAFANA_API_TOKEN:
    • export GRAFANA_API_TOKEN=123
  • To upload the files, run ./dashboards/upload.sh
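
To make the naming convention and the upload step more concrete, here is a rough sketch of how a script could walk the dashboards directory and push one dashboard. It is not the contents of ./dashboards/upload.sh, which may work differently. The function names, the GRAFANA_URL variable and the folder_id argument are all assumptions made for illustration, and the jsonnet invocation may need extra library paths for grafonnet. The only parts taken from elsewhere are Grafana's POST /api/dashboards/db endpoint and the GRAFANA_API_TOKEN variable described above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, NOT the real upload.sh.

# List every file under a directory that follows the
# [grafana_folder_name]/[name].dashboard.jsonnet convention.
list_dashboards() {
  find "$1" -type f -name '*.dashboard.jsonnet' | sort
}

# Grafana folder name = name of the directory containing the file.
dashboard_folder() {
  basename "$(dirname "$1")"
}

# Dashboard name = file name without the .dashboard.jsonnet suffix.
dashboard_name() {
  basename "$1" .dashboard.jsonnet
}

# Render one dashboard and post it to Grafana, overwriting any existing copy.
# Assumes GRAFANA_URL (hypothetical) and GRAFANA_API_TOKEN are exported, and
# that the numeric id of the target folder has already been looked up.
upload_dashboard() {
  local file="$1" folder_id="$2"
  jsonnet "$file" \
    | jq --argjson folderId "$folder_id" \
        '{dashboard: ., folderId: $folderId, overwrite: true}' \
    | curl --silent --fail \
        -H "Authorization: Bearer ${GRAFANA_API_TOKEN}" \
        -H "Content-Type: application/json" \
        -d @- \
        "${GRAFANA_URL}/api/dashboards/db"
}
```

The overwrite: true flag in the payload is what causes local edits on the Grafana instance to be replaced on upload.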

The jsonnet docker image

  • Google does not maintain official docker images for jsonnet.
  • For this reason, we have a manual build step to build the registry.gitlab.com/gitlab-com/runbooks/jsonnet:latest image.
  • To update the image, run this job in the CI build manually

Features

As a first dashboard-as-code, I've rewritten the general service metric dashboards. I've also added some new features to them, documented below.

Triage Dashboard

image

The triage page shows:

Services with current alerts, colour-coded and ordered by severity, with click-through links to the service dashboards

image

Indicators for each metric to show whether the metric is in violation of its SLO at that point in the chart

image

Service Dashboard

Current alerts for the service, colour-coded and ordered by severity, with click-through links to alert manager

image

Apdex scores for the service, with current values, the value from the same time last week, expected-normal boundaries and a latency SLO indicator for the service

  • White dotted line is same-time-last-week value
  • Red dotted line is the SLO threshold indicator value

image

Error ratios (% of requests in error) with the same-time-last-week value, expected-normal boundaries and an error-rate SLO indicator for the service

  • White dotted line is same-time-last-week value
  • Red dotted line is the SLO threshold indicator value

image

Service Availability

  • Shows the percentage of requests that are responding as healthy
  • White dotted line is same-time-last-week

image

QPS Operation Rates for the service, with predicted normal boundaries

image

Try it yourself

Edited by Andrew Newdigate
