Andrew Newdigate requested to merge dashboard-experiment into master

This change allows us to keep some of our dashboards in the runbooks project.

Example of a dashboard generated with this change:

On master builds, the dashboards will be uploaded to the Grafana instance. Any local changes made to these dashboards on that instance will be overwritten.

The dashboards are kept in grafonnet format, which is based on the jsonnet template language.

Local Development

  • Install jsonnet, jq and curl
    • On a Mac, jsonnet can be installed with brew install jsonnet
    • On Linux, you'll need to build the binary yourself, or use the docker image docker run --rm

Editing Files

  • Dashboards should be kept in files with the following name: /dashboards/[grafana_folder_name]/[name].dashboard.jsonnet
    • grafana_folder_name refers to the Grafana folder to which the files will be uploaded. Note that the folder must already exist.
  • Obtain an API key for the Grafana instance and export it in GRAFANA_API_TOKEN:
    • export GRAFANA_API_TOKEN=123
  • To upload the files, run ./dashboards/
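The naming convention and upload step above can be sketched in Python. This is only an illustration under stated assumptions, not the upload script from this change: `parse_dashboard_path` and `build_upload_request` are hypothetical helpers, the dashboard JSON is assumed to have already been rendered from the jsonnet source, and the numeric folder id is assumed to have been looked up beforehand. The request targets Grafana's `POST /api/dashboards/db` endpoint with `overwrite` set, which matches the note that local changes on the instance are overwritten.

```python
import json
import urllib.request
from pathlib import PurePosixPath

SUFFIX = ".dashboard.jsonnet"


def parse_dashboard_path(path):
    """Split dashboards/<folder>/<name>.dashboard.jsonnet into (folder, name).

    Hypothetical helper; raises ValueError if the path does not follow
    the naming convention.
    """
    parts = PurePosixPath(path).parts
    # Tolerate the leading "/" used when the convention is written out.
    if parts and parts[0] == "/":
        parts = parts[1:]
    valid = (
        len(parts) == 3
        and parts[0] == "dashboards"
        and parts[2].endswith(SUFFIX)
        and len(parts[2]) > len(SUFFIX)
    )
    if not valid:
        raise ValueError("not a dashboard file: " + path)
    return parts[1], parts[2][: -len(SUFFIX)]


def build_upload_request(grafana_url, dashboard, folder_id, token):
    """Build, but do not send, a dashboard upload request.

    `dashboard` is the rendered dashboard as a dict; `folder_id` is the
    id of an existing Grafana folder (assumed to be known already).
    """
    payload = {
        "dashboard": dashboard,
        "folderId": folder_id,
        "overwrite": True,  # replace whatever is on the instance
    }
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the upload; the token would normally be read from the `GRAFANA_API_TOKEN` environment variable rather than hard-coded.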

The jsonnet docker image

  • Google does not maintain official docker images for jsonnet.
  • For this reason, we have a manual build step to build the image.
  • To update the image, run this job in the CI build manually


As a first dashboard-as-code example, I've rewritten the general service metric dashboards. I've also added some new features to them, documented below.

Triage Dashboard


The triage page shows:

Services with current alerts, colour-coded and ordered by severity, with click-through links to service dashboards


Indicators for each metric to indicate whether the metric is in violation of its SLO at that point in the chart
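The per-point indicator can be thought of as a boolean series alongside each metric. The sketch below is an assumption about the logic only; the queries the dashboards actually use are not shown here, and `slo_violations` is a hypothetical name. For a metric where higher is better (such as an Apdex score), a sample violates the SLO when it falls below the threshold; for a metric where lower is better (such as an error ratio), when it rises above it.

```python
def slo_violations(values, threshold, higher_is_better):
    """Return one boolean per sample: True where the sample violates the SLO.

    Hypothetical helper for illustration only.
    """
    if higher_is_better:
        # e.g. an Apdex score dropping below its SLO threshold
        return [value < threshold for value in values]
    # e.g. an error ratio rising above its SLO threshold
    return [value > threshold for value in values]
```

For example, `slo_violations([0.99, 0.93, 0.97], 0.95, higher_is_better=True)` flags only the middle sample.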


Service Dashboard

Current alerts for the service, colour-coded and ordered by severity, with click-through links to alert manager


Apdex Scores for the Service, with current values, value from same-time-last-week, expected-normal boundaries and a latency SLO indicator for the service

  • White dotted line is same-time-last-week value
  • Red dotted line is the SLO threshold indicator value


Error ratios (% requests in error) with same-time-last-week value, expected-normal boundaries and an error-rate SLO indicator for the service

  • White dotted line is same-time-last-week value
  • Red dotted line is the SLO threshold indicator value
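As a small worked example of the quantity plotted here, the error ratio is the share of requests that ended in error, expressed as a percentage. The function name and the choice to return `None` when there is no traffic are assumptions made for this sketch, not details taken from the dashboards.

```python
def error_ratio_percent(error_count, request_count):
    """Percentage of requests that ended in error.

    Returns None when there were no requests, since the ratio is
    undefined in that case (an assumption made for this sketch).
    """
    if request_count == 0:
        return None
    return 100.0 * error_count / request_count
```

With 5 errors out of 1000 requests this gives 0.5, which would then be compared against the error-rate SLO threshold drawn as the red dotted line.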


Service Availability

  • Shows the percentage of requests that are responding as healthy
  • White dotted line is same-time-last-week
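The same-time-last-week line is the same metric shifted by exactly seven days. A minimal sketch of that lookup, assuming the series is held as a mapping from timestamp to value (a hypothetical in-memory representation; in PromQL the equivalent would typically be an `offset 1w` modifier, though the queries used by this change are not shown here):

```python
from datetime import datetime, timedelta


def same_time_last_week(series, at):
    """Return the value recorded exactly one week before `at`.

    `series` maps datetime -> value; returns None if no sample exists
    for that moment.
    """
    return series.get(at - timedelta(weeks=1))
```

For a sample taken on a Tuesday at 12:00, this returns the value from the previous Tuesday at 12:00, which is what the white dotted line plots.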


QPS Operation Rates for the service, with predicted normal boundaries


Try it yourself

Edited by Andrew Newdigate
