Andrew Newdigate requested to merge dashboard-experiment into master Mar 20, 2019

This change allows us to keep some of our dashboards in the runbooks project.

Example of a dashboard generated with this change: https://dashboards.gitlab.net/d/XIXT9Tqik/test-general-service-metrics?orgId=1

On master builds, the dashboards will be uploaded to https://dashboards.gitlab.com. Any local changes to these dashboards on the Grafana instance will be overwritten.

The dashboards are kept in grafonnet format, which is based on the jsonnet template language.
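For readers new to the format, here is a rough sketch of what a small grafonnet dashboard file might look like, written out from the shell. The dashboard title, tag, datasource name and PromQL expression are all hypothetical and are not taken from this merge request. `GRAFONNET_PATH` is an assumed variable pointing at a directory that contains a grafonnet-lib checkout.

```shell
# Write a minimal, hypothetical grafonnet dashboard into a scratch directory.
workdir="$(mktemp -d)"
example="$workdir/example.dashboard.jsonnet"

cat > "$example" <<'EOF'
local grafana = import 'grafonnet/grafana.libsonnet';

// Title, tag, datasource and query below are illustrative only.
grafana.dashboard.new(
  'Example service metrics',
  tags=['example'],
)
.addPanel(
  grafana.graphPanel.new(
    'Requests per second',
    datasource='Prometheus',
  )
  .addTarget(
    grafana.prometheus.target(
      'sum(rate(http_requests_total[5m]))',
      legendFormat='rps',
    )
  ),
  gridPos={ x: 0, y: 0, w: 24, h: 8 },
)
EOF

# Compile to Grafana JSON only when jsonnet and the assumed library path exist.
if command -v jsonnet >/dev/null 2>&1 && [ -n "${GRAFONNET_PATH:-}" ]; then
  jsonnet -J "$GRAFONNET_PATH" "$example"
fi
```

The compiled output is plain Grafana dashboard JSON, which is why the generated dashboards can be uploaded through the Grafana API.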

Local Development

Install jsonnet, jq and curl:
- On a Mac, jsonnet can be installed with `brew install jsonnet`.
- On Linux, you'll need to build the binary yourself, or use the docker image: `docker run --rm registry.gitlab.com/gitlab-com/runbooks/jsonnet:latest`
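A quick way to confirm the three tools are available before going further might look like the following. The helper name `missing_tools` is made up for this sketch.

```shell
# Print the name of every given tool that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the three prerequisites named above and warn about any that are absent.
for tool in $(missing_tools jsonnet jq curl); do
  echo "Missing prerequisite: $tool" >&2
done
```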

Editing Files

Dashboards should be kept in files named as follows: `/dashboards/[grafana_folder_name]/[name].dashboard.jsonnet`
- `grafana_folder_name` is the Grafana folder that the files will be uploaded to. Note that the folder must already exist.
Obtain an API key for the Grafana instance and export it as `GRAFANA_API_TOKEN`:
- `export GRAFANA_API_TOKEN=123`
To upload the files, run `./dashboards/upload.sh`
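To illustrate how the naming convention and the API token fit together, here is a hedged sketch of an upload flow. It is not the contents of `./dashboards/upload.sh`. `GRAFANA_URL` is an assumed variable, the function names are hypothetical, and matching the folder by its title is an assumption. The `/api/folders` and `/api/dashboards/db` endpoints are from the Grafana HTTP API, and jsonnet is assumed to be able to resolve the grafonnet imports.

```shell
# Hypothetical sketch only; NOT the actual ./dashboards/upload.sh.
# GRAFANA_URL is assumed; GRAFANA_API_TOKEN is the variable exported above.

# Folder name is the directory that directly contains the dashboard file.
dashboard_folder() {
  basename "$(dirname "$1")"
}

# Dashboard name is the file name without the .dashboard.jsonnet suffix.
dashboard_name() {
  basename "$1" .dashboard.jsonnet
}

# Look up the numeric id of an existing Grafana folder by its title.
grafana_folder_id() {
  curl --silent --fail \
    -H "Authorization: Bearer ${GRAFANA_API_TOKEN}" \
    "${GRAFANA_URL}/api/folders" |
    jq --arg title "$1" '.[] | select(.title == $title) | .id'
}

# Compile one dashboard and post it, overwriting any copy already in Grafana.
upload_dashboard() {
  folder_id="$(grafana_folder_id "$(dashboard_folder "$1")")"
  if [ -z "$folder_id" ]; then
    echo "Grafana folder for $1 not found; create it first" >&2
    return 1
  fi
  jsonnet "$1" |
    jq --argjson folderId "$folder_id" \
      '{dashboard: ., folderId: $folderId, overwrite: true}' |
    curl --silent --fail -X POST \
      -H "Authorization: Bearer ${GRAFANA_API_TOKEN}" \
      -H "Content-Type: application/json" \
      --data-binary @- \
      "${GRAFANA_URL}/api/dashboards/db"
}
```

With these definitions, a call such as `upload_dashboard dashboards/general/triage.dashboard.jsonnet` (a made-up path) would target the `general` folder. The `overwrite: true` field is what causes local edits made in Grafana to be replaced.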

The `jsonnet` docker image

Google does not maintain official docker images for jsonnet.
For this reason, we have a manual build step to build the registry.gitlab.com/gitlab-com/runbooks/jsonnet:latest image.
To update the image, run this job manually in the CI build.

Features

As a first dashboard-as-code, I've rewritten the general service metric dashboards. I've also added some new features to them, documented below.

Triage Dashboard

The triage page shows:

- Services with current alerts, colour-coded and ordered by severity, with click-through links to service dashboards
- Indicators for each metric to indicate whether the metric is in violation of its SLO at that point in the chart

Service Dashboard

Current alerts for the service, colour-coded and ordered by severity, with click-through links to alert manager

Apdex scores for the service, with current values, same-time-last-week values, expected-normal boundaries and a latency SLO indicator for the service

White dotted line is same-time-last-week value
Red dotted line is the SLO threshold indicator value

Error ratios (% requests in error) with same-time-last-week value, expected-normal boundaries and an error-rate SLO indicator for the service

White dotted line is same-time-last-week value
Red dotted line is the SLO threshold indicator value

Service Availability

Shows the percentage of requests that are responding as healthy
White dotted line is same-time-last-week

QPS Operation Rates for the service, with predicted normal boundaries

Try it yourself

Triage page: https://dashboards.gitlab.net/d/ZUei7TkWz/platform-metrics
Service page: https://dashboards.gitlab.net/d/26q8nTzZz/service-platform-metrics

Edited Apr 26, 2019 by Andrew Newdigate

Grafana-dashboards-as-code