Support additional Prometheus metrics

Resources

PM @joshlambert | FE @kushalpandya | UX @pedroms

Description

As part of GitLab 9.0, we are shipping support for two metrics as our MVP: CPU and Memory utilization, pulled from Kubernetes. While these two metrics are critical pieces of information to have, there are many other metrics available which customers will want to keep an eye on. To support a broader set of metrics, like request and error rates, we need to expand the set of metrics we support.

There are two main categories of metrics:

Common metrics from well-known Exporters, as defined on the Prometheus Exporter page.
Customer-specific metrics which a customer may have added themselves to our app, much like we have done with gitlab-monitor.

For this issue, we will focus on the common metrics that are included in the list of well-known Prometheus Exporters.

Common Metrics

In most cases, customers will be using a metric from a well-known Exporter. These are by far the most common and the most likely to be used. To make this as easy as possible, we should offer a "Metric Library" which contains a preset list of queries for well-defined metric names. When you first configure the Prometheus server, we can then attempt to auto detect which metrics are being monitored, based on this library.

Library Format

The library of metrics should be specified in a YAML file. The YAML file should be a list of Services (like "HA Proxy", "Apache", etc.), each of which has a collection of metrics.

Service Name: Name of the service, for example "HA Proxy" or "Apache"
Priority: Relative priority to show on the page. (A higher value means higher priority and should show first.)
Array of Metrics

Each metric then has its own properties:

Metric: Base metric name, for auto detection purposes
Metric Name: English language explanation of metric. For example, "CPU Utilization" or "Error Rate".
Array of Queries
- Query: Prometheus query to be used, with variables included.
- Query Name: Name of the query, for example "Average CPU Utilization".
- Query Units: Unit type returned by the query. For example "MB" or "Requests/sec".
Weight: Floating-point number from 0 to 1 which indicates the relative importance of the metric. Used when we can only show a small set of metrics and have to decide which ones, as in the Merge Request flow.

For now, we will limit the number of queries per metric to one, but we should plan for more within the same format for future use.
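To make the format concrete, here is a minimal sketch of the structure described above, written as Python dataclasses rather than YAML. The field names, the example metric names and the example queries are illustrative assumptions, not a final schema. It also shows how Priority and Weight could be used for ordering.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Query:
    query: str  # Prometheus query, may include variables
    name: str   # e.g. "Average CPU Utilization"
    units: str  # e.g. "MB" or "Requests/sec"


@dataclass
class Metric:
    metric: str    # base metric name, used for auto detection
    name: str      # English-language explanation of the metric
    weight: float  # 0 to 1, relative importance
    queries: List[Query] = field(default_factory=list)


@dataclass
class Service:
    name: str
    priority: int  # higher shows first
    metrics: List[Metric] = field(default_factory=list)


def services_in_display_order(library: List[Service]) -> List[Service]:
    """Sort services so that the highest priority comes first."""
    return sorted(library, key=lambda s: s.priority, reverse=True)


def top_metrics(library: List[Service], limit: int) -> List[Metric]:
    """Pick the `limit` most important metrics across all services by weight."""
    all_metrics = [m for s in library for m in s.metrics]
    return sorted(all_metrics, key=lambda m: m.weight, reverse=True)[:limit]


# Hypothetical library content; metric names and queries are made up.
LIBRARY = [
    Service("Apache", priority=10, metrics=[
        Metric("example_apache_requests_total", "Throughput", 0.5, [
            Query("rate(example_apache_requests_total[2m])",
                  "Requests per second", "Requests/sec"),
        ]),
    ]),
    Service("HA Proxy", priority=20, metrics=[
        Metric("example_haproxy_errors_total", "Error Rate", 0.9, [
            Query("rate(example_haproxy_errors_total[2m])",
                  "Errors per second", "Errors/sec"),
        ]),
    ]),
]
```

Each metric carries a list of queries even though we only plan to use one for now, so the format does not need to change later.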

Variables

In some cases, the Prometheus server may be picking up more than a single environment. A good example of this is the Kubernetes exporter, which will report data on the entire cluster. In cases like these, we need to provide a way to distinguish one environment from another. To do this, we need to support substitution of the CI variable CI_ENVIRONMENT_SLUG in queries. This is the identifier used across GitLab to identify an environment, and we should continue to use that here.
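As a rough sketch of what substitution could look like: the placeholder syntax `$CI_ENVIRONMENT_SLUG`, the function name and the example query are assumptions for illustration only. The actual placeholder format is still to be decided.

```python
PLACEHOLDER = "$CI_ENVIRONMENT_SLUG"  # assumed placeholder syntax


def render_query(query_template: str, environment_slug: str) -> str:
    """Replace every occurrence of the placeholder with the environment slug.

    Plain string replacement is used so that other characters that are
    meaningful in a Prometheus query (braces, quotes) are left untouched.
    """
    return query_template.replace(PLACEHOLDER, environment_slug)


# Hypothetical query template from the library.
template = 'avg(example_memory_usage_bytes{environment="$CI_ENVIRONMENT_SLUG"})'
rendered = render_query(template, "production")
```

With the slug `production`, the rendered query only selects series labelled with that environment.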

Autodetection Process

We will run auto detection on every cache fill (currently every 30s).

Detection method:

1. Retrieve the list of scraped metric names from the Prometheus server.
2. For each metric on the Prometheus server that matches one in the Library:
- Perform a simple query on that metric, to return all entries of that metric. Use the maximum supported time scale (currently 8h) to filter out old entries.
- Search all entries for ones that have a label matching CI_ENVIRONMENT_SLUG.
3. Save any matches to the cache.
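The detection steps above can be sketched roughly as follows. This is not a proposed implementation: `fetch_series` is a hypothetical callable standing in for the Prometheus query over the 8h window, and "a label matching CI_ENVIRONMENT_SLUG" is interpreted here as any label whose value equals the slug, which is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Set

LabelSet = Dict[str, str]


def detect_metrics(
    library_metrics: Set[str],
    scraped_names: Iterable[str],
    fetch_series: Callable[[str], List[LabelSet]],
    environment_slug: str,
) -> Set[str]:
    """Return the library metrics that have data for the given environment."""
    detected: Set[str] = set()
    for name in scraped_names:
        # Only metrics that appear in the library are candidates.
        if name not in library_metrics:
            continue
        # fetch_series returns the label set of every entry for this metric.
        for labels in fetch_series(name):
            if environment_slug in labels.values():
                detected.add(name)
                break  # one matching entry is enough
    return detected


# Fake data standing in for a Prometheus server.
FAKE_SERIES = {
    "example_requests_total": [{"environment": "production", "pod": "web-1"}],
    "example_errors_total": [{"environment": "staging"}],
    "unrelated_metric": [{"environment": "production"}],
}

found = detect_metrics(
    library_metrics={"example_requests_total", "example_errors_total"},
    scraped_names=FAKE_SERIES.keys(),
    fetch_series=lambda name: FAKE_SERIES[name],
    environment_slug="production",
)
```

The result is what would be saved to the cache for this environment.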

Once a metric has been successfully detected, it should be added to the monitoring list for this environment and scraped.

In the event a cached metric is not returning data, we should attempt another 9 times (for a total of 10, which is 5 minutes) before purging it from the cache.
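One way to track this is a per-metric counter of consecutive empty results, sketched below. The class and method names are hypothetical, and resetting the counter as soon as data comes back is an assumption, since the behavior after a partial run of failures is not specified above.

```python
MAX_ATTEMPTS = 10  # 10 attempts at 30s each is 5 minutes


class DetectedMetricCache:
    """Tracks detected metrics and purges those that stop returning data."""

    def __init__(self, max_attempts: int = MAX_ATTEMPTS) -> None:
        self.max_attempts = max_attempts
        self._misses = {}  # metric name -> consecutive attempts without data

    def add(self, metric: str) -> None:
        self._misses.setdefault(metric, 0)

    def __contains__(self, metric: str) -> bool:
        return metric in self._misses

    def record_result(self, metric: str, has_data: bool) -> bool:
        """Record one query attempt; return True if the metric stays cached."""
        if metric not in self._misses:
            return False
        if has_data:
            self._misses[metric] = 0  # assumption: any data resets the count
            return True
        self._misses[metric] += 1
        if self._misses[metric] >= self.max_attempts:
            del self._misses[metric]  # purge after the final failed attempt
            return False
        return True


cache = DetectedMetricCache()
cache.add("example_requests_total")
# Nine failed attempts keep the metric cached; the tenth purges it.
still_cached_after_nine = all(
    cache.record_result("example_requests_total", has_data=False)
    for _ in range(9)
)
purged_on_tenth = not cache.record_result("example_requests_total", has_data=False)
```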

API

As part of this, we should ensure we continue to allow full configuration via the API as well. That would include being able to add new queries, etc.

Designs

🔍 View design specs (for spacing, sizes, colors and text copying) — Hide notes in the top-right corner

Design mockups (images not included): Empty, Loading, Filled (default), Filled (missing panel expanded)

For the empty state, use this illustration SVG
The “More information” links to the documentation section about metrics auto-detection logic
The “Missing environment variable” panel is collapsed by default and is only shown if there are exporters with a missing environment variable

Documentation blurb

As part of GitLab 9.0 we launched application performance management integrated with CI/CD deployments, monitoring deployed applications on Kubernetes by tracking CPU and Memory utilization. This was a great first step, and with GitLab 9.3 we are excited to launch significantly expanded support for other metrics and deployment types.

Now with 9.3, GitLab will automatically detect common system services like web servers and databases, tracking key metrics like throughput and load. With support for a much wider set of metrics, performance monitoring is now available for all deployments.

Edited Jul 03, 2017 by silv