Query usage data from bundled Prometheus

Matthias Käppler requested to merge 217666-prometheus-api-client into master

What does this MR do?

See #217666 (closed)

In order to collect memory and topology statistics from on-prem customers, we need the ability to query their Prometheus server for the metrics we produce (e.g. from node_exporter) and transmit them back via usage pings. This will let us better map out our customer base: which reference architecture each installation falls under, how much memory their setups consume, and so on.

For an MVC, this first MR adds the following:

  • A single metric as a proof-of-concept that is queried via Prometheus and submitted as a Usage Ping.

I decided to report only a single metric for now, node_memory_total_bytes, as a proof of concept. This metric requires node_exporter to be running.
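For illustration, here is a minimal sketch of the kind of instant query involved, hitting the Prometheus HTTP API directly. The helper name, the use of Net::HTTP, and the assumption that the underlying node_exporter series is node_memory_MemTotal_bytes are all illustrative; this is not the code in this MR.

# Illustrative sketch only: run an instant query against Prometheus and map
# each returned sample to the usage ping key used in this MR.
require 'net/http'
require 'json'
require 'uri'

def query_node_memory_total_bytes(prometheus_base_url)
  uri = URI.join(prometheus_base_url, '/api/v1/query')
  uri.query = URI.encode_www_form(query: 'node_memory_MemTotal_bytes')

  response = Net::HTTP.get_response(uri)
  samples = JSON.parse(response.body).dig('data', 'result') || []

  # Each sample's value is a [unix_timestamp, value_string] pair, one sample per
  # scraped node_exporter instance; values arrive as strings (possibly in
  # scientific notation), hence to_f before to_i.
  samples.map do |sample|
    { node_memory_total_bytes: sample['value'].last.to_f.to_i }
  end
end

Against a hypothetical local instance, query_node_memory_total_bytes('http://localhost:9090') would return something like [{ node_memory_total_bytes: 1024 }], which is the shape the nodes array below is built from.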

Questions:

  • Should this be behind a feature toggle?
    • Decided it's probably not useful, since the change will only impact single-node self-managed deployments, over which we have no control anyway

Notes

Data structure

The data structure has been added at the top level of the usage ping payload and looks as follows:

{
  topology: {
    nodes: [
      {
        node_memory_total_bytes: 1024
      }
    ]
  }
}

The final structure is still TBD.

Error handling

We expect the topology structure to become a lot more complex, so this is still TBD. For now, I simply fall back to an empty Hash ({}) whenever we fail to connect to Prometheus, Prometheus isn't enabled by the customer, or any other error is raised.

If no error is raised but we simply don't find any results, we fall back to default values; for now, that is just the empty array ([]) when we fail to collect any node data.
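As a rough sketch of those fallback rules (reusing the hypothetical query helper from the sketch above; names are placeholders, not the actual code in this MR):

def topology_usage_data(prometheus_base_url)
  # No error but no samples: the helper already returns [], the default node list.
  { topology: { nodes: query_node_memory_total_bytes(prometheus_base_url) } }
rescue StandardError
  # Prometheus not enabled, unreachable, or any other error: report an empty Hash.
  {}
end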

Reach

The change will only work for (and affect) single-node Omnibus deployments, because we currently cannot locate a Prometheus instance that is not running on the same node as the application submitting the Usage Ping. This will change in the future. Furthermore, those customers will of course need Prometheus enabled and node_exporter running for any data to come through to us.

Screenshots

From Admin Area > Metrics and profiling > Usage statistics

No Prometheus configured / Prometheus down: Screenshot_from_2020-05-29_15-58-32
Prometheus available: Screenshot_from_2020-05-29_16-39-22

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

I've tested this against a local Prometheus instance and it worked.

I tried to also test this on a review-app, but Usage Ping is disabled and greyed out: https://gitlab-review-217666-pro-tg17m7.gitlab-review.app/admin/application_settings/metrics_and_profiling

It sounds like this would have to be enabled via gitlab.rb? (See the snippet after the log output below.) I also tried triggering it from a Rails console, but the task-runner consistently fails with what looks like an OOM when trying to launch a gitlab-rails console:

$ kubectl exec review-217666-pro-tg17m7-task-runner-675949d75f-qxd55 --namespace review-apps-ee -c task-runner -it -- gitlab-rails console
Fetching cluster endpoint and auth data.
kubeconfig entry generated for review-apps-ee.
--------------------------------------------------------------------------------
 GitLab:       13.1.0-pre () EE
 GitLab Shell: 13.2.0
 PostgreSQL:   10.9
--------------------------------------------------------------------------------
/usr/local/bin/gitlab-rails: line 5:    14 Killed                  $rails_dir/bin/bundle exec rails "$@"
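For reference, my assumption is that on an Omnibus install this would be switched on via gitlab.rb followed by a reconfigure (to be verified against the Omnibus docs):

# /etc/gitlab/gitlab.rb -- assumed setting name; apply with `gitlab-ctl reconfigure`
gitlab_rails['usage_ping_enabled'] = true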