Telemetry: Make topology data reports more robust
## Problem
We decided that Prometheus is the most promising vehicle for collecting memory & topology data for self-managed customers and querying it back prior to submitting it via a usage ping.
For the MVC we decided to associate and aggregate topology metric data by using Prometheus labels, specifically `job` and `instance`, largely because that is what's readily available in a vanilla Omnibus setup.
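As a rough illustration (the exact query is an assumption on my part, not the shipped one), the current approach boils down to keying everything on those two labels:

```promql
# Hypothetical sketch: aggregate process memory per service and node,
# relying solely on the `job` and `instance` labels being meaningful.
sum by (job, instance) (process_resident_memory_bytes)
```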
This approach has several drawbacks:
- **Accuracy.** The `job` label can be fairly broad. For instance, on .com we also use a `type` label to further dissect `gitlab-rails` targets into e.g. `web`, `api` or `git` nodes. Nor does `job` carry information about the web server used (`puma` or `unicorn`).
- **Correctness.** A larger issue is that we may be misreporting numbers in cases where there are further dimensions to consider that we are unaware of. For instance, if a customer emitted both `staging` and `prod` metrics to the Prometheus instance we query for data, we would aggregate over both, even though `prod` would likely be far more representative of e.g. memory consumption. Conversely, if a customer were to emit unrelated `process_*` metrics over the given `target` endpoint, we would accidentally fold those into the report.
- **Stability & availability.** Label config can be changed at will and is hence not very reliable, or at least not in the way it is used now. If a customer changed their label config, they might break our topology usage ping without us knowing about it.
## Proposal/Options
We should look into options for making this setup more reliable, so that we can expect topology pings to report correct metrics with higher confidence. Some ideas that have come up:
- **Use recording rules.** For query minimization, we have long planned to follow the same metrics catalog pattern we use for gitlab.com. By using recording rules in Omnibus, we can produce service-level metrics that abstract away these differences. This also makes it easier for us and for admins to see what data is recorded and shipped. #216660 (comment 358952794)
- **Refine Consul metrics.** Since we use Consul Service Discovery for multi-node setups, we could also describe the metadata more completely in Consul, then do some label rewriting in Prometheus.
- **Don't rely on labels as much.** Another idea was to use only the `instance` label to map metrics to a given node, and not use the `job` label at all. Instead, we could query a given instance for the metric names it has data for, and draw conclusions from those about which services are running on it. For example, an `instance` that has `sidekiq_` metrics is clearly running Sidekiq and not Puma. A major concern would be query performance, but perhaps this can be combined with recording rules for better efficiency.
- **Don't rely on node-exporter.** We currently use the `node_` metrics as the source of truth for which instances/nodes are running, and we use these data to key into service-level metrics. However, if a node is not running node-exporter but still runs other GitLab services, we currently get no data whatsoever for that node. We should make this more resilient so that in those cases we would still get `node_services` data, just not any top-level `node_*` metrics.
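To make the recording-rules option concrete, a rule shipped with Omnibus could pre-aggregate topology data under a stable name that the usage ping queries directly. This is only a sketch; the rule group, metric names, and file path below are hypothetical, not an agreed-upon naming scheme:

```yaml
# Hypothetical rule file shipped by Omnibus, e.g. under the Prometheus rules dir.
groups:
  - name: gitlab_topology
    rules:
      # Pre-aggregate per-node memory into a stable, service-level metric,
      # so the usage ping no longer depends on raw customer label config.
      - record: gitlab_usage:node_memory_bytes:sum
        expr: sum by (instance) (process_resident_memory_bytes)
```

A stable recorded name also means a customer relabeling their scrape config would at most break the rule visibly on their side, rather than silently skewing our aggregation.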
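The label-free idea (inferring services from metric names rather than from `job`) could be sketched with a metric-name matcher; the `sidekiq_` prefix here is just an illustrative pattern:

```promql
# Count series per instance whose metric name starts with "sidekiq_";
# a non-empty result for an instance suggests that node runs Sidekiq.
count by (instance) ({__name__=~"sidekiq_.*"})
```

Regex matches on `__name__` scan many series, which is where the query-performance concern above comes from; recording such a query into a cheap pre-computed series would mitigate that.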