Telemetry: Make topology data reports more robust
## Problem
We decided that Prometheus is the most promising vehicle for collecting memory & topology data for self-managed customers and querying it back prior to submitting it via a usage ping.
For the MVC we decided to associate and aggregate topology metric data by using Prometheus labels, specifically `job` and `instance`, largely because that is what's readily available in a vanilla Omnibus setup.
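As a rough illustration (the exact query is an assumption on my part, not the shipped one), the current approach boils down to keying everything on those two labels:

```promql
# Hypothetical sketch: aggregate process memory per service and node,
# relying solely on the `job` and `instance` labels being meaningful.
sum by (job, instance) (process_resident_memory_bytes)
```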
This approach has several drawbacks:
- **Accuracy.** The `job` label can be fairly broad. For instance, on .com we also use a `type` label to further dissect `gitlab-rails` targets into e.g. `web`, `api` or `git` nodes. Nor does `job` carry information about the web server used (`puma` or `unicorn`).
- **Correctness.** A larger issue is that we may be misreporting numbers in cases where there are further dimensions to consider that we are unaware of. For instance, if a customer emitted both `staging` and `prod` metrics to the Prometheus instance we query for data, we would aggregate over both, even though `prod` would likely be far more representative of e.g. memory consumption. Conversely, if a customer were to emit unrelated `process_*` metrics over the given `target` endpoint, we would accidentally fold those into the report.
- **Stability & availability.** Label config can be changed at will and is hence not very reliable, or at least not in the way it is used now. If a customer changed their label config, they might break our topology usage ping without us knowing about it.
## Proposal/Options
We should look into options for making this setup more reliable, so that we can expect topology pings to report correct metrics with higher confidence. Some ideas that have come up:
- **Use recording rules.** For query minimization, we have long planned to follow the same metrics catalog pattern we use for gitlab.com. By using recording rules in Omnibus, we can produce service-level metrics that abstract away these differences. This also makes it easier for us and for admins to see what data is recorded and shipped. #216660 (comment 358952794)
- **Refine Consul metrics.** Since we use Consul Service Discovery for multi-node setups, we could also describe the metadata more completely in Consul, then do some label rewriting in Prometheus.
- **Don't rely on labels as much.** Another idea was to use only the `instance` label to map metrics to a given node, and not use the `job` label at all. Instead, we could query a given instance for the metric names it has data for, and draw conclusions from those about which services are running on it. For example, an `instance` that has `sidekiq_` metrics is clearly running Sidekiq and not Puma. A major concern would be query performance, but perhaps this can be combined with recording rules for better efficiency.
- **Don't rely on node-exporter.** We currently use the `node_` metrics as the source of truth for which instances/nodes are running, and we use these data to key into service-level metrics. However, if a node is not running node-exporter but still runs other GitLab services, we currently get no data whatsoever for that node. We should make this more resilient so that in those cases we would still get `node_services` data, just not any top-level `node_*` metrics.
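To make the recording-rules option concrete, a rule shipped with Omnibus could pre-aggregate topology data under a stable name that the usage ping queries directly. This is only a sketch; the rule group, metric names, and file path below are hypothetical, not an agreed-upon naming scheme:

```yaml
# Hypothetical rule file shipped by Omnibus, e.g. under the Prometheus rules dir.
groups:
  - name: gitlab_topology
    rules:
      # Pre-aggregate per-node memory into a stable, service-level metric,
      # so the usage ping no longer depends on raw customer label config.
      - record: gitlab_usage:node_memory_bytes:sum
        expr: sum by (instance) (process_resident_memory_bytes)
```

A stable recorded name also means a customer relabeling their scrape config would at most break the rule visibly on their side, rather than silently skewing our aggregation.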
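The label-free idea (inferring services from metric names rather than from `job`) could be sketched with a metric-name matcher; the `sidekiq_` prefix here is just an illustrative pattern:

```promql
# Count series per instance whose metric name starts with "sidekiq_";
# a non-empty result for an instance suggests that node runs Sidekiq.
count by (instance) ({__name__=~"sidekiq_.*"})
```

Regex matches on `__name__` scan many series, which is where the query-performance concern above comes from; recording such a query into a cheap pre-computed series would mitigate that.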