Telemetry: Increase reliability of node to service mappings
For the `topology` usage ping, we currently map the node and service data we query from Prometheus to each other based on the `instance` and `job` labels.
The problem with `instance` is that:
- it might contain port numbers, so multiple `localhost` addresses won't match (we currently rewrite these in-app to drop the port number)
- the same physical location might be reported in different ways
For instance, if `node_exporter` reports memory metrics for `localhost`, and a service on that same node reports its metrics for `127.0.0.1` (the IPv4 address for the loopback interface, i.e. `localhost`), then we wouldn't correctly assign this data even though it's for the same host. This could happen in multi-node setups as well, in cases where host names are used for some metrics and IP addresses for others.
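
The mismatch can be reproduced with a minimal sketch. The `drop_port` helper and the sample values below are hypothetical and only illustrate the idea; they are not the actual in-app code. Dropping the port alone still leaves `localhost` and `127.0.0.1` as different keys:

```python
def drop_port(instance: str) -> str:
    """Strip a trailing ':port' from a 'host:port' label value.

    Assumption: values look like 'host' or 'host:port'; bare IPv6
    addresses are not handled in this sketch.
    """
    host, sep, port = instance.rpartition(":")
    return host if sep and port.isdigit() else instance


# Hypothetical sample data keyed by the port-less `instance` label.
node_memory = {drop_port("localhost:9100"): 8_000_000_000}
service_memory = {drop_port("127.0.0.1:9121"): 120_000_000}

# Instances that appear in both data sets and can therefore be joined.
matched = node_memory.keys() & service_memory.keys()
```

Here `matched` is empty, so the service data is never assigned to the node even though both metrics come from the same host.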
Moreover, we currently rely on `node_exporter` metrics as the SSOT for which instances/nodes exist. This means that if `node_exporter` is not running, we get 0 metrics. This is just due to how the implementation currently works (a simplification), not for any fundamental reason.
We should therefore:
- normalize `instance` values at least so that `localhost`/`127.0.0.1`/`0.0.0.0` all map to the same node
- consider all `instance` label values of all queries to get the `instance` set, not just `node_exporter` queries
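
Both proposals could look roughly like the following sketch. The names `normalize_instance`, `instance_set`, and `LOOPBACK_ALIASES`, as well as the shape of the query results (one list of label dicts per query), are assumptions made for illustration and do not reflect the existing implementation:

```python
# Assumption: these three values are treated as the same local node.
LOOPBACK_ALIASES = {"localhost", "127.0.0.1", "0.0.0.0"}
CANONICAL_LOOPBACK = "localhost"


def normalize_instance(instance: str) -> str:
    """Drop a trailing ':port' and collapse loopback aliases to one name."""
    host, sep, port = instance.rpartition(":")
    if not (sep and port.isdigit()):
        host = instance  # no port present, keep the value as-is
    return CANONICAL_LOOPBACK if host in LOOPBACK_ALIASES else host


def instance_set(query_results):
    """Collect normalized instances across the results of all queries.

    `query_results` is an iterable with one list of label dicts per query,
    so the set does not depend on any single exporter being present.
    """
    return {
        normalize_instance(labels["instance"])
        for result in query_results
        for labels in result
        if "instance" in labels
    }


# Hypothetical results: a service query and an (empty) node_exporter query.
service_result = [{"instance": "127.0.0.1:9121", "job": "redis"}]
node_exporter_result = []
instances = instance_set([service_result, node_exporter_result])
```

With these inputs `instances` contains only `localhost`, even though `node_exporter` returned nothing. This sketch does not cover the multi-node case where one source reports a host name and another an IP address; that would need additional resolution logic.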