Telemetry: Increase reliability of node to service mappings

For the topology usage ping, we currently map node and service data we query from Prometheus to each other based on the instance and job labels.

The problems with instance are:

- it might contain port numbers, so multiple addresses on the same host, such as localhost with different ports, won't match (we currently rewrite these in-app to drop the port number)
- the same host might be reported in different ways, for example by host name in one place and by IP address in another

For instance, if node_exporter reports memory metrics for localhost, and a service on that same node reports its metrics for 127.0.0.1 (the IPv4 address of the loopback interface, i.e. localhost), then we wouldn't correctly assign this data to the node even though it's for the same host. This could happen in multi-node setups as well, in cases where host names are used in one place and IP addresses in another.
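To make the mismatch concrete, here is a minimal Python sketch of the problem. It is not the actual in-app implementation: `strip_port`, the sample dictionaries, and the port numbers are all made up for illustration, and the helper assumes instance labels of the form `host` or `host:port` with a host name or IPv4 address.

```python
def strip_port(instance: str) -> str:
    """Drop a trailing ':<port>' from an instance label, if present.

    Assumes a host name or IPv4 address; bare IPv6 literals are not handled.
    """
    host, sep, port = instance.rpartition(":")
    return host if sep and port.isdigit() else instance


# Hypothetical samples, keyed by the raw Prometheus instance label.
node_memory = {"localhost:9100": 8_000_000_000}    # reported by node_exporter
service_memory = {"127.0.0.1:8080": 500_000_000}   # reported by a service

# Dropping the port is not enough to line the two up.
nodes = {strip_port(k): v for k, v in node_memory.items()}
services = {strip_port(k): v for k, v in service_memory.items()}

# Instances present in both data sets after port stripping.
matched = nodes.keys() & services.keys()
```

Here `matched` is empty: after port stripping the node is keyed by `localhost` and the service by `127.0.0.1`, so the service data is never assigned to the node it runs on.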

Moreover, we currently rely on node_exporter metrics as the single source of truth (SSOT) for which instances/nodes exist. This means that if node_exporter is not running, we get no metrics at all. This is just a simplification in the current implementation, not a fundamental limitation. We should therefore:

- normalize instance values, at least so that localhost/127.0.0.1/0.0.0.0 all map to the same node
- consider the instance label values of all queries to get the instance set, not just those of node_exporter queries
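The two proposals above could be sketched roughly as follows, in Python for brevity. This is only an illustration under stated assumptions: the names `normalize_instance` and `instance_set`, the choice of `localhost` as the canonical value, and the shape of `query_results` (query name mapped to a list of label dictionaries) are hypothetical and not taken from the existing code.

```python
# Values treated as the same local node, per the proposal above.
LOOPBACK_ALIASES = {"localhost", "127.0.0.1", "0.0.0.0"}
CANONICAL_LOOPBACK = "localhost"  # assumed canonical name, any alias would do


def normalize_instance(instance: str) -> str:
    """Drop the port and collapse loopback aliases to one canonical value."""
    host, sep, port = instance.rpartition(":")
    if not (sep and port.isdigit()):
        host = instance  # no port suffix to strip
    return CANONICAL_LOOPBACK if host in LOOPBACK_ALIASES else host


def instance_set(query_results: dict) -> set:
    """Collect normalized instance labels across all queries, not just node_exporter."""
    return {
        normalize_instance(sample["instance"])
        for samples in query_results.values()
        for sample in samples
    }


# Hypothetical query results: node_exporter is not running, so its query is empty.
query_results = {
    "node_memory": [],
    "service_memory": [
        {"instance": "127.0.0.1:8080", "job": "web"},
        {"instance": "localhost:9229", "job": "worker"},
        {"instance": "10.0.0.5:8080", "job": "web"},
    ],
}

instances = instance_set(query_results)
```

With this input, `instances` contains `localhost` and `10.0.0.5`, even though the node_exporter query returned nothing. Mapping host names to IP addresses in multi-node setups is a separate problem that this sketch does not attempt to solve.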

Edited Jul 14, 2020 by Matthias Käppler