Increase reliability of node to service mappings (!35756) · Merge requests · GitLab.org / GitLab

Matthias Käppler requested to merge 222530-improve-instance-label-handling into master Jul 01, 2020

What does this MR do?

We submit topology information pulled from Prometheus with Usage Ping. Currently we target single-node topologies only, i.e. all Omnibus services will run on the same node.

In order to determine which GitLab service runs on which node/host, we currently inspect the instance label returned with Prometheus queries. This can be tricky for the following reasons:

Services running on the same node might all report their instance in different ways. For example, a service binding itself to 0.0.0.0 to listen on all interfaces will be exported with that IP address. Moreover, localhost services can sometimes be reported as localhost, sometimes with the lo IP addresses (127.0.0.1 and ::1 for IPv4 and IPv6 respectively), yet they all run on the same host machine.
There is a problem with the current implementation where we only consider instances to be valid when returned from node_exporter metrics (since that in some sense is the authority for node metrics.) However, users can turn node_exporters off in Omnibus; this would currently mean that we would get no metrics whatsoever. We should instead consider all instance labels from all metrics we pull when forming the "candidate set" of observed instances/nodes.

This MR aims to address both of these issues.

Example

Pulled from an Omnibus container:

"topology": {
    "application_requests_per_hour": 140,
    "nodes": [
      {
        "node_memory_total_bytes": 33269903360,
        "node_cpus": 16,
        "node_services": [
          {
            "name": "prometheus",
            "process_count": 1,
            "process_memory_rss": 139467434
          },
          {
            "name": "web",
            "process_count": 16,
            "process_memory_rss": 772499797,
            "process_memory_uss": 100584448,
            "process_memory_pss": 142191872,
            "server": "puma"
          },
          {
            "name": "sidekiq",
            "process_count": 3,
            "process_memory_rss": 755548160,
            "process_memory_uss": 736721578,
            "process_memory_pss": 739047196
          },
          {
            "name": "node-exporter",
            "process_count": 1,
            "process_memory_rss": 15381845
          },
          {
            "name": "redis",
            "process_count": 1,
            "process_memory_rss": 13333845
          },
          {
            "name": "postgres",
            "process_count": 1,
            "process_memory_rss": 16414037
          },
          {
            "name": "workhorse",
            "process_count": 1,
            "process_memory_rss": 33464320
          },
          {
            "name": "gitaly",
            "process_count": 1,
            "process_memory_rss": 34036394
          }
        ]
      }
    ],
    "duration_s": 0.03569750499809743,
    "failures": [

    ]
  }

The fix here is visible in that I am binding prometheus to 0.0.0.0. instead of localhost. It is now included in the topology payload, when previously, it was not, as can be seen e.g. here: !35618 (merged)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
[-] Tested in all supported browsers
[-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Test in Omnibus

Closes #222530 (closed)

Edited Jul 03, 2020 by Matthias Käppler

Increase reliability of node to service mappings

What does this MR do?

Example

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Merge request reports