Add missing topology metrics to usage ping

Matthias Käppler requested to merge 216660-add-basic-topology-metrics into master

What does this MR do?

This is a follow-up to !32315 (merged), which was the initial iteration that added Prometheus support to usage pings, plus a single node metric.

This MR adds all remaining topology metrics that we would like to track via Usage Ping for the MVC, specifically:

  • the number of CPU cores per node
  • which Ruby services are running on each node, plus:
    • service process count
    • process memory RSS (resident set size)
    • process memory USS (unique set size)
    • process memory PSS (proportional set size)

This will give us a good initial idea of how customers deploy GitLab and how much memory the primary services consume (the Rails services for now; we are looking to extend this to other components later on).
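
For illustration, the per-node numbers could be backed by PromQL along these lines. This is a sketch only: the metric names, label matchers, and aggregations below are assumptions about what Prometheus exposes, not necessarily the exact queries this MR ships.

# Sketch of possible PromQL behind the topology metrics (assumed metric names)
TOPOLOGY_QUERIES = {
  # one "idle" CPU time series per core on each node (node_exporter)
  node_cpus:          'count by (instance) (node_cpu_seconds_total{mode="idle"})',
  # one per-process series per Ruby process, grouped into services via the job label
  process_count:      'count by (instance, job) (ruby_process_resident_memory_bytes)',
  # per-service memory, aggregated per node (sum vs. average is up to the implementation)
  process_memory_rss: 'sum by (instance, job) (ruby_process_resident_memory_bytes)',
  process_memory_uss: 'sum by (instance, job) (ruby_process_unique_memory_bytes)',
  process_memory_pss: 'sum by (instance, job) (ruby_process_proportional_memory_bytes)'
}.freeze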

NOTE that, as with the original MR, all of this only applies to single-node installations for now, since we do not yet have the capability to locate an external Prometheus node. This will change at some point in the future, though, so it can never hurt to look at this through the "future looking glass" 🔭

I also decided to extract all topology-related usage data collection into a Concern, since it was getting far too complex to keep living in UsageData itself. That makes it easier to test, too.
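
Roughly, the shape of that extraction looks like the sketch below. The module, method, and helper names here are illustrative assumptions, not necessarily the identifiers used in the diff; the payload it builds matches the Example section further down.

# Illustrative sketch only -- names are assumptions, not the actual diff.
module UsageDataConcerns
  module Topology
    def topology_usage_data
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)

      {
        topology: {
          # one entry per node, each with node_cpus, node_memory_total_bytes
          # and a node_services array -- see the example payload below
          nodes: topology_node_data,
          duration_s: Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
        }
      }
    rescue StandardError
      { topology: {} } # don't let a Prometheus hiccup break the whole usage ping
    end

    private

    def topology_node_data
      # would run the PromQL queries sketched above against the local
      # Prometheus instance and fold the results into per-node hashes
      []
    end
  end
end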

See also #216660 (closed)

Example

Here's what the payload will look like (pulled from a QA preview container):

{
  "topology": {
    "nodes": [
      {
        "node_memory_total_bytes": 33269903360,
        "node_cpus": 16,
        "node_services": [
          {
            "name": "gitlab_rails",
            "process_count": 16,
            "process_memory_pss": 233349888,
            "process_memory_rss": 788220927,
            "process_memory_uss": 195295487
          },
          {
            "name": "gitlab_sidekiq",
            "process_count": 1,
            "process_memory_pss": 734080000,
            "process_memory_rss": 750051328,
            "process_memory_uss": 731533312
          }
        ]
      }
    ],
    "duration_s": 0.013836685999194742
  }
}

I'm looking for feedback on this data structure as well (~"group::telemetry").

Performance impact

Querying Prometheus as part of the usage ping of course raises questions about performance impact. I benchmarked running the 4 queries currently used against our production Prometheus servers. The benchmark can be found here: https://gitlab.com/gitlab-org/gitlab/snippets/1983636
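
Judging by the labels in the output, the snippet times one query set per production Prometheus server with Benchmark.bmbm. Here is a rough reconstruction of its shape, not the snippet verbatim; the server URLs and the query list are placeholders.

require 'benchmark'
require 'json'
require 'net/http'

# Placeholders standing in for the two production Prometheus servers
APP_PROM  = 'https://prometheus-app.example.com'
MAIN_PROM = 'https://prometheus.example.com'
QUERIES   = ['count by (instance) (node_cpu_seconds_total{mode="idle"})'] # etc.

def run_queries(base_url, queries)
  queries.each do |promql|
    uri = URI("#{base_url}/api/v1/query")
    uri.query = URI.encode_www_form(query: promql)
    JSON.parse(Net::HTTP.get(uri))
  end
end

Benchmark.bmbm do |x|
  x.report('app queries')  { run_queries(APP_PROM, QUERIES) }
  x.report('main queries') { run_queries(MAIN_PROM, QUERIES) }
  x.report('all queries')  { [APP_PROM, MAIN_PROM].each { |url| run_queries(url, QUERIES) } }
end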

The results of sending all 4 queries swing quite wildly, between 0.9 and 5 seconds:

Rehearsal ------------------------------------------------
app queries    0.043742   0.000242   0.043984 (  0.699953)
main queries   0.034275   0.000000   0.034275 (  0.581425)
all queries    0.081021   0.003966   0.084987 (  1.252002)
--------------------------------------- total: 0.163246sec

                   user     system      total        real
app queries    0.024574   0.000000   0.024574 (  0.586431)
main queries   0.017995   0.003984   0.021979 (  0.473990)
all queries    0.042974   0.000185   0.043159 (  1.151759)
Rehearsal ------------------------------------------------
app queries    0.022172   0.000135   0.022307 (  0.475476)
main queries   0.012620   0.008028   0.020648 (  0.465121)
all queries    0.040645   0.000042   0.040687 (  0.954856)
--------------------------------------- total: 0.083642sec

                   user     system      total        real
app queries    0.045079   0.000069   0.045148 (  0.491550)
main queries   0.038586   0.000155   0.038741 (  0.479245)
all queries    0.046054   0.003882   0.049936 (  5.009503)
Rehearsal ------------------------------------------------
app queries    0.020487   0.000270   0.020757 (  0.503571)
main queries   0.021728   0.000336   0.022064 (  0.571416)
all queries    0.041445   0.000320   0.041765 (  1.104775)
--------------------------------------- total: 0.084586sec

                   user     system      total        real
app queries    0.019996   0.003766   0.023762 (  0.466840)
main queries   0.021804   0.000271   0.022075 (  0.486701)
all queries    0.040141   0.000236   0.040377 (  4.955309)
Rehearsal ------------------------------------------------
app queries    0.016684   0.003810   0.020494 (  0.461673)
main queries   0.025718   0.000325   0.026043 (  4.465426)
all queries    0.039920   0.000223   0.040143 (  0.995094)
--------------------------------------- total: 0.086680sec

                   user     system      total        real
app queries    0.023683   0.000000   0.023683 (  0.530871)
main queries   0.016763   0.003819   0.020582 (  0.513517)
all queries    0.037971   0.004302   0.042273 (  1.195756)
Rehearsal ------------------------------------------------
app queries    0.017204   0.004286   0.021490 (  0.474670)
main queries   0.020740   0.000262   0.021002 (  0.565457)
all queries    0.040274   0.000015   0.040289 (  1.125658)
--------------------------------------- total: 0.082781sec

                   user     system      total        real
app queries    0.023898   0.000214   0.024112 (  0.445421)
main queries   0.022364   0.000091   0.022455 (  0.448126)
all queries    0.042450   0.000000   0.042450 (  0.942314)

Considering that this job only runs once a week, and that it is querying gitlab.com data, which I presume is much vaster than any of our customers', this is probably acceptable.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Notes on testing:

To test this locally, I have published an Omnibus image via CI that can be pulled like so:

docker pull registry.gitlab.com/gitlab-org/build/omnibus-gitlab-mirror/gitlab-ee:02233b62afd6d122236ecb5ff118cc47fe7bc062

When running this container, you can preview the Usage Ping payload as usual from the Admin Area > Usage statistics panel.
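
For reference, running that image locally could look something like this (the hostname, published ports, and container name are just examples, not requirements):

docker run --detach --hostname gitlab.example.com \
  --publish 443:443 --publish 80:80 --publish 22:22 \
  --name gitlab \
  registry.gitlab.com/gitlab-org/build/omnibus-gitlab-mirror/gitlab-ee:02233b62afd6d122236ecb5ff118cc47fe7bc062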
