
Usage Ping recording rules: Fix potential CPU overcounting problem

Matthias Käppler requested to merge mk/usage-ping-fix-overcounting into master

See gitlab#229457 (comment 382089422)

For usage ping queries, there was a potential issue where, if node_exporter metrics carry extra dimensions that we don't account for, we might overcount CPUs. This is in response to some submissions we were seeing with a very high number of reported cores, which seems unlikely for single-node deployments.

This MR addresses this by:

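  • relying on the existing instance:node_cpus:count metric, which aggregates using without rather than by, so any extra dimensions remain separate series instead of being added into a single per-instance count
  • taking the max by instance over that metric, so only one of those series counts per instance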
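The following minimal Python sketch shows why counting without (cpu, mode) and then taking the max by instance avoids the overcount. It emulates the relevant PromQL aggregation semantics on in-memory series. Several parts are assumptions for illustration only:

- instance:node_cpus:count is assumed to be defined as a count without (cpu, mode) over idle node_cpu_seconds_total series.
- The "naive" count by (instance) is just an example of a query that collapses extra dimensions. It is not necessarily the query this MR replaced.
- The `replica` label is a hypothetical extra dimension.

```python
from collections import defaultdict

# Hypothetical idle-CPU series from node_exporter for a single 4-core host.
# The "replica" label is an invented extra dimension, e.g. the same exporter
# scraped twice with different target labels.
SERIES = [
    {"instance": "host-1", "cpu": str(cpu), "mode": "idle", "replica": replica}
    for cpu in range(4)
    for replica in ("a", "b")
]


def count_by(series, keep):
    """Emulate `count by (keep...)`: group on the kept labels only."""
    groups = defaultdict(int)
    for s in series:
        key = tuple(sorted((k, v) for k, v in s.items() if k in keep))
        groups[key] += 1
    return dict(groups)


def count_without(series, drop):
    """Emulate `count without (drop...)`: group on every label except `drop`."""
    groups = defaultdict(int)
    for s in series:
        key = tuple(sorted((k, v) for k, v in s.items() if k not in drop))
        groups[key] += 1
    return dict(groups)


def max_by(samples, keep):
    """Emulate `max by (keep...)` over {label-tuple: value} samples."""
    out = {}
    for key, value in samples.items():
        kept = tuple((k, v) for k, v in key if k in keep)
        out[kept] = max(out.get(kept, value), value)
    return out


# Collapsing straight to the instance merges both replicas: 8 "cores".
naive = count_by(SERIES, {"instance"})

# Counting without (cpu, mode) keeps one series per replica (4 each);
# max by (instance) then picks a single one of them.
fixed = max_by(count_without(SERIES, {"cpu", "mode"}), {"instance"})

print(naive)  # {(('instance', 'host-1'),): 8}
print(fixed)  # {(('instance', 'host-1'),): 4}
```

Taking the max rather than the sum over the remaining groups is what makes the result robust: duplicated series inflate a sum but not a maximum.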

On the client, we then already take max_over_time over this value to get a rolling 7-day maximum.
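The client-side query itself is not part of this MR. As a rough illustration of what a rolling 7-day max_over_time does, here is a small Python emulation. The sample dates and values are invented, and both window endpoints are included for simplicity, which may differ slightly from Prometheus's exact range-selector boundaries.

```python
from datetime import datetime, timedelta

SEVEN_DAYS = timedelta(days=7)


def max_over_time(samples, end, window=SEVEN_DAYS):
    """Rough emulation of PromQL `max_over_time(metric[7d])` evaluated at `end`.

    `samples` is a list of (timestamp, value) pairs for one series. For
    simplicity both window endpoints are included; returns None when the
    window holds no samples.
    """
    in_window = [v for ts, v in samples if end - window <= ts <= end]
    return max(in_window) if in_window else None


# Hypothetical daily values of the per-instance CPU count.
start = datetime(2020, 7, 1)
values = [16, 4, 4, 8, 4, 4, 4, 4, 4, 4]
samples = [(start + timedelta(days=d), v) for d, v in enumerate(values)]

print(max_over_time(samples, start + timedelta(days=6)))  # 16
print(max_over_time(samples, start + timedelta(days=9)))  # 8
```

Because the window keeps the largest value it has seen, a single inflated sample stays visible for up to 7 days. This is why the per-instance value fed into it must not be overcounted in the first place.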

I also changed the node memory query to take the maximum instead of the average. This will make it easier to map submissions onto our reference architectures. Otherwise, if the amount of physical memory changes between two submissions, we would take the average of the two, which is not helpful. This will require a small change in the client as well.
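A toy Python comparison shows why an average is unhelpful here. It assumes the memory query is based on node_exporter's node_memory_MemTotal_bytes, and the sample values (an 8 GiB host upgraded to 16 GiB) are invented for illustration.

```python
GIB = 1024 ** 3

# Hypothetical node_memory_MemTotal_bytes samples for one instance: the host
# is upgraded from 8 GiB to 16 GiB halfway through the reporting period.
memory_samples = [8 * GIB] * 4 + [16 * GIB] * 4


def avg(values):
    """Arithmetic mean, as an avg-based query would report it."""
    return sum(values) / len(values)


avg_gib = avg(memory_samples) / GIB
max_gib = max(memory_samples) / GIB

print(avg_gib)  # 12.0: matches neither real configuration
print(max_gib)  # 16.0: a memory size the host actually had
```

The maximum always corresponds to a configuration that really existed, so it can be compared directly against a reference architecture's memory requirement.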

Edited by Matthias Käppler
