Submit failures with topology usage ping

Matthias Käppler requested to merge 222344-usage-ping-topology-status into master

What does this MR do?

When collecting data for the weekly Usage Ping, we currently follow a very defensive approach that swallows exceptions and maps them to default values. This reduces knock-on effects where a failure in a single component would otherwise fail the entire submission.

For the topology ping that we are sending as part of &3209, we collect usage data from Prometheus. There are several reasons why this may fail:

  • The customer has disabled the embedded Prometheus
  • Prometheus is enabled, but cannot be reached
  • Prometheus can be reached, but
    • did not return any results for a given query
    • failed for the given query
    • returned an unexpected response
  • Something else went wrong that we did not anticipate

This MR aims to capture all of these failure modes by:

  1. Tracking a new top-level Usage Ping field prometheus_enabled (true|false); see the sketch after this list
  2. Introducing a new failures field on the topology element so we can track failures on a per-query level
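
For illustration only (assuming topology also sits at the top level of the ping payload, as in the snippet further below), the new flag would look like this:

{
  "prometheus_enabled": true,
  "topology": {
    ...
  }
}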

Implementation

The failures field is the main addition here, and looks as follows:

{
  "topology": {
    ...
    "failures": [
      { "node_memory": "Gitlab::PrometheusClient::ConnectionError" },
      { "service_process_count": "empty_result" } 
    ]
  }
}
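
A minimal sketch of how a single query could be wrapped to feed this field; the method, client call, and metric name below are illustrative assumptions, not the actual implementation:

# Illustrative only: the metric name and helper structure are hypothetical.
def topology_usage_data(client)
  failures = []

  node_memory =
    begin
      result = client.query('gitlab_usage_ping:node_memory_total_bytes:avg') # hypothetical query
      if result.empty?
        # The query succeeded but returned no series; record a symbolic value.
        failures << { node_memory: 'empty_result' }
        nil
      else
        result
      end
    rescue StandardError => e
      # Record only the exception class name, never the message, to keep
      # cardinality low and avoid uploading potentially sensitive data.
      failures << { node_memory: e.class.to_s }
      nil
    end

  { node_memory: node_memory, failures: failures }
end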

In this example, two queries failed: one with a connection error, the other because no results were found (which should not happen; we should always have data for the things we query for).

We can then collect this information in the data warehouse and create visualizations for what kinds of errors occur and how often. This should give us a sense of the most likely failure modes.

There were two major concerns I had when implementing this:

  1. Cardinality of different value sets. We want to make sure that we do not produce too many distinct values, since that would make it difficult to aggregate this data downstream. This is why I decided to include only the exception type, not the exception message; the number of possible exceptions on this code path is fairly limited (perhaps a dozen or so).
  2. Data privacy. Exceptions can potentially carry sensitive data. We must make sure not to accidentally leak this data by uploading it to gitlab.com, which is another reason why I decided not to include error messages here.

The list of possible failure keys representing the data sets we query for is:

  • node_memory
  • node_cpus
  • app_requests
  • service_rss
  • service_uss
  • service_pss
  • service_process_count
  • other

The list of possible failure values is:

  • the fully qualified name of any exception class
  • empty_result

To get better insight into the different Prometheus failure modes, I also decided to break up the Prometheus client's generic Error class into several subclasses that better communicate what caused the error.
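
As a sketch of what that split could look like (ConnectionError appears in the example above; the other subclass names are illustrative, not necessarily the ones introduced here):

# Sketch only: reopening the client class to define more specific error types.
module Gitlab
  class PrometheusClient
    Error = Class.new(StandardError)

    # More specific subclasses communicate what caused the failure.
    ConnectionError = Class.new(Error)          # Prometheus could not be reached
    UnexpectedResponseError = Class.new(Error)  # response was not what we expected
    QueryError = Class.new(Error)               # the query itself failed
  end
end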

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Closes #222344 (closed)
