Submit failures with topology usage ping

Matthias Käppler requested to merge 222344-usage-ping-topology-status into master

What does this MR do?

When collecting data for the weekly Usage Ping, we currently follow a very defensive approach that swallows exceptions and maps them to default values. This reduces knock-on effects where a failure in a single component would otherwise fail the entire submission.

For the topology ping that we are sending as part of &3209, we collect usage data from Prometheus. There are several reasons why this may fail:

  • The customer has disabled the embedded Prometheus
  • Prometheus is enabled, but cannot be reached
  • Prometheus can be reached, but
    • did not return any results for a given query
    • failed for the given query
    • returned an unexpected response
  • Something else went wrong that we did not anticipate

This MR aims to capture all of these failure modes by:

  1. Tracking a new top-level Usage Ping field prometheus_enabled (true|false); see the sketch after this list
  2. Introducing a new failures field on the topology element so we can track failures on a per-query level
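
For illustration only (assuming topology also sits at the top level of the ping payload, as in the snippet further below), the new flag would look like this:

{
  "prometheus_enabled": true,
  "topology": {
    ...
  }
}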

Implementation

The failures field is the main addition here, and looks as follows:

{
  "topology": {
    ...
    "failures": [
      { "node_memory": "Gitlab::PrometheusClient::ConnectionError" },
      { "service_process_count": "empty_result" } 
    ]
  }
}
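
A minimal sketch of how a single query could be wrapped to feed this field; the method, client call, and metric name below are illustrative assumptions, not the actual implementation:

# Illustrative only: the metric name and helper structure are hypothetical.
def topology_usage_data(client)
  failures = []

  node_memory =
    begin
      result = client.query('gitlab_usage_ping:node_memory_total_bytes:avg') # hypothetical query
      if result.empty?
        # The query succeeded but returned no series; record a symbolic value.
        failures << { node_memory: 'empty_result' }
        nil
      else
        result
      end
    rescue StandardError => e
      # Record only the exception class name, never the message, to keep
      # cardinality low and avoid uploading potentially sensitive data.
      failures << { node_memory: e.class.to_s }
      nil
    end

  { node_memory: node_memory, failures: failures }
end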

In this example, two queries failed: one with a connection error, the other because no results were found (which should not happen; we should always have data for the things we query for).

We can then collect this information in the data warehouse and create visualizations for what kinds of errors occur and how often. This should give us a sense of the most likely failure modes.

There were two major concerns I had when implementing this:

  1. Cardinality of different value sets. We want to make sure that we do not produce too many distinct values, since that would make it difficult to aggregate this data downstream. This is why I decided to include only the exception type, not the exception message; the number of possible exceptions on this code path is fairly limited (perhaps a dozen or so).
  2. Data privacy. Exceptions can potentially carry sensitive data. We must make sure not to accidentally leak this data by uploading it to gitlab.com, which is another reason why I decided not to include error messages here.

The list of possible failure keys representing the data sets we query for is:

  • node_memory
  • node_cpus
  • app_requests
  • service_rss
  • service_uss
  • service_pss
  • service_process_count
  • other

The list of possible failure values is:

  • the fully qualified name of any exception class
  • empty_result

To get better insight into the different Prometheus failure modes, I also decided to break up the Prometheus client's generic Error class into several subclasses that better communicate what caused the error.
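
As a sketch of what that split could look like (ConnectionError appears in the example above; the other subclass names are illustrative, not necessarily the ones introduced here):

# Sketch only: reopening the client class to define more specific error types.
module Gitlab
  class PrometheusClient
    Error = Class.new(StandardError)

    # More specific subclasses communicate what caused the failure.
    ConnectionError = Class.new(Error)          # Prometheus could not be reached
    UnexpectedResponseError = Class.new(Error)  # response was not what we expected
    QueryError = Class.new(Error)               # the query itself failed
  end
end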

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Closes #222344 (closed)
