Submit failures with topology usage ping
## What does this MR do?
When collecting data for the weekly Usage Ping, we currently follow a very defensive approach that swallows exceptions and maps them to default values. This is to reduce knock-on effects, where a failure in one component of the Usage Ping would fail the entire submission.
For the topology ping that we are sending as part of &3209, we collect usage data from Prometheus. There are several reasons why this may fail:
- The customer has disabled the embedded Prometheus
- Prometheus is enabled, but cannot be reached
- Prometheus can be reached, but
  - did not return any results for a given query
  - failed for the given query
  - returned an unexpected response
- Something else went wrong that we did not anticipate
This MR aims to capture all of these failure modes by:

- Tracking a new top-level Usage Ping field `prometheus_enabled` (`true`|`false`)
- Introducing a new `failures` field on the `topology` element so we can track failures on a per-query level
## Implementation
The `failures` field is the main addition here and looks as follows:
```json
{
  "topology": {
    ...
    "failures": [
      { "node_memory": "Gitlab::PrometheusClient::ConnectionError" },
      { "service_process_count": "empty_result" }
    ]
  }
}
```
In this example, two queries failed: one with a connection error, the other because no results were found (which should not happen; we should always have data for the things we query for).
We can then collect this information in the data warehouse and create visualizations for what kinds of errors occur and how often. This should give us a sense of the most likely failure modes.
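For illustration, a defensive helper along these lines could populate the `failures` array. The method name, call sites, and `IOError` here are hypothetical sketches, not the actual MR code:

```ruby
# Hypothetical sketch of per-query failure tracking; names are
# illustrative, not the actual MR implementation.
def query_with_failure_tracking(failures, key)
  result = yield
  if result.nil? || result.empty?
    # The query worked but returned no data points.
    failures << { key => 'empty_result' }
    nil
  else
    result
  end
rescue StandardError => e
  # Record only the exception class, never the message, to keep value
  # cardinality low and avoid uploading potentially sensitive data.
  failures << { key => e.class.name }
  nil
end

failures = []
query_with_failure_tracking(failures, 'node_memory') { raise IOError, 'connection refused' }
query_with_failure_tracking(failures, 'service_process_count') { [] }
failures
# => [{ 'node_memory' => 'IOError' }, { 'service_process_count' => 'empty_result' }]
```

The key point is that every query either contributes data or contributes a low-cardinality failure entry, so a single bad query can never abort the whole submission.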
There were two major concerns I had when implementing this:
- Cardinality of different value sets. We want to make sure that we do not produce too many different values, since that would make it difficult to aggregate this data downstream. This is why I decided, for example, not to include the exception message, only the exception type; the number of possible exceptions on this code path is fairly limited (perhaps a dozen or so).
- Data privacy. Exceptions can potentially carry sensitive data. We must make sure to not accidentally leak and upload this data to gitlab.com, which is another reason why I decided not to include error messages here.
The list of possible failure keys, representing the data sets we query for, is:

- `node_memory`
- `node_cpus`
- `app_requests`
- `service_rss`
- `service_uss`
- `service_pss`
- `service_process_count`
- `other`
The list of possible failure values is:

- any fully qualified `Exception` class
- `empty_result`
To get better insight into the different Prometheus failure modes, I also decided to break up the client's generic `Error` class into several more specific kinds that better communicate what caused the error.
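As a sketch of what such a split might look like: only `ConnectionError` appears in the example payload above; the other subclass names are made up for illustration.

```ruby
# Illustrative error hierarchy; subclass names other than
# ConnectionError are hypothetical.
module PrometheusClient
  Error = Class.new(StandardError)

  # Prometheus could not be reached at all
  ConnectionError = Class.new(Error)
  # The query itself failed on the server side
  QueryError = Class.new(Error)
  # The response did not have the expected shape
  UnexpectedResponseError = Class.new(Error)
end

# Callers can still rescue the generic class, while the failure report
# records the specific subclass name:
begin
  raise PrometheusClient::ConnectionError, 'refused'
rescue PrometheusClient::Error => e
  e.class.name
end
# => "PrometheusClient::ConnectionError"
```

Because every specific class inherits from the generic `Error`, existing rescue clauses keep working while the failure entries become more informative.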
## Does this MR meet the acceptance criteria?
### Conformity
- [-] Changelog entry: changes to Usage Ping payloads are covered by our Privacy Policy and do not need changelog entries.
- [ ] Documentation (if required)
- [ ] Code review guidelines
- [ ] Merge request performance guidelines
- [ ] Style guides
- [-] Database guides
- [-] Separation of EE specific content
### Availability and Testing
- [ ] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
- [ ] Tested in GCK container
Closes #222344