Implement framework for loggers and observers using sampling / thresholds


Description

We use Elastic / Kibana and Prometheus extensively. The volume of logs we need to ingest and index in Elastic, together with the cardinality explosion in Prometheus, is slowly starting to cause problems.

We need better workflows that make it easier to use Elastic / Prometheus effectively without overloading either system.

Assumption

We rarely need completely accurate logging / metrics. Situations like the recent subtransaction instrumentation, where we had to inspect every transaction to find a SAVEPOINT, are infrequent.

💡 We could benefit from Prometheus metrics / Elastic logs even if we only log / observe a subset of traffic / requests / measurements.

💡 We need a generic framework, with documentation, that engineers can use easily.

Problem

A caller invokes some area of code hundreds of times per second, but I only want to emit roughly 1 log entry / measurement per second.
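
To make this concrete, here is a minimal sketch of the primitive such a framework would need: a thread-safe gate that lets at most N events per second through at a given call site. This is plain Ruby; `SampledGate` is a hypothetical name, not an existing class in our codebase.

```ruby
# Hypothetical sketch: a thread-safe gate that allows at most
# `rate_per_second` events through; everything else is dropped.
class SampledGate
  def initialize(rate_per_second)
    @interval = 1.0 / rate_per_second
    @mutex = Mutex.new
    @next_allowed_at = 0.0
  end

  # Returns true at most `rate_per_second` times per second.
  def allow?
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    @mutex.synchronize do
      return false if now < @next_allowed_at

      @next_allowed_at = now + @interval
      true
    end
  end
end

# The hot path runs constantly, but the log line is emitted
# at most once per second.
GATE = SampledGate.new(1)
1_000.times { puts 'sampled log line' if GATE.allow? }
```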

Proposal

➡️ Implement "sampled / limited / rate limited" Prometheus observations, controllable via feature flags (sketch below). This might be useful in cases where we have 500 observations per second but only need a handful for the metric to stay useful. Even better if we make it possible to use this technique to limit observations to a subset of the fleet, reducing cardinality even further.
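
A minimal sketch of what the sampled observation side could look like, assuming a plain wrapper around an existing histogram. `SampledHistogram` and the `ratio` argument are illustrative names; in practice the ratio would be driven by a feature flag, and the forwarding call would need to match the signature of whichever Prometheus client we use.

```ruby
# Hypothetical sketch: wrap a Prometheus histogram so only
# ~1 in `ratio` calls records an observation.
class SampledHistogram
  def initialize(histogram, ratio: 100)
    @histogram = histogram
    @ratio = ratio
  end

  # Forward roughly 1 in `ratio` observations to the underlying
  # histogram. For latency-style metrics the shape of the distribution
  # survives sampling even though the raw observation count drops.
  def observe(value, labels = {})
    return unless rand(@ratio).zero?

    # NOTE: adjust this call to match the Prometheus client in use.
    @histogram.observe(value, labels)
  end
end

# Demo with a stand-in histogram object:
stub = Object.new
def stub.observe(value, labels)
  puts "observed #{value.round(3)} #{labels}"
end
sampled = SampledHistogram.new(stub, ratio: 100)
500.times { sampled.observe(rand) } # records ~5 observations on average
```

Note that this works best for histograms; a sampled counter would need its increments scaled back up by the sampling ratio to stay meaningful.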

➡️ Implement a "sampled / limited / rate limited" Elastic logger. Perhaps something like an Application JSON Logger with a threshold that caps the number of logs emitted based on a configured percentage (sketch below). For example: `/chatops run feature set application_json_logging_threshold_[some_feature] 10` to keep 10% of logs and drop the other 90%.
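
A sketch of what that threshold logger could look like as a standalone class, assuming the threshold is the percentage of log lines to keep. `ThresholdJsonLogger` is a hypothetical name; in GitLab the percentage would be read from the feature flag above rather than passed to the constructor.

```ruby
require 'json'
require 'logger'

# Hypothetical sketch of the proposed JSON logger with threshold.
# A threshold of 10 keeps ~10% of log lines and drops the other 90%.
class ThresholdJsonLogger
  def initialize(io = $stdout, threshold_percent: 100)
    @logger = Logger.new(io)
    @threshold = threshold_percent
  end

  # Emit the payload as JSON for only ~threshold_percent of calls.
  def info(payload)
    return if rand(100) >= @threshold

    @logger.info(JSON.generate(payload))
  end
end

# Usage: with threshold_percent: 10, ~90% of these calls are dropped.
log = ThresholdJsonLogger.new(threshold_percent: 10)
100.times { |i| log.info(message: 'sampled event', iteration: i) }
```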

Thoughts @andrewn?
