Break the metrics-catalog down into a multi-stage build process
This issue is very far from complete: it's a very early-stage proposal, but I'm writing it down here as a placeholder.
It's been a while since I made any major changes to the Metrics-Catalog.
I have several initial impressions, given my time away from this project:
- It's complicated. I already knew this, but coming back, it's not trivial to get started. The Jsonnet is deep and quite impenetrable, which makes small changes difficult.
- Related to the previous point, it's huge. So much code, all tightly coupled.
- It takes an awfully long time to run `make generate` these days. This makes it very difficult to get anything done, as I have enough time to make a coffee while `make generate` runs.
- Pipeline execution time is really slow
- Documentation on the features and capabilities of the metrics-catalog is scarce and difficult to find
- Testing is quite hard
## How can we improve things?
At the moment, there isn't a lot of structure in the Metrics Catalog. We really have a huge ball of mud.
We run `make generate` and a massive amount of configuration is generated:
Lots of Jsonnet -> Lots of configuration
## Proposal: multi-stage build process
Instead of a single program that generates all recording rules for every target (all environments; Thanos, Prometheus and Mimir; GitLab.com and the Reference Architectures used by GitLab Dedicated), we move to a multi-stage process:
- Each stage generates the configuration for the next stage.
- Not all stages may be used in the final configuration.
- Intermediate stages are anaemic: straight YAML configuration which is used to generate the follow-on configuration.
- Consider using something like the Grafana kind-registry: https://github.com/grafana/kind-registry. A single set of CUE files can be used to generate JSON Schema, Golang client libraries, etc.
- The interface for each stage is built using a schema. The upstream stage produces the config, which is validated and then processed (a sketch of what this could look like follows below this list).
- Final stages generate config for Prometheus, Grafana, Helm, and so on.
- Intermediate stages are validated against JSON Schema or OpenAPI (via a CUE schema?). This helps with fast failure (better feedback for developers), but the schemas can also be used to generate documentation for features and capabilities.
- Each stage can be tested independently of the others
- Caching becomes much more effective. At present, if a file that is used by all configurations changes, everything needs to be regenerated; with intermediate stage configurations, any stage whose inputs haven't changed can be skipped.
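As a very rough sketch of what a stage interface could look like, assuming CUE is used as the schema language (every name below, such as `#ServiceMetrics`, `slis` and `apdexThreshold`, is hypothetical and not the real catalog interface):

```cue
// stage_interface.cue: a hypothetical schema for one intermediate stage.
package stages

// The shape an upstream stage must emit for each service.
#ServiceMetrics: {
	service: string
	tier:    "sv" | "db" | "inf"
	slis: [Name=string]: {
		description:     string
		requestRate:     string // PromQL expression for the request rate
		errorRate?:      string // optional PromQL expression for errors
		apdexThreshold?: number & >=0 & <=1
	}
}

// A concrete intermediate artifact: unification with the definition
// above means an invalid artifact fails fast, before the next stage runs.
web: #ServiceMetrics & {
	service: "web"
	tier:    "sv"
	slis: puma: {
		description:    "Rails requests served by Puma"
		requestRate:    "rate(http_requests_total{job=\"gitlab-rails\"}[5m])"
		apdexThreshold: 0.995
	}
}
```

An upstream stage would emit plain YAML in this shape, and something like `cue vet intermediate.yaml stage_interface.cue` can reject a malformed artifact before the downstream stage consumes it. The same definitions could drive JSON Schema/OpenAPI generation (CUE ships encoding packages for this), which is one possible source for the missing feature documentation. And because each stage's output is a plain file, a content hash of the artifact is enough to decide whether downstream stages need to be re-run at all.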