Modeling and sharing of observability data
We currently generate dashboards automatically for well-known Prometheus exporters, such as NGINX Ingress. This is convenient, but we need to go much further and extend this model to any metric within an organization, or beyond.
A project/microservice should be able to define the metrics it outputs, as well as a set of charts that represent its key indicators of performance and health. Any group, dashboard, or workflow that is trying to make sense of metrics data can then leverage these models.
Further, it should also describe the tags in use, as well as any mapping information if the tag names vary between errors/logs/traces/metrics.
Sample use case
For example:
- A service map typically represents a DAG of microservices. Clicking on one frequently displays a basic set of metrics, usually throughput, error rate, and latency.
- While those are useful metrics, it may be more descriptive and relevant to show others. For example, a credit card processing service may want to expose additional state information, such as failed, rejected, and successful transactions.
- With modeled metrics, the Service Map could display the most relevant metrics for each service.
- If a particular node or host is generating a lot of failed transactions, one could click on the host tag and pull up relevant errors, logs, and traces even if the tag names are slightly different across these sources.
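The cross-source lookup in the last bullet could be sketched roughly as follows. This is only an illustration, assuming a hypothetical per-project tag map: the source tag names (`instance`, `hostname`, `host.name`, `server_name`) and the `translate_tag` helper are invented for this example and are not an existing GitLab API.

```python
# Hypothetical tag model: canonical tag name -> the name each source uses.
TAG_MAP = {
    "host": {
        "metrics": "instance",
        "logs": "hostname",
        "traces": "host.name",
        "errors": "server_name",
    },
}


def translate_tag(canonical, value, tag_map=TAG_MAP):
    """Return {source: (source_tag_name, value)} for one canonical tag."""
    names = tag_map.get(canonical)
    if names is None:
        raise KeyError(f"unknown canonical tag: {canonical}")
    return {source: (name, value) for source, name in names.items()}


# Clicking "host = web-01" on a metrics chart yields one filter per source.
filters = translate_tag("host", "web-01")
```

With a mapping like this, the UI only needs the canonical tag and its value to build the equivalent filter for errors, logs, and traces.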
Proposal
- Each project defines its metrics and a canonical dashboard snippet (say, 3-9 charts).
- Stored as a JSON blob. Maybe in the repo, for version control and easy import/export.
- These metrics and charts are then shared group or instance wide.
- SLI, SLO and SLA thresholds could also be defined.
- Each project also defines its tags, providing descriptions as well as the names of the tag across each source (in the event they are different).
- The project can then opt to share these instance wide, group wide, or not at all. (Maybe based on project permissions?)
- Could also consider building a Grafana JSON importer
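One possible shape for such a JSON blob is sketched below. Every field name here (`project`, `sharing`, `metrics`, `charts`, `slo`), the example metric, and the example query are assumptions made for illustration; the proposal does not define a schema yet.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Metric:
    name: str
    description: str
    unit: str = ""


@dataclass
class Chart:
    title: str
    query: str  # assumed to be a PromQL expression


@dataclass
class ObservabilityModel:
    project: str
    sharing: str = "none"  # hypothetical values: "instance", "group", "none"
    metrics: list = field(default_factory=list)
    charts: list = field(default_factory=list)
    slo: dict = field(default_factory=dict)

    def to_json(self):
        # Stable key order keeps diffs small when the file lives in the repo.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, blob):
        raw = json.loads(blob)
        return cls(
            project=raw["project"],
            sharing=raw.get("sharing", "none"),
            metrics=[Metric(**m) for m in raw.get("metrics", [])],
            charts=[Chart(**c) for c in raw.get("charts", [])],
            slo=raw.get("slo", {}),
        )


model = ObservabilityModel(
    project="payments",
    sharing="group",
    metrics=[Metric("transactions_total", "Processed transactions by state", "count")],
    charts=[Chart("Failed transactions",
                  'sum(rate(transactions_total{state="failed"}[5m]))')],
    slo={"error_rate_max": 0.01},
)

# Export to JSON and import it again, as another project or group would.
roundtrip = ObservabilityModel.from_json(model.to_json())
```

Keeping the model as plain data makes import/export and version control straightforward, and a Grafana JSON importer would only need to produce the same structure.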
Many other parts of GitLab could then leverage this data:
- Service Map could use it to render relevant info for each microservice
- Cross linking between sources would be easier
- We could do validation and alerting on unknown tags, to try and encourage following the definitions (and avoid bad behavior)
- Importing metrics from other projects, including public open source projects like PostgreSQL and Redis, could be as easy as vendoring in a file, or potentially importing a Grafana dashboard.
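The validation of unknown tags mentioned above could be as simple as a set difference between the tags a project declared and the tags actually observed. The sketch below assumes both are available as plain collections of tag names; `find_unknown_tags` and the sample tags are hypothetical.

```python
def find_unknown_tags(defined_tags, observed_tags):
    """Return observed tag names that the project never declared, sorted."""
    return sorted(set(observed_tags) - set(defined_tags))


defined = {"host", "state", "region"}
observed = ["host", "state", "hostnam", "region"]  # "hostnam" is a typo

# Candidates for a warning or alert.
unknown = find_unknown_tags(defined, observed)
```

Anything returned here could be surfaced as a warning, which nudges teams toward the declared definitions without blocking ingestion.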