Analyze Grafana capabilities to gather metrics, traces, and logs on a single dashboard
Context
After adding Pipeline Observability capabilities and creating and extracting new metrics as part of &889 (closed), we are looking for the best visualization tool for the data we now produce. It should let us understand our metrics and traces and how they correlate, and let us drill down into the logs, metrics, and traces around a particular event so that we can better identify culprits and bottlenecks (and take action to address them).
Problem
Currently, we are extracting metrics that are stored in Infrastructure Prometheus, and we are querying these metrics through the Infrastructure Grafana installation (e.g.: https://dashboards.gitlab.net/d/delivery-release_management_toil/delivery-release-management-toil?orgId=1).
In addition, we are extracting traces from our pipelines and exporting them to the GitLab Observability backend in OpenTelemetry format. These traces are currently analyzed using the Jaeger installation that the GitLab Observability team offers.
Logs, by contrast, are stored in Elasticsearch (e.g.: https://nonprod-log.gitlab.net/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(),filters:!(),hideChart:!f,index:b35d9ca0-6c67-11eb-968b-c18082d502f4,interval:auto,query:(language:kuery,query:''),sort:!(!(time,desc)))).
We are using three different tools: one for metrics, one for traces, and one for logs. As a result, analyzing a particular event takes much more effort: we have to work out the relevant timeframe, query each tool separately, and correlate the metrics, logs, and traces by hand, which is a very demanding activity.
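To illustrate the manual correlation described above, here is a minimal sketch of how a single time window taken from one trace span could be reused to query both the metrics and the logs backends. It is not an existing Delivery tool. The `Span` class, the helper names, the one-minute padding, and the `time` log field are all assumptions made for this example. Only the Prometheus `query_range` parameter names (`query`, `start`, `end`, `step`) and the Elasticsearch `range` query shape are standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified view of one pipeline trace span."""
    name: str
    start: datetime
    end: datetime


def correlation_window(span, padding=timedelta(minutes=1)):
    """Return (start, end) covering the span plus padding on both sides."""
    return span.start - padding, span.end + padding


def prometheus_range_params(query, window, step="30s"):
    """Build query parameters for Prometheus' /api/v1/query_range endpoint."""
    start, end = window
    return {
        "query": query,
        "start": start.timestamp(),  # Unix timestamp in seconds
        "end": end.timestamp(),
        "step": step,
    }


def elasticsearch_range_filter(window, time_field="time"):
    """Build an Elasticsearch range clause; the field name is an assumption."""
    start, end = window
    return {
        "range": {
            time_field: {"gte": start.isoformat(), "lte": end.isoformat()}
        }
    }


# Example: a five-minute span from a hypothetical deployment job.
span = Span(
    name="deploy",
    start=datetime(2023, 1, 10, 12, 0, tzinfo=timezone.utc),
    end=datetime(2023, 1, 10, 12, 5, tzinfo=timezone.utc),
)
window = correlation_window(span)
metrics_params = prometheus_range_params("up", window)
logs_filter = elasticsearch_range_filter(window)
```

Today this window has to be carried between three UIs by hand. A single Grafana dashboard would ideally derive it once from the selected trace and apply it to every panel.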
Ideal Outcome
Grafana is the de facto tool used at GitLab to visualize metrics and create dashboards. A Grafana-based solution that allows metrics, logs, and traces to be correlated in a single place would be of high value to the team and to Release Managers.
There is a wide ecosystem of tools around Grafana that would allow us to have everything in one place. Some examples (from a very quick search):
- https://grafana.com/oss/tempo/
- https://grafana.com/docs/grafana/latest/datasources/elasticsearch/
- https://grafana.com/grafana/plugins/grafana-clickhouse-datasource/
- https://github.com/metrico/qryn
Other resources:
- Blog Post on How to successfully correlate metrics, logs, and traces in Grafana
- New in Grafana 9.1: Link between traces and metrics
Goal
- Evaluate the tooling available to provide a one-stop shop for metrics, traces, and logs coming from Delivery pipelines
- Discuss the various solutions with the team
- Decide on a solution that fits Delivery's needs and document it in an issue
- Compare with the offering of GitLab Observability