Grafana apdex tracking does not agree with logs in Kibana

The Goal

I'm trying to improve the apdex score of the background jobs maintained by group::pipeline execution, and I'm using this Grafana table to track the current performance of the various workers:

[Screenshot: Grafana table of per-worker apdex scores and error rates]

The Problem

On the back of a napkin, I calculate roughly 35,000 apdex violations for Ci::BuildTraceChunkFlushWorker in the past 24h. Since the error rate is 0.0%, each of those 35k jobs must have violated either the queueing or the execution duration SLO.
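For reference, here's the napkin arithmetic as a minimal sketch. The total job count and apdex score below are hypothetical placeholders (the real values are in the Grafana table above); the only point is that with a 0.0% error rate, violations ≈ (1 − apdex) × total jobs.

```python
# Back-of-the-napkin estimate of apdex violations from a dashboard-style
# apdex ratio. The numbers are hypothetical placeholders, not the real
# figures from the Grafana table.

total_jobs = 7_000_000   # jobs executed by the worker in the past 24h (assumed)
apdex = 0.995            # apdex score shown on the dashboard (assumed)

# With a 0.0% error rate, every non-satisfactory job must have breached
# either the queueing or the execution duration SLO.
violations = round(total_jobs * (1 - apdex))
print(f"~{violations:,} jobs violating a latency SLO")  # ~35,000 here
```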

So I hop over to Kibana to track down those jobs:

[Screenshot: Kibana search of Sidekiq job logs for Ci::BuildTraceChunkFlushWorker]

and I come up with exactly one absolutely brutal queue time, for a single job, in the last 24h. Nothing else, though!
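For context, this is roughly the kind of search I'm running against the Sidekiq logs, expressed as a raw Elasticsearch query. The endpoint, index pattern, field names, and SLO thresholds are all assumptions on my part, not necessarily what our Kibana setup actually uses:

```python
# Sketch of the log search: jobs for this worker in the last 24h whose
# queueing or execution duration exceeds an (assumed) SLO threshold.
import requests

ES_URL = "https://elasticsearch.example.com"  # hypothetical endpoint
INDEX = "sidekiq-logs-*"                      # hypothetical index pattern

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"json.class": "Ci::BuildTraceChunkFlushWorker"}},
                {"range": {"json.time": {"gte": "now-24h"}}},
            ],
            # A job counts as an apdex violation if either duration breaches
            # its SLO (thresholds assumed here: 10s queueing, 300s execution).
            "should": [
                {"range": {"json.queue_duration_s": {"gt": 10}}},
                {"range": {"json.duration_s": {"gt": 300}}},
            ],
            "minimum_should_match": 1,
        }
    },
    "size": 0,  # only the hit count matters
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=30)
print(resp.json()["hits"]["total"])  # Grafana implies ~35k; Kibana shows ~1
```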

So what's the truth?

A quick chat with @cmcfarland and @mattmi told me that what Kibana shows is a very straightforward copy of the actual log files, not aggregated or transformed in any special way. Grafana, on the other hand, renders aggregations of aggregations, so figuring out where its figures come from is less straightforward, and I don't really know how to dig into that.
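My rough mental model of the Grafana side, which someone should correct if it's wrong: individual job timings are first collapsed into pre-aggregated satisfactory/total counters, and the dashboard then re-aggregates those counters over its time window and only ever shows the resulting ratio. A toy sketch of that pipeline, with made-up durations and a made-up threshold:

```python
# Toy model of "aggregations of aggregations": raw job durations are first
# collapsed into per-minute counters, and the dashboard only sees ratios of
# those counters. Durations and the 10s threshold are made up.
from collections import defaultdict

THRESHOLD_S = 10  # assumed apdex "satisfactory" threshold

# (minute, duration_s) per job -- this is what Kibana shows raw
jobs = [(0, 0.4), (0, 0.7), (1, 12.0), (1, 0.3), (2, 0.5)]

# First aggregation: per-minute satisfactory/total counters
satisfactory = defaultdict(int)
total = defaultdict(int)
for minute, duration in jobs:
    total[minute] += 1
    if duration <= THRESHOLD_S:
        satisfactory[minute] += 1

# Second aggregation: the dashboard sums counters over its window and
# reports only the ratio -- the individual slow jobs are no longer visible.
apdex = sum(satisfactory.values()) / sum(total.values())
print(f"apdex over window: {apdex:.2%}")  # 80.00% in this toy example
```

If something in that pipeline (thresholds, label matching, re-aggregation) differs from what the raw logs record, that could explain the 35k-vs-1 discrepancy, but I have no way to verify it from the dashboard alone.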

I'm concerned that inaccurate data is being used to populate the Grafana SLO dashboards that our team uses to prioritize infradev and maintenance::performance work. If anyone has any answers, or is willing to help me investigate, I would appreciate it!