Collection of observability problems in Thanos

Over the past few weeks, we've had several reports of people not being able to load data on dashboards. The causes are often different and hard to track down. The common themes of the questions we get are:

1. Gaps in metrics

Most of the metrics that consumers of the Error Budgets For Stage Groups dashboards see are recordings. Users often notice gaps in the metrics they're viewing, which appear as gaps in the recordings shown in panels. For example:

(screenshot: panel showing a gap in the recordings; the original issue links to the source query)

Here we can see a gap in the recordings for the `stageGroupSLIs` aggregation set.
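As a quick way to confirm whether a gap is in the recording itself rather than in Grafana's rendering, a query like the following can be run directly against Thanos. The metric name and labels below are made-up stand-ins for one of the actual `stageGroupSLIs` recordings:

```promql
# Hypothetical recording name -- substitute one of the real
# stageGroupSLIs recordings. Returns 1 at every evaluation
# timestamp where no samples existed in the preceding 5 minutes,
# i.e. exactly where the panel shows a gap.
absent_over_time(
  gitlab:component:stage_group:ops:rate_5m{stage_group="pipeline_execution"}[5m]
)
```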

2. Panels not loading

Some panels fail to load and show errors. Sometimes this concerns high-cardinality metrics and ranges that are too big to load. Often, however, this is not the case: loading the same metrics directly in Thanos works just fine.

This is hard to reproduce; the error reported in Grafana is this one:

```html
<html><head> <meta http-equiv="content-type" content="text/html;charset=utf-8"> <title>502 Server Error</title> </head> <body text=#000000 bgcolor=#ffffff> <h1>Error: Server Error</h1> <h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2> <h2></h2> </body></html>
```

Subsequently loading one of the graphs directly in Thanos seems to work. I think this could be caused by an internal rate limit being hit: it seems to happen more often when expanding an SLI's details row, which tries to load 7 panels at once. A couple of refreshes seem to fix this.

I've also anecdotally seen that cancelling a tamland run, when one was running simultaneously, helps.

3. Incorrect recordings visible on dashboards

Partly discovered in scalability#1702

I've seen that recordings sometimes become incorrect. I suspect this is because we have to use `partial_response_strategy: warn` in our recording rules for Thanos, which causes some metrics to jump around. An example is visible in Thanos.

This graph is the bottom graph in the Thanos link. All lines should be equal: they are the sources of the recording rules. But we can see that the `aggregation="stage_group"` line jumps around, dropping below the sources.

(screenshot: the bottom graph from the Thanos link, showing the dropping line)
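For reference, `partial_response_strategy` is configured per rule group in Thanos Ruler. A minimal sketch of what such a group looks like (group, rule, and metric names here are illustrative, not our actual rules):

```yaml
groups:
  - name: stage-group-slis
    interval: 1m
    # "warn": evaluate the rule even when some store APIs fail to
    # respond, only logging a warning -- this is what lets partially
    # seen data produce recordings that "jump around".
    # "abort": fail the evaluation instead, leaving a gap rather than
    # a wrong value.
    partial_response_strategy: warn
    rules:
      - record: gitlab:stage_group:sli:rate_5m
        expr: sum by (stage_group) (rate(source_requests_total[5m]))
```

The trade-off is between gaps (`abort`) and occasionally incorrect values (`warn`).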


Ideas

  1. Besides the problems reported by users, it is hard to look back afterwards at the metrics to see what happened. I believe that &630 would go a long way in getting on top of this.

  2. I also think having separate thanos-query frontends could help (scalability#1996): we could separate automated workloads (tamland) out. Having multiple thanos-query deployments dedicated to serving Grafana could also help.

  3. Reduce cardinality of metrics (&330). The highest-cardinality metrics aren't the ones that we're loading in dashboards; those tend to already be aggregations. So we'd mostly be removing metrics from the application that we aren't really using for monitoring. Would reducing cardinality help query performance? Would it help with gaps in recording rules? How will we deal with inevitably adding more series again as we grow?

  4. Only use globally aggregated series (from thanos-rule) for displaying on dashboards: right now the SLI details rows use aggregations from Prometheus. We could push this one more level up and add recordings for all of them using `monitor='global'` to display on dashboards.
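On idea 3: one way to identify which metrics would be the targets of cardinality reduction is the standard top-k series-count query. Note that this is itself an expensive query, so it's best run against a single Prometheus rather than globally through Thanos:

```promql
# Ten metric names with the most series in this Prometheus.
# {__name__=~".+"} matches every series, so scope and limit carefully.
topk(10, count by (__name__) ({__name__=~".+"}))
```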
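On idea 4: the extra aggregation level could look roughly like this in thanos-rule. This is a sketch only, with hypothetical rule and label names; the real stageGroupSLIs recording names would be used instead:

```yaml
groups:
  - name: stage-group-slis-global
    partial_response_strategy: warn
    rules:
      # Re-aggregate the per-Prometheus recordings into a single
      # globally-aggregated series labelled monitor="global", so that
      # dashboards never merge per-monitor series at query time.
      - record: gitlab:stage_group:sli:rate_5m
        labels:
          monitor: global
        expr: >
          sum by (stage_group, component) (
            gitlab:stage_group:sli:rate_5m{monitor!="global"}
          )
```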

Edited by Bob Van Landuyt