Evaluate retention policy for Kibana production logs (requests)

Context

Once in a while a team finds itself in a situation where its error budget is exceeded and the breach requires investigation. Grafana / Prometheus metrics are not very helpful here: by design they are high-level aggregates (and very slow over longer time ranges) without much detail, so we usually resort to Kibana, which tracks each request. This works fine for checking the immediate or very recent past, but the retention policy is only 7 days, making it impossible to compare performance across larger timeframes - monthly, quarterly, or even year over year. In my experience, that longer view is crucial to understand changes in performance and to find possible regressions, including the date when they happened. With just 7 days, we can only do this if we check almost immediately, and sometimes the change is not traceable or visible in such a short period.

I understand the reasons why we have such a strict policy in place - we generate millions of records per hour, so the cost and performance impact on the ELK stack is huge. But if we enforce error budgets, we need proper tooling so that teams can profile their endpoints, and right now I think that ability is severely limited.

Expectation

I think that to properly address any error budget breach, we need a way to track the performance of an endpoint over time and to see its history at least a few months back, ideally years. This data doesn't have to be very granular; it should be enough to know the endpoint and the min/median/max/p90/p99 durations (full, db, redis, gitaly, view, ...) and counts (db, redis) at a given resolution (daily?).
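As a sketch of the kind of rollup this could produce - field names (`caller_id`, `date`, `duration_s`) are assumptions standing in for whatever the request logs actually export:

```python
import math
import statistics
from collections import defaultdict

# Hypothetical raw request records, as they might be exported from Kibana.
requests = [
    {"caller_id": "ProjectsController#show", "date": "2023-05-01", "duration_s": 0.42},
    {"caller_id": "ProjectsController#show", "date": "2023-05-01", "duration_s": 1.35},
    {"caller_id": "GraphqlController#execute", "date": "2023-05-01", "duration_s": 0.08},
]

def pct(sorted_vals, p):
    """Nearest-rank percentile of an already-sorted list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]

def daily_rollup(records):
    """Collapse raw request records into one summary per (caller_id, day)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["caller_id"], r["date"])].append(r["duration_s"])
    summaries = {}
    for key, durations in buckets.items():
        durations.sort()
        summaries[key] = {
            "count": len(durations),
            "min": durations[0],
            "median": statistics.median(durations),
            "p90": pct(durations, 90),
            "p99": pct(durations, 99),
            "max": durations[-1],
        }
    return summaries
```

One summary row per endpoint per day is tiny compared to millions of raw records per hour, which is what makes multi-month retention of these aggregates plausible.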

Without this, we're really limiting how we can respond to degraded performance, because we only have the current state of things (the last 7 days). If a problem is not caught immediately, we have no way of detecting when the performance drop happened, or whether it's periodic/recurring or follows some other pattern.

💡 Ideas

Rollup/downsample

Instead of moving data from the ELK stack to other storage that is not easily accessible, we could aggregate the data and store it in a more compact form. ELK itself seems to offer several solutions for this:

  1. Rollup: https://www.elastic.co/guide/en/elasticsearch/reference/current/rollup-overview.html (seems to be technical preview)
  2. Downsampling: https://www.elastic.co/guide/en/elasticsearch/reference/current/downsampling.html
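For the downsampling route, the linked docs describe a `downsample` action that can run inside an ILM phase (on time series data streams). A minimal sketch of such a policy - the phase timings, index lifecycle, and one-year retention are assumptions, and the exact fields should be checked against our cluster version:

```python
import json

# Hypothetical ILM policy: keep raw documents for 7 days, then replace the
# backing index with 1-day downsampled summaries and retain those for a year.
# Structure follows the Elasticsearch ILM docs linked above; numbers are
# illustrative, not a recommendation.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d"},
                },
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "downsample": {"fixed_interval": "1d"},
                },
            },
            "delete": {
                "min_age": "365d",
                "actions": {"delete": {}},
            },
        }
    }
}

print(json.dumps(policy, indent=2))
```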

Other aggregates

Aggregate, but instead of using built-in ELK capabilities, do it ourselves, pushing the data to some other storage (or even a different index within ELK). This could work if we're not happy with what ELK provides or are unable to use it (version mismatch etc.).
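A do-it-yourself variant could periodically compute the summaries and write them to a separate index via the `_bulk` API. A sketch of building the NDJSON payload - the index name and document shape are assumptions:

```python
import json

def to_bulk_ndjson(summaries, index="requests-rollup-daily"):
    """Serialize rollup documents into the NDJSON body expected by the
    Elasticsearch _bulk API: an action line followed by the document line."""
    lines = []
    for doc in summaries:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

body = to_bulk_ndjson([
    {"caller_id": "ProjectsController#show", "date": "2023-05-01",
     "count": 2, "p99_duration_s": 1.35},
])
```

Because the rollup documents are just ordinary documents in an ordinary index, this sidesteps any version requirements of the built-in rollup/downsampling features.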

Separate retention policies

Remove part of the (less useful) data while keeping most of it for longer (or split indices based on some dimension). I did a quick check over a 7-day period; this is the distribution of data per "caller_id", which we can assume maps to endpoint/controller:

(image: request volume per caller_id over the 7-day sample)

This indicates that 60% of our requests are contributed by just 9 endpoints, many of them GraphQL endpoints (which we filter out in Kibana queries when profiling anyway). We could either split the dataset and keep the "other" records under a less strict retention policy, or apply different retention policies to these records within the same index (not sure the latter is possible; I haven't done it myself).

This unfortunately penalizes the top 9 endpoints, but as mentioned, we often filter them out anyway, so it would improve the common use case.
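If splitting the dataset turns out to be the practical option, one way is to route documents to differently-retained index families at ingest time, with each family under its own retention/ILM policy. A sketch - the caller set and index naming are purely hypothetical:

```python
# Hypothetical stand-in for the ~9 high-volume caller_ids from the chart above.
HIGH_VOLUME_CALLERS = {
    "GraphqlController#execute",
    "ProjectsController#show",
}

def target_index(caller_id, day):
    """Route high-volume callers to a short-retention index family and
    everything else to a long-retention one; retention is then configured
    per index pattern rather than per document."""
    family = "requests-short" if caller_id in HIGH_VOLUME_CALLERS else "requests-long"
    return f"{family}-{day}"
```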

Edited by Kamil Niechajewicz