Investigate regular thanos-store OOMs in gprd

Problem

Since fixing the thanos-store / memcached integration (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12949), we've had regular alerts for elevated error rates from grafana / thanos-query. See production#4074 (closed) for the work up to this point.

These alerts are often correlated with OOM storms in the gprd thanos-store pods.

In each environment, we're running thanos-store using hashmod sharding, with 2 replicas per shard for redundancy. We run 2 memcached clusters per environment for the stores: one as a "chunk bucket", one as an index cache. The clusters are shared across all store shards.
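For reference, hashmod sharding for the store gateway is typically expressed through its `--selector.relabel-config` flag, hashing each block ID into a fixed number of buckets. A minimal sketch for one shard (the modulus and shard index here are illustrative, not our actual values):

```yaml
# Hypothetical selector relabel config for a single store shard.
# __block_id is hashed into `modulus` buckets; this shard keeps bucket 0.
- action: hashmod
  source_labels: ["__block_id"]
  target_label: shard
  modulus: 2            # total number of shards (illustrative)
- action: keep
  source_labels: ["shard"]
  regex: "0"            # this replica pair serves shard 0
```

Both replicas of a shard run the same selector, and all shards share the two memcached clusters via the store's index-cache and caching-bucket configuration.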

We've scaled the store shards and the memcached clusters up and out substantially, but we still encounter huge memory surges that cause store pods to OOM. I'm not 100% sure yet, but these surges appear to be due to requests from thanos-query. Some nuggets from a thread I started on the thanos slack channel (https://cloud-native.slack.com/archives/CK5RSSC10/p1617021433108300):
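If the surges really are driven by individual heavy Series requests from thanos-query, one mitigation worth evaluating is the store gateway's per-request sample limit, which rejects oversized requests instead of buffering them into memory. A hedged sketch (the limit value is illustrative and would need tuning):

```shell
# Illustrative: cap samples returned per Series call so one huge query
# fails fast instead of OOMing the store pod. 0 (the default) = unlimited.
thanos store \
  --store.grpc.series-sample-limit=50000000 \
  ...
```

This trades some query failures for pod stability, so any limit would need to be checked against the query shapes of our real grafana dashboards.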

Desired outcome

thanos-store pods are stable, and our grafana latency is improved.

Acceptance criteria

  • ...