Investigate regular thanos-store OOMs in gprd
Problem
Since fixing thanos-store / memcached integration (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12949), we've had regular alerts for elevated error rates from grafana / thanos-query. See production#4074 (closed) for work up to this point.
These are often correlated with OOM storms in the gprd thanos-store pods:
- query: https://thanos.gitlab.net/graph?g0.range_input=6h&g0.max_source_resolution=0s&g0.expr=sum%20by%20(pod)%20(container_memory_usage_bytes%7Benv%3D%22gprd%22%2C%20container%3D%22thanos-store%22%7D)&g0.tab=0
- this can be hard to catch in Thanos itself (the store pods serving the query path are the ones OOMing), so recent occurrences should be investigated directly in https://prometheus-gke.gprd.gitlab.net.
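For reference, the URL-encoded expression in the query link above decodes to:

```promql
sum by (pod) (container_memory_usage_bytes{env="gprd", container="thanos-store"})
```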
In each environment, we're running thanos-store using hashmod sharding, with 2 replicas per shard for redundancy. We run 2 memcached clusters per environment for the stores: one as a "chunk bucket", one as an index cache. The clusters are shared across all store shards.
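As a rough sketch of how that sharding is wired up (illustrative only; the modulus, shard index, and file paths below are placeholders, not our actual config), thanos-store hashmod sharding is driven by `--selector.relabel-config`, which hashes each block's ULID into a shard and keeps only the blocks belonging to this replica pair:

```shell
# Illustrative 4-way hashmod shard; this replica pair serves shard 2.
# Real shard count, paths, and objstore config differ in our deployment.
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --selector.relabel-config='
    - action: hashmod
      source_labels: ["__block_id"]
      target_label: shard
      modulus: 4
    - action: keep
      source_labels: ["shard"]
      regex: "2"'
```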
We've scaled the store shards and the memcached clusters up and out substantially, but still encounter huge memory surges that cause store pods to OOM. I'm not 100% sure yet, but these surges appear to be due to requests from thanos-query. Some nuggets from a thread I started on the thanos slack channel (https://cloud-native.slack.com/archives/CK5RSSC10/p1617021433108300):
- Store essentially only downloads data from remote object storage & streams it to Thanos Query, so it's not hard to list all of the main sources of RAM consumption. One thing is: https://github.com/thanos-io/thanos/blob/main/docs/operating/binary-index-header.md#impact-on-cpu-memory-and-disk.
- memcached helps with series/postings lookups, but it's not the only thing using RAM.
- 😄 As the link above says, Store could suddenly load many more postings offsets from the binary index headers on disk. It depends on your queries & what has been queried before: https://github.com/thanos-io/thanos/blob/90015469cc81e7532a995527aa903262988d0199/pkg/block/indexheader/binary_reader.go#L559-L560

Settings that we might be interested in tuning (see https://thanos.io/v0.18/components/store.md/):

- store.grpc.series-sample-limit
- series-max-concurrency
- max-query-parallelism
- query.max-concurrent
- query.max-concurrent-select
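If we go down the tuning route, the settings above map onto CLI flags roughly as follows. The values here are placeholders for illustration, not recommendations; real limits would need to be derived from observed query load. (`max-query-parallelism` is a query-frontend setting rather than a store or query flag, so it's omitted here.)

```shell
# Placeholder values only: cap per-request samples and concurrent Series
# calls on the store, and concurrent queries on thanos-query.
thanos store \
  --store.grpc.series-sample-limit=120000 \
  --store.grpc.series-max-concurrency=20

thanos query \
  --query.max-concurrent=20 \
  --query.max-concurrent-select=4
```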
Desired outcome
thanos-store pods are stable, and our grafana latency is improved.
Acceptance criteria
-
...