Audit Thanos resources and OOMKills

We need to run a fresh audit of our Thanos resource usage and optimize where possible.

Various Thanos components are also frequently OOMKilled when they serve a large query.
Ideally we would prevent OOMKills in our pods where possible, as each one can have a cascading effect on subsequent queries while the system recovers.
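
As a starting point for the audit, here is a rough sketch of how recent OOMKills could be listed per namespace. It assumes `kubectl` is configured against the ops cluster and relies on the standard pod status field `lastState.terminated.reason`. The function names (`find_oomkilled`, `fetch_pods`, `main`) are made up for illustration and are not part of any existing tooling.

```python
import json
import subprocess

# Namespaces taken from this issue: production and staging.
NAMESPACES = ["thanos", "thanos-staging"]


def find_oomkilled(pod_list):
    """Return (namespace, pod, container, restarts) tuples for containers
    whose most recent termination reason was OOMKilled.

    `pod_list` is the parsed JSON of `kubectl get pods -o json`.
    """
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            terminated = (cs.get("lastState") or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append(
                    (
                        meta.get("namespace"),
                        meta.get("name"),
                        cs.get("name"),
                        cs.get("restartCount", 0),
                    )
                )
    return hits


def fetch_pods(namespace):
    """Fetch all pods in a namespace as parsed JSON via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def main():
    """Print one tab-separated line per OOMKilled container."""
    for ns in NAMESPACES:
        for hit in find_oomkilled(fetch_pods(ns)):
            print(*hit, sep="\t")
```

Calling `main()` prints one line per affected container. This only shows each container's last termination, so it undercounts repeated OOMKills; a longer-term view would need metrics or events instead.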

We currently have two environments for Thanos. Both live in the ops cluster due to networking requirements, but they are separated by namespace.

production: namespace: thanos.
staging: namespace: thanos-staging.

Staging currently shares store gateways with production. Store gateways can consume a large amount of resources, and we do use staging for testing production workloads, so rather than duplicating the store gateways and their resource requirements in staging, the staging query components use the production gateways for now.

Details

Point of contact for this request: @user

SRE Support Needed Support Request Details

Edited Oct 13, 2023 by Nick Duff