Audit thanos resources and OOMKills
We need to run a fresh audit of our thanos resource usage and optimize where possible.
There are also frequent cases where various thanos components are OOMKilled due to a large query.
Ideally we would like to prevent OOMKills in our pods where possible, as an OOMKill can have a cascading effect on subsequent queries while the system recovers.
We currently have two environments for thanos. Both live in the ops
cluster due to networking requirements, but are separated by namespace.
- production: namespace thanos
- staging: namespace thanos-staging
Staging currently shares storegateways with production. Storegateways can consume a large amount of resources, and we do leverage staging for testing production workloads, so rather than duplicating the storegateways and their resource requirements in staging, the staging query components use the production gateways for now.
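As a possible starting point for the audit, here is a minimal sketch that lists containers whose most recent termination was an OOMKill in the two namespaces above. It assumes kubectl is on the PATH and already pointed at the ops cluster. The function names are hypothetical, not part of any existing tooling. It relies on the standard pod status field lastState.terminated.reason, which only records the latest previous termination, so older OOMKills will not appear.

```python
import json
import subprocess

# Namespaces taken from the description above.
NAMESPACES = ("thanos", "thanos-staging")


def find_oomkilled(pod_list):
    """Return (pod, container, restart_count) tuples for containers
    whose last recorded termination reason was OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        # Pods that have not started yet may have no containerStatuses.
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                hits.append(
                    (pod_name, status["name"], status.get("restartCount", 0))
                )
    return hits


def audit_namespace(namespace):
    """Fetch pods via kubectl (assumed to be configured for the ops
    cluster) and return the OOMKilled containers in that namespace."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return find_oomkilled(json.loads(result.stdout))
```

Calling audit_namespace(ns) for each entry in NAMESPACES gives a quick per-environment list of affected pods and containers. Because staging query components use the production storegateways, a storegateway OOMKill found in the thanos namespace may have been triggered by a query from either environment.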
Details
- Point of contact for this request: @user
- If a call is needed, what is the proposed date and time of the call: Date and Time
- Additional call details (format, type of call): additional details
SRE Support Needed Support Request Details