Audit Prometheus instance resources
We have recently observed a few cases of Prometheus instances terminating unexpectedly.
Let's look into the resources configured for Prometheus, ensure they are optimal for the workloads, and additionally make sure our liveness/readiness probes are configured appropriately.
For example, when a gprd Prometheus instance is terminated, it can take 15-20 minutes for it to replay its WAL and become functional again. Tuning our probes so that instances are terminated only when absolutely necessary would help prevent unnecessary restarts.
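As a starting point for the probe discussion, here is a rough sketch of how a startup probe budget could be derived from the worst-case WAL replay time. It assumes the instances run on Kubernetes and expose Prometheus's `/-/ready` endpoint on the default port 9090. The function name, the 15s period, and the 1.5x headroom are hypothetical values for illustration, not our current configuration.

```python
import math


def startup_probe(worst_case_replay_s: int, period_s: int = 15, headroom: float = 1.5) -> dict:
    """Build a Kubernetes startupProbe spec that tolerates a slow WAL replay.

    All defaults are illustrative assumptions, not values from our charts.
    """
    # Total time we are willing to wait before the kubelet restarts the container.
    budget_s = worst_case_replay_s * headroom
    return {
        # /-/ready returns 200 only once Prometheus can serve traffic.
        "httpGet": {"path": "/-/ready", "port": 9090},
        "periodSeconds": period_s,
        # The kubelet gives up after failureThreshold * periodSeconds.
        "failureThreshold": math.ceil(budget_s / period_s),
    }


# 20 minutes is the upper end of the replay time observed on gprd.
probe = startup_probe(worst_case_replay_s=20 * 60)
print(probe["failureThreshold"])  # 120, i.e. a 30 minute budget at 15s intervals
```

Liveness and readiness probes only begin once the startup probe has succeeded, so a budget like this would keep a long WAL replay from being treated as a liveness failure while leaving the liveness probe itself strict.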
Details
- Point of contact for this request: @rnaveiras
- If a call is needed, what is the proposed date and time of the call: Date and Time
- Additional call details (format, type of call): additional details
SRE Support Needed Support Request Details
Edited by Raúl Naveiras