Audit Prometheus instance resources

We have recently observed several Prometheus instances terminating unexpectedly.

Let's look into the resources configured for Prometheus, ensure they are appropriate for the workloads, and additionally make sure our liveness/readiness probes are configured appropriately.

For the gprd Prometheus instances, for example, a terminated instance can take 15-20 minutes to replay its WAL and become functional again. Tuning our probes so that instances are terminated only when absolutely necessary would help prevent unnecessary restarts.
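
As a rough starting point for the probe review, here is a minimal sketch of the arithmetic. It assumes we would cover WAL replay with a Kubernetes startup probe, which allows roughly failureThreshold × periodSeconds before the container is restarted. The function name, the 15-second period, and the 1.5× safety factor are illustrative assumptions, not values taken from our current configuration.

```python
import math


def startup_failure_threshold(max_replay_seconds, period_seconds, safety_factor=1.5):
    """Return the smallest failureThreshold such that
    failureThreshold * period_seconds covers the WAL replay time
    multiplied by a safety factor (hypothetical helper)."""
    budget_seconds = max_replay_seconds * safety_factor
    return math.ceil(budget_seconds / period_seconds)


# Worst case from this issue: 20 minutes of WAL replay.
# The 15s probe period is an assumed value for illustration.
threshold = startup_failure_threshold(max_replay_seconds=20 * 60, period_seconds=15)
print(threshold)  # 120, i.e. a 30-minute startup budget
```

With these assumed inputs the probe would tolerate 30 minutes of startup before a restart. The real period and margin should come from the replay times we actually observe on gprd.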

Details

Point of contact for this request: @rnaveiras
If a call is needed, what is the proposed date and time of the call: Date and Time
Additional call details (format, type of call): additional details

SRE Support Needed Support Request Details

Edited Oct 25, 2023 by Raúl Naveiras