Audit prometheus instance resources

We have observed a few recent events of prometheus instances terminating unexpectedly.

Let's look into the resources configured for prometheus, ensure they are optimal for the workloads, and additionally make sure our liveness/readiness probes are configured appropriately.

In the example of the gprd prometheus instances, a terminated instance can take 15-20 minutes to replay its WAL and become functional again. Optimising our probes so that instances are only terminated when absolutely necessary would help prevent unnecessary restarts.
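As a rough aid for sizing the probes against the 15-20 minute WAL replay, the sketch below computes how many consecutive probe failures would have to be tolerated before a restart. It assumes these instances run on Kubernetes, where a probe's tolerance window is approximately `failureThreshold * periodSeconds`. The 30-second period and 1.5x headroom are illustrative values, not our current configuration, and `min_failure_threshold` is a hypothetical helper.

```python
import math


def min_failure_threshold(replay_seconds: int, period_seconds: int,
                          headroom: float = 1.5) -> int:
    """Smallest failureThreshold whose window covers the replay time.

    Assumes the Kubernetes probe window is failureThreshold * periodSeconds.
    `headroom` is an illustrative safety factor on top of the observed replay.
    """
    if replay_seconds <= 0 or period_seconds <= 0 or headroom <= 0:
        raise ValueError("replay_seconds, period_seconds and headroom must be positive")
    return math.ceil(replay_seconds * headroom / period_seconds)


# Worst case from this issue: 20 minutes of WAL replay on gprd.
worst_case_replay = 20 * 60
period = 30  # assumed probe period in seconds

threshold = min_failure_threshold(worst_case_replay, period)
print(f"failureThreshold >= {threshold} (window: {threshold * period // 60} min)")
```

With these assumed values the probe would need to tolerate at least 60 failures, a 30-minute window. Whether that budget belongs on a startup probe or on the liveness probe itself is part of what this audit should decide.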

Details

  • Point of contact for this request: @rnaveiras
  • If a call is needed, what is the proposed date and time of the call: Date and Time
  • Additional call details (format, type of call): additional details

SRE Support Needed Support Request Details

Edited by Raúl Naveiras