Increase Prometheus node resources
Summary
Prometheus pods are frequently OOMKilled due to resource constraints. WAL replays then take too long, which often leads to missing metrics because both pods are affected.
Related Incident(s)
Originating issue(s): production#6408 (closed)
Desired Outcome/Acceptance criteria
Upgrade the nodes that Prometheus is allocated to.
Associated Services
- Prometheus
Corrective Action Issue Checklist
- link the incident(s) this corrective action arose out of
- give context for what problem this corrective action is trying to prevent from re-occurring
- assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- assign a priority (this will default to 'priority::4')
Edited by Steve Xuereb