Corrective action: The Horizontal Pod Autoscaler Desired Replicas resource of the sidekiq service (main stage) has a saturation exceeding SLO and is close to its capacity limit.
Summary
The Catchall deployment is hitting its HPA replica limit, or running close to it. We need to either do more work with fewer pods, or allow more pods to do more work.
Related Incident(s)
Originating issue(s): production#6615 (closed)
Desired Outcome/Acceptance Criteria
I think our target should be about 80% of the HPA replica limit during peak hours. While it is easy to raise the HPA replica limit and possibly add more nodes to the node pools, we should first seriously consider tuning the scaling metrics and per-pod CPU allocations to squeeze more work out of the current nodes.
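To make the 80% target concrete, here is a rough sketch of the arithmetic involved. It uses the standard Kubernetes HPA scaling rule (desired replicas = ceil(current replicas × current metric / target metric)) and ignores the HPA tolerance band and stabilization windows. The function names and all numbers are hypothetical illustrations, not values taken from the Catchall deployment.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA scaling rule, ignoring tolerance and stabilization behaviour."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


def hpa_saturation(replicas: int, max_replicas: int) -> float:
    """Fraction of the HPA replica limit currently in use."""
    if max_replicas <= 0:
        raise ValueError("max_replicas must be positive")
    return replicas / max_replicas


def max_replicas_for_target(peak_replicas: int, target_percent: int = 80) -> int:
    """Smallest maxReplicas that keeps peak saturation at or below target_percent."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-peak_replicas * 100 // target_percent)


if __name__ == "__main__":
    # Hypothetical peak: 100 pods at 90% CPU with a 75% CPU target.
    peak = desired_replicas(100, 90, 75)
    limit = max_replicas_for_target(peak)
    print(f"peak desired replicas: {peak}")
    print(f"maxReplicas needed for 80% saturation: {limit}")
    print(f"saturation at that limit: {hpa_saturation(peak, limit):.0%}")
```

The same functions can be used in the other direction: if tuning the scaling metric or CPU requests lowers the peak desired replica count, `hpa_saturation` shows how far that moves us toward the target without raising the limit.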
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'priority::4')