elastic log prod cluster master node not available
Summary
The master nodes of the Elastic production logging cluster were not available.
Timeline
All times UTC.
2020-05-12
- 11:10 - Rails log queue starts growing
- 11:14 - Warning that node 45 was removed from the cluster (shown in the cluster alerts of the monitoring cluster)
- 11:16 - Alert: Production logging cluster is unhealthy
- 11:27 - Incident declared from Slack
- 11:31 - EOC holding back prod deploy until log visibility is restored
- 11:32 - Elastic Support case opened
- 14:22 - PubSub message queue starts decreasing; we are slowly catching up
Details
Symptoms:
- Master nodes not available
- Log queues growing
- ILM failures
- High memory usage on warm nodes
- Node 45 repeatedly removed from and re-added to the cluster since 11:14
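Several of these symptoms (no master, nodes dropping out) can be cross-checked against the Elasticsearch `_cluster/health` endpoint. The following is a minimal sketch, not the tooling used during this incident: the function names `summarize_health` and `fetch_health`, the `expected_nodes` parameter, and the base URL are hypothetical, and it assumes the cluster answers HTTP 503 with a `master_not_discovered_exception` error when no master is elected.

```python
import json
import urllib.error
import urllib.request


def summarize_health(status_code, body, expected_nodes=None):
    """Turn a parsed _cluster/health response into a list of findings.

    expected_nodes is a hypothetical parameter: the node count the
    operator expects to see when the cluster is fully formed.
    """
    findings = []
    if status_code == 503:
        # Assumption: a cluster without an elected master answers 503.
        error = body.get("error")
        error_type = error.get("type") if isinstance(error, dict) else None
        if error_type == "master_not_discovered_exception":
            findings.append("no master node discovered")
        else:
            findings.append("cluster health unavailable (HTTP 503)")
        return findings

    status = body.get("status")
    if status != "green":
        findings.append(f"cluster status is {status}")

    nodes = body.get("number_of_nodes")
    if expected_nodes is not None and nodes is not None and nodes < expected_nodes:
        findings.append(f"{expected_nodes - nodes} node(s) missing from the cluster")

    unassigned = body.get("unassigned_shards", 0)
    if unassigned:
        findings.append(f"{unassigned} unassigned shard(s)")
    return findings


def fetch_health(base_url, timeout=10):
    """Fetch _cluster/health and return (HTTP status, parsed JSON body)."""
    url = base_url.rstrip("/") + "/_cluster/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Error responses still carry a JSON body describing the failure.
        raw = err.read()
        return err.code, json.loads(raw) if raw else {}
```

A caller would pass the result of `fetch_health("http://localhost:9200")` (a placeholder address) to `summarize_health`; an empty list means none of the checked conditions were found. Keeping the parsing separate from the HTTP call makes the checks testable without a live cluster.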
Source
Incident declared by hphilipps in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was used, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Edited by AnthonySandoval