OOM kills in the ES cluster
Summary
Four hot Elasticsearch nodes were OOM killed and held shards from the same indices, so indexing rates dropped for multiple indices until the nodes recovered and the backlog cleared.
Timeline
All times UTC.
2020-05-12
- 07:45 - OOM kills occur on 4 hot nodes across 2 availability zones (AZs)
- 08:10 - ES support notices there's a problem with the cluster
- 08:51 - Incident declared from Slack
- 09:15 - all 4 nodes are back to an operational state, but the cluster is still struggling to recover due to timeouts on the master
- 09:17 - backlog starts to go down rapidly
- 11:02 - backlog is cleared
Details
Four hot nodes were OOM killed and struggled to rejoin the cluster. ES support raised the severity of the support case to 1 and actively worked on the issue.
The nodes were spread across multiple zones, but unfortunately they held shards from the same indices. As a result, many shards within single indices failed at once, which lowered the indexing rate for multiple indices. The worst affected indices were gitaly, workhorse and pages; a sketch of how such shard co-location can be inspected and capped follows below.
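The following is a minimal sketch, not the commands used during the incident: the endpoint and index patterns are assumptions, and the cap value is illustrative. It uses the stock _cat/shards API to show co-location, and the index.routing.allocation.total_shards_per_node setting to limit how many shards of one index a single node may hold.

```python
# Sketch only: endpoint and index patterns are assumptions, not taken
# from the incident. Requires the requests package.
import requests

ES = "http://localhost:9200"  # hypothetical cluster endpoint

# List which node holds each shard of the affected indices, to see how
# many shards of a single index sit on the same node.
resp = requests.get(
    f"{ES}/_cat/shards/gitaly*,workhorse*,pages*",  # hypothetical patterns
    params={"v": "true", "h": "index,shard,prirep,state,node"},
    timeout=10,
)
print(resp.text)

# Cap how many shards of any one index a single node may hold, so that
# losing a few nodes cannot fail most shards of that index at once.
requests.put(
    f"{ES}/_all/_settings",
    json={"index": {"routing": {"allocation": {"total_shards_per_node": 2}}}},
    timeout=10,
).raise_for_status()
```

With such a cap in place, the allocator spreads each index's shards over more nodes, at the cost of some shards staying unassigned if too few nodes are available.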
Due to the massive cluster state size, there were many timeouts on the master, which further hurt the cluster's ability to recover.
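As a rough way to observe this failure mode, the sketch below (same assumptions: hypothetical endpoint, requests package) compares the serialized cluster state size with the master's pending task queue; both _cluster/state and _cluster/pending_tasks are stock Elasticsearch APIs.

```python
# Sketch only: the endpoint is an assumption. Requires requests.
import requests

ES = "http://localhost:9200"  # hypothetical cluster endpoint

# Rough size of the serialized cluster state; a very large state makes
# every update the master publishes slower and more timeout-prone.
state = requests.get(f"{ES}/_cluster/state", timeout=60)
print(f"cluster state: ~{len(state.content) / 1024 / 1024:.1f} MiB")

# Tasks queued on the master; a growing queue with long time_in_queue
# values points at a master that cannot keep up.
tasks = requests.get(f"{ES}/_cluster/pending_tasks", timeout=10).json()["tasks"]
print(f"pending master tasks: {len(tasks)}")
for t in tasks[:5]:
    print(f"  {t['priority']:>8} {t['time_in_queue']:>8} {t['source']}")
```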
Source
Incident declared by t4cc0re in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)