OOM kills in the ES cluster
Summary
Four hot Elasticsearch nodes were OOM killed and held shards from the same indices, so indexing rates dropped for multiple indices until the nodes recovered and the backlog cleared.
Timeline
All times UTC.
2020-05-12
- 07:45 - OOM kills occur on 4 hot nodes across 2 availability zones (AZs)
- 08:10 - ES support notices there's a problem with the cluster
- 08:51 - Incident declared from Slack
- 09:15 - all 4 nodes are back to an operational state, but the cluster is still struggling to recover due to timeouts on the master
- 09:17 - backlog starts to go down rapidly
- 11:02 - backlog is cleared
Details
Four hot nodes were OOM killed and struggled to rejoin the cluster. ES support raised the severity of the support case to 1 and actively worked on the issue.
The nodes were spread across multiple zones, but unfortunately they held shards from the same indices. As a result, many shards within single indices failed at once, which lowered the indexing rate for multiple indices. The worst affected indices were gitaly, workhorse and pages; a sketch of how such shard co-location can be inspected and capped follows below.
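The following is a minimal sketch, not the commands used during the incident: the endpoint and index patterns are assumptions, and the cap value is illustrative. It uses the stock _cat/shards API to show co-location, and the index.routing.allocation.total_shards_per_node setting to limit how many shards of one index a single node may hold.

```python
# Sketch only: endpoint and index patterns are assumptions, not taken
# from the incident. Requires the requests package.
import requests

ES = "http://localhost:9200"  # hypothetical cluster endpoint

# List which node holds each shard of the affected indices, to see how
# many shards of a single index sit on the same node.
resp = requests.get(
    f"{ES}/_cat/shards/gitaly*,workhorse*,pages*",  # hypothetical patterns
    params={"v": "true", "h": "index,shard,prirep,state,node"},
    timeout=10,
)
print(resp.text)

# Cap how many shards of any one index a single node may hold, so that
# losing a few nodes cannot fail most shards of that index at once.
requests.put(
    f"{ES}/_all/_settings",
    json={"index": {"routing": {"allocation": {"total_shards_per_node": 2}}}},
    timeout=10,
).raise_for_status()
```

With such a cap in place, the allocator spreads each index's shards over more nodes, at the cost of some shards staying unassigned if too few nodes are available.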
Due to the massive cluster state size, there were many timeouts on the master, which further hurt the cluster's ability to recover.
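As a rough way to observe this failure mode, the sketch below (same assumptions: hypothetical endpoint, requests package) compares the serialized cluster state size with the master's pending task queue; both _cluster/state and _cluster/pending_tasks are stock Elasticsearch APIs.

```python
# Sketch only: the endpoint is an assumption. Requires requests.
import requests

ES = "http://localhost:9200"  # hypothetical cluster endpoint

# Rough size of the serialized cluster state; a very large state makes
# every update the master publishes slower and more timeout-prone.
state = requests.get(f"{ES}/_cluster/state", timeout=60)
print(f"cluster state: ~{len(state.content) / 1024 / 1024:.1f} MiB")

# Tasks queued on the master; a growing queue with long time_in_queue
# values points at a master that cannot keep up.
tasks = requests.get(f"{ES}/_cluster/pending_tasks", timeout=10).json()["tasks"]
print(f"pending master tasks: {len(tasks)}")
for t in tasks[:5]:
    print(f"  {t['priority']:>8} {t['time_in_queue']:>8} {t['source']}")
```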
Source
Incident declared by t4cc0re in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)